Configuring Apache Hive

BAM uses Apache Hadoop to run analytics that scale to large data volumes, and Apache Hive to specify and submit analytic jobs. Apache Hive is a data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using an SQL-like language called HiveQL.

For more information on Hive and Hadoop, refer to the following:

Hive Wiki: https://cwiki.apache.org/confluence/display/Hive/Home
HiveQL Documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Hadoop Wiki: http://wiki.apache.org/hadoop

All Hive-related configurations are in the following files:

For Linux: ${BAM_HOME}/repository/conf/advanced/hive-site.xml
For Windows: ${BAM_HOME}/repository/conf/advanced/hive-site-win.xml

Hadoop Configuration

By default, Hive submits analytic jobs to a Hadoop instance running in local mode. (For more information on Hadoop execution modes, refer to the Hadoop Wiki link given above.) To point Hive to an external Hadoop cluster, modify the configuration as follows.

Comment out or remove the following properties in the configuration file.

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>file://${CARBON_HOME}/repository/data/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>file://${CARBON_HOME}/data/hive</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>

Add the following properties to the configuration.

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
<property>
  <name>hadoop.embedded.local.mode</name>
  <value>false</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

Change the values of the mapred.job.tracker and fs.default.name properties to point to the external Hadoop JobTracker and NameNode, respectively.
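
As an illustration, the two properties might look as follows for a cluster whose NameNode and JobTracker run on separate hosts. The host names namenode.example.com and jobtracker.example.com are placeholders, and the ports 9000 and 9001 are simply carried over from the sample above; substitute the addresses of your own cluster.

```xml
<!-- Hypothetical hosts: replace with your cluster's NameNode and JobTracker addresses. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:9001</value>
</property>
```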

Configuring Hive Metastore

Hive uses an RDBMS data store to persist Hive table definitions and other metadata. By default, Hive uses an embedded H2 instance bundled with BAM to persist this data. In production setups, it is recommended to switch to a separate database instance such as MySQL or Oracle. Change the following properties to point to the external database instance.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:h2://${CARBON_HOME}/repository/database/metastore_db</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.h2.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>wso2carbon</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>wso2carbon</value>
  <description>password to use against metastore database</description>
</property>

For databases other than H2, copy the database driver to ${PRODUCT_HOME}/repository/components/lib and restart the server with the above configuration for the changes to take effect.
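
As a rough sketch, a metastore that points to MySQL might be configured along the following lines. The host dbhost.example.com, the database name metastore_db, and the credentials hiveuser/hivepassword are placeholders assumed for illustration. The class com.mysql.jdbc.Driver is the driver class of MySQL Connector/J 5.x; verify the class name against the driver version you actually install.

```xml
<!-- Hypothetical MySQL metastore: host, database name, and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost.example.com:3306/metastore_db</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <!-- Driver class for MySQL Connector/J 5.x; newer versions may use a different class name. -->
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
  <description>password to use against metastore database</description>
</property>
```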

Make sure that you transfer the existing table metadata from the current metastore database to the new database. Otherwise, Hive script execution will fail because Hive cannot find the metadata in the new database.