This section introduces you to Apache Hive query language (HQL) and how to set up databases and write Hive scripts to process and analyze the data stored in RDBMS or noSQL data bases. All Hive-related configurations in BAM are included in the following files.
- For Linux : <BAM_HOME>/repository/conf/advanced/hive-site.xml
- For Windows : <BAM_HOME>/repository/conf/advanced/hive-site-win.xml
The following topics are covered:
Table of Contents | ||||
---|---|---|---|---|
|
Hive queries in BAM
Apache Hive is a data warehouse system for Hadoop. It facilitates easy data summarization, ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto data, and query the data using an SQL-like language called HiveQL.
This is not a complete Hive learning resource. We describe here only selected Hive syntax used in WSO2 BAM. If you are a beginner, we recommend you to also see other materials on Hive, including the online Apache Hive project documentation. Given below are good starting points:
- Hive Language Manual : https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- Hive Tutorial : https://cwiki.apache.org/confluence/display/Hive/Tutorial
Business information processed by BAM is ideally stored in some RDBMS or noSQL data store. Few examples are as follows:
- RDBMS tables : For example, H2 database table, MySQL table.
- noSQL column families : For example, Cassandra column families.
We use Hive scripts to analyze collected data and derive meaningful information from them before storing in these RDBMS or noSQL data stores. But, Hive queries are only compatible with Hive tables. Therefore, before writing a Hive script on either a RDBMS table or Cassandra column family, you should create virtual Hive tables based on the actual RDBMS or noSQL data stores.
When you run a Hive script, it refers to the RDBMS table structures or column families where the processed information should be stored, and defines its own virtual tables by wrapping the existing physical tables. This virtual table structure is similar to SQL tables defined in RDBMS. For example, when Hive creates a virtual table by wrapping a MySQL table, the Hive table gets a set of fields corresponding to each data field in the MySQL table. You should individually map all the data fields you want to define in the Hive table with the MySQL table. Unmapped fields will not be available and visible in the Hive table.
The Hive engine maintains Hive tables by keeping a metadata set related to each table or column family in the actual table. Therefore, updating or deleting the actual table or column family using a third-party software will not update the metadata kept in the virtual Hive table. We recommend you to perform these operations on an SQL table. There is no need to create an SQL database using a separate tool or software, because creating a Hive table with the Hive scripts given below will automatically create the required SQL tables for you.
Creating Hive tables for various data sources
BAM uses storage handlers to connect to various data sources from Hive. Currently, there are two main storage handlers in BAM to connect to Cassandra and relational data stores. They are as followsThis section introduces you to Apache Hive query language (HQL) and how to set up databases and write Hive scripts to process and analyze the data stored in RDBMS or noSQL data bases. All Hive-related configurations in BAM are included in the following files.
- For Linux : <BAM_HOME>/repository/conf/advanced/hive-site.xml
- For Windows : <BAM_HOME>/repository/conf/advanced/hive-site-win.xml
The following topics are covered:
Table of Contents | ||||
---|---|---|---|---|
|
Hive queries in BAM
Apache Hive is a data warehouse system for Hadoop. It facilitates easy data summarization, ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto data, and query the data using an SQL-like language called HiveQL.
This is not a complete Hive learning resource. We describe here only selected Hive syntax used in WSO2 BAM. If you are a beginner, we recommend you to also see other materials on Hive, including the online Apache Hive project documentation. Given below are good starting points:
- Hive Language Manual : https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- Hive Tutorial : https://cwiki.apache.org/confluence/display/Hive/Tutorial
Business information processed by BAM is ideally stored in some RDBMS or noSQL data store. Few examples are as follows:
- RDBMS tables : For example, H2 database table, MySQL table.
- noSQL column families : For example, Cassandra column families.
We use Hive scripts to analyze collected data and derive meaningful information from them before storing in these RDBMS or noSQL data stores. But, Hive queries are only compatible with Hive tables. Therefore, before writing a Hive script on either a RDBMS table or Cassandra column family, you should create virtual Hive tables based on the actual RDBMS or noSQL data stores.
When you run a Hive script, it refers to the RDBMS table structures or column families where the processed information should be stored, and defines its own virtual tables by wrapping the existing physical tables. This virtual table structure is similar to SQL tables defined in RDBMS. For example, when Hive creates a virtual table by wrapping a MySQL table, the Hive table gets a set of fields corresponding to each data field in the MySQL table. You should individually map all the data fields you want to define in the Hive table with the MySQL table. Unmapped fields will not be available and visible in the Hive table.
The Hive engine maintains Hive tables by keeping a metadata set related to each table or column family in the actual table. Therefore, updating or deleting the actual table or column family using a third-party software will not update the metadata kept in the virtual Hive table. We recommend you to perform these operations on an SQL table. There is no need to create an SQL database using a separate tool or software, because creating a Hive table with the Hive scripts given below will automatically create the required SQL tables for you.
Creating Hive tables for various data sources
BAM uses storage handlers to connect to various data sources from Hive. Currently, there are two main storage handlers in BAM to connect to Cassandra and relational data stores. They are as follows:
...
Code Block |
---|
CREATE EXTERNAL TABLE IF NOT EXISTS ActivityDataTable (messageID STRING, sentTimestamp BIGINT, activityID STRING, version STRING, soapHeader STRING, soapBody STRING, host STRING) STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( "cassandra.hostwso2.carbon.datasource.name" = "127.0.0.1" , WSO2BAM_CASSANDRA_DATASOURCE", "cassandra.portpartitioner" = "9160" , org.apache.cassandra.dht.RandomPartitioner", "cassandra.kscf.name" = "EVENT_KSorg_wso2_bam_activity_monitoring" , "cassandra.kscolumns.usernamemapping" = "admin" , "cassandra.ks.password" = "admin" , "cassandra.cf.name" = "org_wso2_bam_activity_monitoring" , "cassandra.columns.mapping" = "::key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody, meta_host" ); |
Note the following regarding the above query:
- The new handler org.apache.hadoop.hive.cassandra.CassandraStorageHandler is used instead of the JDBC handler class.
- WITH SERDEPROPERTIES is used instead of TBLPROPERTIES command.
- The Cassandra storage handler class takes the following parameters for its SerDe properties.
- cassandra.host : Host names/IP Addresses of Cassandra nodes. You can use a comma-separated list for multiple nodes in the Cassandra ring for fail-over.
- cassandra.port : The port through which Cassandra listens to client requests.
- cassandra.ks.name : Cassandra Keyspace name. Keyspaces are logically similar to databases in RDBMS. The connection parameters, host, port, username and password are declared explicitly in cassandra.host, cassandra.port, cassandra.ks.username and cassandra.ks.password respectively. The name of the keyspace is EVENT_KS by default. To change this, edit the <keySpaceaName> element in <BAM_HOME>/repository/conf/ data-bridges/data-bridge- config.xml file.
- cassandra.ks.username : Username (username@tenant_domain if in Stratos) for authenticating Cassandra Keyspace. If no authentication is required to the Keyspace to be connected, you can skip this.
- cassandra.ks.password : Password for authenticating the Cassandra Keyspace. If no authentication is required to the Keyspace to be connected, you can skip this.
- cassandra.cf.name : Cassandra ColumnFamily name. In this example, org_wso2_bam_activity_monitoring is set as the column family name.
- cassandra.columns.mapping : Used to map the Cassandra column family keys to the Hive table fields. Should be in the same order as the Hive field definitions in CREATE TABLE. So the Hive table fields messageID, sentTimestamp, activityID, version, soapHeader, soapBody and host are mapped to the column family keys (keys of key-value pairs) by the names :key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody and meta_host. The reason is because the column family is already created, and the Hive script only creates the mapped Hive table onto the existing column family. :key is the unique row key available for each row in the Cassandra column family. You should map this field with a Hive table field in every Hive script.
...
Example 4: This sample query maps the entire ColumnFamily as key value pairs (Transposed table). It maps all column names to a field in the Hive table and corresponding column values to another. This is useful for time series data stored within Cassandra as wide rows.
Code Block | ||
---|---|---|
| ||
Hive_Schema = CREATE TABLE PhoneSalesTimeSeriesTable (itemId STRING, quarter STRING, salesFigure INT)
cassandra.columns.mapping = ":key, :column, :value" |
You can map a Cassandra super column family as follows.
Code Block | ||
---|---|---|
| ||
Hive_Schema = CREATE TABLE PhoneSalesTimeSeriesTable (itemId STRING, quarter STRING, month STRING, salesFigure INT)
cassandra.columns.mapping = ":key, :column, :subcolumn, :value" |
...
icon | false |
---|
...
pairs (Transposed table). It maps all column names to a field in the Hive table and corresponding column values to another. This is useful for time series data stored within Cassandra as wide rows.
Code Block | ||
---|---|---|
| ||
Hive_Schema = CREATE TABLE PhoneSalesTimeSeriesTable (itemId STRING, quarter STRING, salesFigure INT)
cassandra.columns.mapping = ":key, :column, :value" |
You can map a Cassandra super column family as follows.
Code Block | ||
---|---|---|
| ||
Hive_Schema = CREATE TABLE PhoneSalesTimeSeriesTable (itemId STRING, quarter STRING, month STRING, salesFigure INT)
cassandra.columns.mapping = ":key, :column, :subcolumn, :value" |
Note | ||
---|---|---|
| ||
Other than the Map data type, BAM does not support any other complex Hive data type with Cassandra Storage Handler at the moment. |
Info | |||||
---|---|---|---|---|---|
If you are using a Cassandra version above 1.2.10, you need to add the following element in Hive queries you write:
|
Using a datasource name for Cassandra storage handler in Hive scripts
...