This section introduces the Apache Hive query language (HiveQL) and explains how to set up databases and write Hive scripts to process and analyze data stored in RDBMS or NoSQL databases. All Hive-related configurations in BAM are included in the following files.
- For Linux : <BAM_HOME>/repository/conf/advanced/hive-site.xml
- For Windows : <BAM_HOME>/repository/conf/advanced/hive-site-win.xml
Hive queries in BAM
Apache Hive is a data warehouse system for Hadoop. It facilitates easy data summarization, ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto data, and query the data using an SQL-like language called HiveQL.
This is not a complete Hive learning resource; we describe only the Hive syntax used in WSO2 BAM. If you are a beginner, we recommend that you also refer to other Hive materials, including the online Apache Hive project documentation. The following are good starting points:
- Hive Language Manual : https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- Hive Tutorial : https://cwiki.apache.org/confluence/display/Hive/Tutorial
Business information processed by BAM is ideally stored in an RDBMS or NoSQL data store. A few examples are as follows:
- RDBMS tables : For example, an H2 database table or a MySQL table.
- NoSQL column families : For example, Cassandra column families.
We use Hive scripts to analyze the collected data and derive meaningful information from it before storing the results in these RDBMS or NoSQL data stores. However, Hive queries are compatible only with Hive tables. Therefore, before writing a Hive script against an RDBMS table or a Cassandra column family, you should create virtual Hive tables based on the actual RDBMS or NoSQL data stores.
When you run a Hive script, it refers to the RDBMS table structures or column families in which the processed information should be stored, and defines its own virtual tables by wrapping the existing physical tables. This virtual table structure is similar to the SQL tables defined in an RDBMS. For example, when Hive creates a virtual table by wrapping a MySQL table, the Hive table gets a field corresponding to each mapped data field of the MySQL table. You should individually map every data field you want in the Hive table to the corresponding MySQL field; unmapped fields are neither available nor visible in the Hive table.
The Hive engine maintains Hive tables by keeping a set of metadata about each underlying table or column family. Therefore, updating or deleting the actual table or column family using third-party software does not update the metadata kept for the virtual Hive table; we recommend that you perform these operations through Hive instead. There is no need to create the SQL database with a separate tool, because creating a Hive table with the Hive scripts given below automatically creates the required SQL tables for you.
Creating Hive tables for various data sources
BAM uses storage handlers to connect to various data sources from Hive. Currently, there are two main storage handlers in BAM: a Cassandra storage handler for Cassandra and a JDBC storage handler for relational data stores.
When Hive tables are created, you can specify data source connection parameters in the CREATE TABLE query as Hive SERDEPROPERTIES (for NoSQL) or TBLPROPERTIES (for RDBMS). Let's take a look at a few examples of each.
Creating a Hive table for Cassandra
In NoSQL databases like Cassandra, column families are logically similar to tables in an RDBMS, and a set of key-value pairs is similar to the data fields of an RDBMS row. The rows of a single column family can contain an arbitrary number of keys, so you should select the set of keys to map to the fields of the virtual Hive table. BAM uses the Cassandra storage handler to connect to Cassandra databases from Hive; you specify it with the org.apache.hadoop.hive.cassandra.CassandraStorageHandler class in the STORED BY clause. Let's look at a few examples.
Example 1: This is a basic example query that creates a virtual Hive table named ActivityDataTable corresponding to a physical Cassandra column family.
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS ActivityDataTable
(messageID STRING, sentTimestamp BIGINT, activityID STRING, version STRING, soapHeader STRING, soapBody STRING, host STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "org_wso2_bam_activity_monitoring",
"cassandra.columns.mapping" =
":key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody, meta_host" );
```
Note the following regarding the above query:
- The new handler org.apache.hadoop.hive.cassandra.CassandraStorageHandler is used instead of the JDBC handler class.
- WITH SERDEPROPERTIES is used instead of the TBLPROPERTIES clause.
- The Cassandra storage handler class takes the following parameters for its SerDe properties.
- cassandra.host : Host names/IP Addresses of Cassandra nodes. You can use a comma-separated list for multiple nodes in the Cassandra ring for fail-over.
- cassandra.port : The port through which Cassandra listens to client requests.
- cassandra.ks.name : Cassandra keyspace name. Keyspaces are logically similar to databases in an RDBMS. The connection parameters host, port, username and password are declared explicitly through cassandra.host, cassandra.port, cassandra.ks.username and cassandra.ks.password respectively. The keyspace name is EVENT_KS by default. To change it, edit the <keySpaceName> element in the <BAM_HOME>/repository/conf/data-bridges/data-bridge-config.xml file.
- cassandra.ks.username : Username (username@tenant_domain if in Stratos) used to authenticate to the Cassandra keyspace. If the keyspace does not require authentication, you can omit this property.
- cassandra.ks.password : Password used to authenticate to the Cassandra keyspace. If the keyspace does not require authentication, you can omit this property.
- cassandra.cf.name : Cassandra ColumnFamily name. In this example, org_wso2_bam_activity_monitoring is set as the column family name.
- cassandra.columns.mapping : Maps the Cassandra column family keys to the Hive table fields. The mapping must be in the same order as the Hive field definitions in the CREATE TABLE statement. Here, the Hive table fields messageID, sentTimestamp, activityID, version, soapHeader, soapBody and host are mapped to the column family keys (the keys of the key-value pairs) :key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody and meta_host. This is because the column family already exists, and the Hive script only creates a mapped Hive table on top of it. :key is the unique row key available for each row in the Cassandra column family; you should map it to a Hive table field in every Hive script. A short illustrative query over this table is shown below.
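Once this virtual table is defined, you can run ordinary HiveQL against it. The following query is a minimal sketch (it is not part of the BAM samples) that counts monitored messages per host using the fields defined above:

```sql
-- Illustrative query over the virtual Hive table created in Example 1
SELECT host, COUNT(messageID) AS messageCount
FROM ActivityDataTable
GROUP BY host;
```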
Example 2: The query below also maps a set of column family columns to a Hive table. The column family row key is mapped to orderID; :key is a special keyword in the column mapping and refers to the column family row key.
```sql
Hive_Schema = CREATE TABLE PhoneSalesTable (orderID STRING, brandName STRING, userName STRING, quantity INT, version STRING)
cassandra.columns.mapping = ":key, payload_brand, payload_user, payload_quantity, Version"
```
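The Hive_Schema and cassandra.columns.mapping lines above show only the field list and the column mapping. Assembled into a complete statement, the query could look like the following sketch; the connection properties are copied from Example 1, and the column family name org_wso2_bam_phone_sales is an assumption used purely for illustration:

```sql
-- Sketch only: connection properties and the column family name are assumptions
CREATE EXTERNAL TABLE IF NOT EXISTS PhoneSalesTable
(orderID STRING, brandName STRING, userName STRING, quantity INT, version STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.cf.name" = "org_wso2_bam_phone_sales",
"cassandra.columns.mapping" = ":key, payload_brand, payload_user, payload_quantity, Version" );
```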
Example 3: This sample query maps both static and variable columns of a column family to a Hive table. The map: identifier specifies that all columns in a Cassandra row that do not belong to the fixed fields in the column mapping are collected into a property map with string keys and values. This works even if mixed data types are present within the variable properties.
You can use any name that ends in a colon instead of the map: identifier; for example, propertyBag: works equally well.
```sql
Hive_Schema = CREATE TABLE PhoneSalesTable (orderID STRING, brandName STRING, userName STRING, quantity INT, version STRING, properties map<string,string>)
cassandra.columns.mapping = ":key, payload_brand, payload_user, payload_quantity, Version, map:"
```
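Once such a table is defined, the variable properties can be read with ordinary Hive map indexing. The query below is only a sketch, and the property name color is purely illustrative:

```sql
-- Illustrative: read a single variable property from the map field
SELECT orderID, properties['color'] AS color
FROM PhoneSalesTable
WHERE properties['color'] IS NOT NULL;
```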
Warning: Hive loads the entire row into memory to discover the contents of the map. This can cause out-of-memory issues in the Hive jobs. Therefore, we do not recommend using the Map data type if the column family is based on wide rows.
Example 4: This sample query maps the entire column family as key-value pairs (a transposed table). It maps all column names to one field of the Hive table and the corresponding column values to another. This is useful for time-series data stored in Cassandra as wide rows.
```sql
Hive_Schema = CREATE TABLE PhoneSalesTimeSeriesTable (itemId STRING, quarter STRING, salesFigure INT)
cassandra.columns.mapping = ":key, :column, :value"
```
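Because each column name/value pair becomes a separate row of the transposed Hive table, standard aggregation works directly over it. The following query is an illustrative sketch that totals the sales figures per item:

```sql
-- Illustrative aggregation over the transposed time-series table
SELECT itemId, SUM(salesFigure) AS totalSales
FROM PhoneSalesTimeSeriesTable
GROUP BY itemId;
```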
You can map a Cassandra super column family as follows.
```sql
Hive_Schema = CREATE TABLE PhoneSalesTimeSeriesTable (itemId STRING, quarter STRING, month STRING, salesFigure INT)
cassandra.columns.mapping = ":key, :column, :subcolumn, :value"
```
Note: Other than the Map data type, BAM does not currently support any other complex Hive data type with the Cassandra storage handler.
Info: If you are using a Cassandra version above 1.2.10, you need to add the cassandra.partitioner property to the SERDEPROPERTIES of the Hive queries you write, setting it to the partitioner class used by your Cassandra cluster (for example, "cassandra.partitioner" = "org.apache.cassandra.dht.RandomPartitioner").
Using a datasource name for Cassandra storage handler in Hive scripts
...
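As a minimal sketch, a Cassandra-backed table definition that relies on a Carbon data source instead of explicit host, port and credential properties can look like the following. It assumes a data source named WSO2BAM_CASSANDRA_DATASOURCE is configured in BAM and reuses the column family and mapping from Example 1; the partitioner value must match the partitioner used by your Cassandra cluster:

```sql
-- Sketch: data source name, partitioner and column family are assumptions for your setup
CREATE EXTERNAL TABLE IF NOT EXISTS ActivityDataTable
(messageID STRING, sentTimestamp BIGINT, activityID STRING, version STRING, soapHeader STRING, soapBody STRING, host STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"wso2.carbon.datasource.name" = "WSO2BAM_CASSANDRA_DATASOURCE",
"cassandra.partitioner" = "org.apache.cassandra.dht.RandomPartitioner",
"cassandra.cf.name" = "org_wso2_bam_activity_monitoring",
"cassandra.columns.mapping" =
":key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody, meta_host" );
```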
- 'wso2.carbon.datasource.name' = 'WSO2BAM_DATASOURCE' : The data source name provides connection-related information such as the username, password and port. WSO2BAM_DATASOURCE is an H2 data source that comes with BAM by default; it is defined in the <BAM_HOME>/repository/conf/datasources/bam-datasources.xml file. Alternatively, you can define the database connection parameters explicitly. We discuss the latter approach in Method 2 below.
- 'hive.jdbc.update.on.duplicate' = 'true' : Used to overwrite existing data rows in the table when it is updated with an entry that has the same primary key.
- 'hive.jdbc.primary.key.fields' = 'messageRowID' : Used to set the primary key fields of the Hive table. In this example, only messageRowID is set as the primary key, but you can set more than one key field, separated by commas. For example, 'hive.jdbc.primary.key.fields' = 'primarykey1,primarykey2,primarykey3'.
- 'hive.jdbc.table.create.query' = 'CREATE TABLE... : This is the only place where you use real SQL commands that run in the actual RDBMS. The SQL command that creates the H2 table in this example is: CREATE TABLE ActivitySummary (messageRowID VARCHAR(100) NOT NULL PRIMARY KEY, sentTimestamp BIGINT, bamActivityID VARCHAR(40), soapHeader TEXT, soapBody TEXT, host VARCHAR(25)). A complete statement combining these properties is sketched below.
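Putting these properties together, a Hive table definition for the RDBMS case might look like the following sketch. The JDBC storage handler class name and the Hive field names are assumptions made here for illustration; the TBLPROPERTIES values are the ones described in the list above:

```sql
-- Sketch: the handler class and Hive field names are assumptions, not taken verbatim from the BAM samples
CREATE EXTERNAL TABLE IF NOT EXISTS ActivitySummaryTable
(messageRowID STRING, sentTimestamp BIGINT, bamActivityID STRING, soapHeader STRING, soapBody STRING, host STRING)
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
'wso2.carbon.datasource.name' = 'WSO2BAM_DATASOURCE',
'hive.jdbc.update.on.duplicate' = 'true',
'hive.jdbc.primary.key.fields' = 'messageRowID',
'hive.jdbc.table.create.query' = 'CREATE TABLE ActivitySummary (messageRowID VARCHAR(100) NOT NULL PRIMARY KEY, sentTimestamp BIGINT, bamActivityID VARCHAR(40), soapHeader TEXT, soapBody TEXT, host VARCHAR(25))' );
```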
...
The previous section describes how to define a data source for the Hive table in the bam-datasources.xml file. Method 2 here describes how to define the RDBMS connection parameters inline in the Hive script as properties.
...