Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

This section introduces you to Apache Hive query language (HQL) and how to set up databases and write Hive scripts to process and analyze the data stored in RDBMS or noSQL data bases. All Hive-related configurations in BAM are included in the following files.

...

 

Code Block
CREATE EXTERNAL TABLE IF NOT EXISTS ActivityDataTable
    (messageID STRING, sentTimestamp BIGINT, activityID STRING, version STRING, soapHeader STRING, soapBody STRING, host STRING)
   
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
   
WITH SERDEPROPERTIES (
   
"cassandra.host" = "127.0.0.1" ,
    "cassandra.portwso2.carbon.datasource.name" = "9160WSO2BAM_CASSANDRA_DATASOURCE" ,
    "cassandra.ks.namepartitioner" = "EVENT_KS" ,
    "cassandra.ks.username" = "admin" ,
    "cassandra.ks.password" = "admin" ,
   org.apache.cassandra.dht.RandomPartitioner",
"cassandra.cf.name" = "org_wso2_bam_activity_monitoring" ,
   
"cassandra.columns.mapping" =
   
":key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody, meta_host" );

Note the following regarding the above query:

  • The new handler org.apache.hadoop.hive.cassandra.CassandraStorageHandler is used instead of the JDBC handler class.
  • WITH SERDEPROPERTIES is used instead of TBLPROPERTIES command.
  • The Cassandra storage handler class takes the following parameters for its SerDe properties.
    • cassandra.host : Host names/IP Addresses of Cassandra nodes. You can use a comma-separated list for multiple nodes in the Cassandra ring for fail-over.
    • cassandra.port : The port through which Cassandra listens to client requests.
    • cassandra.ks.name : Cassandra Keyspace name. Keyspaces are logically similar to databases in RDBMS. The connection parameters, host, port, username and password are declared explicitly in cassandra.host, cassandra.port, cassandra.ks.username and cassandra.ks.password respectively. The name of the keyspace is EVENT_KS by default. To change this, edit the <keySpaceaName> element in <BAM_HOME>/repository/conf/ data-bridges/data-bridge- config.xml file.  
    • cassandra.ks.username : Username (username@tenant_domain if in Stratos) for authenticating Cassandra Keyspace. If no authentication is required to the Keyspace to be connected, you can skip this.
    • cassandra.ks.password : Password for authenticating the Cassandra Keyspace. If no authentication is required to the Keyspace to be connected, you can skip this.
    • cassandra.cf.name : Cassandra ColumnFamily name. In this example, org_wso2_bam_activity_monitoring is set as the column family name.
    • cassandra.columns.mapping : Used to map the Cassandra column family keys to the Hive table fields. Should be in the same order as the Hive field definitions in CREATE TABLE. So the Hive table fields messageID, sentTimestamp, activityID, version, soapHeader, soapBody and host are mapped to the column family keys (keys of key-value pairs) by the names :key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody and meta_host. The reason is because the column family is already created, and the Hive script only creates the mapped Hive table onto the existing column family. :key is the unique row key available for each row in the Cassandra column family. You should map this field with a Hive table field in every Hive script.

...

The script given below is based on an H2 table. You can also find it in KPI Phone Retail Store sample of WSO2 BAM.

 

Code Block
CREATE EXTERNAL TABLE IF NOT EXISTS UserTable(
    name STRING, totalOrders INT, totalQuantity INT) 
    STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
    TBLPROPERTIES ( 
    'mapred.jdbc.driver.class' = 'org.h2.Driver' , 
    'mapred.jdbc.url' = 'jdbc:h2:repository/database/samples/WSO2CARBON_DB;AUTO_SERVER=TRUE' , 
    'mapred.jdbc.username' = 'wso2carbon' , 
    'mapred.jdbc.password' = 'wso2carbon' , 
    'hive.jdbc.update.on.duplicate' = 'true' , 
    'hive.jdbc.primary.key.fields' = 'name' , 
    'hive.jdbc.table.create.query' = 
    'CREATE TABLE UserSummary (name VARCHAR(100) NOT NULL PRIMARY KEY,
     totalOrders  INT, totalQuantity INT)' );

Note the following regarding the above query.

...