This site contains the documentation that is relevant to older WSO2 product versions and offerings.
For the latest WSO2 documentation, visit https://wso2.com/documentation/.

Configuring Data Analyzer Cluster

The data analyzer component of each BAM node uses Hive query language scripts to retrieve data from the Cassandra cluster, process it into meaningful information, and save the results in an RDBMS. This example uses MySQL as the RDBMS; WSO2 BAM ships with an H2 database by default, but H2 is not recommended for high-volume, production settings. In this setup, the analyzer components in node1 and node2 are clustered, and the data processing work is extended to an external Apache Hadoop cluster.

The data analyzer cluster uses the registry to store metadata related to Hive scripts and scheduled tasks, and uses Hazelcast to handle the coordination required by the nodes in the analyzer cluster when running Hive scripts. These settings provide high availability through a failover mechanism: if one node fails, the remaining nodes take over its load and complete the task. The diagram below depicts this setup:

 

The BAM nodes in the analyzer cluster are used for three main purposes:

  • Periodically submitting analytics queries to the Hadoop cluster as scheduled
  • Receiving data from data agents and persisting it to the Cassandra cluster
  • Hosting end-user dashboards

The following steps explain how to configure the analyzer cluster. These configurations are done on the analyzer nodes; the instructions in this section assume that node 1 and node 2 are the data analyzer nodes.

Unless a step states otherwise, perform it on both node 1 and node 2.

  1. Download and extract WSO2 BAM to both analyzer nodes.
  2. Download the MySQL connector .jar file and place it inside the <BAM_HOME>/repository/components/lib folder.
  3. Add the following datasource configuration to the <BAM_HOME>/repository/conf/datasources/master-datasources.xml file. Be sure to change the database URL and credentials according to your environment. In this example, the shared registry uses the WSO2_REG_DB database.

    <datasource>
        <name>WSO2_REG_DB</name>
        <description>The datasource used for config</description>
        <jndiConfig>
            <name>jdbc/WSO2RegDB</name>
        </jndiConfig>
        <definition type="RDBMS">
            <configuration>
                <url>jdbc:mysql://[host]:[port]/[reg-db]</url>
                <username>reg_user</username>
                <password>password</password>
                <driverClassName>com.mysql.jdbc.Driver</driverClassName>
                <maxActive>50</maxActive>
                <maxWait>60000</maxWait>
                <testOnBorrow>true</testOnBorrow>
                <validationQuery>SELECT 1</validationQuery>
                <validationInterval>30000</validationInterval>
            </configuration>
        </definition>
    </datasource>
  4. Add the following to the <BAM_HOME>/repository/conf/registry.xml file. These are mounting configurations that allow both nodes to share the registry.

    <dbConfig name="wso2GReg">
        <dataSource>jdbc/WSO2RegDB</dataSource>
    </dbConfig>

    <remoteInstance url="https://localhost:9443/registry">
        <id>registryInstance</id>
        <dbConfig>wso2GReg</dbConfig>
        <readOnly>false</readOnly>
        <registryRoot>/</registryRoot>
        <cacheId>reg_user@jdbc:mysql://[host]:[port]/[reg-db]</cacheId>
    </remoteInstance>

    <mount path="/_system/config" overwrite="true">
        <instanceId>registryInstance</instanceId>
        <targetPath>/_system/config</targetPath>
    </mount>

    <mount path="/_system/governance" overwrite="true">
        <instanceId>registryInstance</instanceId>
        <targetPath>/_system/governance</targetPath>
    </mount>

    The registry is now mounted and shared by both nodes.

  5. To create the registry schema, execute the <BAM_HOME>/dbscripts/mysql.sql script as reg_user against the MySQL reg-db database. This needs to be done on only one node, as the registry is now shared.

    Alternatively you could just use the following startup script to create the required tables (if they are not created already). Note that this also needs to be done only in one node.

    sh wso2server.sh -Dsetup

    On Windows, use wso2server.bat -Dsetup instead. While the server is starting up, you can also verify that the registry has been mounted properly.

  6. Edit the <BAM_HOME>/repository/conf/axis2/axis2.xml file and enable clustering as follows. This must be done on both nodes.
    <clustering class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent" enable="true">
  7. In the above clustering configuration, make sure the following properties are also configured correctly.
    1. membershipScheme - This indicates the cluster membership scheme being used. Set it to "multicast".
    2. localMemberHost - The host name or IP address of the member. Set it to the relevant host name of the machine (e.g., node1).
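    Putting steps 6 and 7 together, the relevant part of the axis2.xml clustering section looks roughly like the following. This is a sketch: the domain value shown is only an example, and the other clustering parameters already present in the file should be left unchanged.

```xml
<clustering class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent" enable="true">
    <!-- Cluster membership scheme, as described above -->
    <parameter name="membershipScheme">multicast</parameter>
    <!-- All members of one cluster must share the same domain;
         "wso2.bam.domain" is an example value -->
    <parameter name="domain">wso2.bam.domain</parameter>
    <!-- Host name or IP address of this member; on the other node,
         set this to that machine's host name (e.g., node2) -->
    <parameter name="localMemberHost">node1</parameter>
</clustering>
```

    On node 2, only localMemberHost changes; membershipScheme and the domain must be identical on both nodes so that the members can discover each other over multicast.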
  8. Add the following to the tasks-config.xml file, which is in the <BAM_HOME>/repository/conf/etc/ directory on the analyzer nodes.

    <taskServerMode>CLUSTERED</taskServerMode> 
    <taskServerCount>2</taskServerCount>

    About the task server count

    This value indicates the number of task servers running in the cluster along with the analyzer nodes.

    The task server count governs analyzer node startup: each analyzer node holds its startup until the number of servers specified in the taskServerCount property have started. Only when that count is reached does startup continue to completion on all of those servers.

    Startup is held so that the other analyzers can also join in when the tasks (Hive script jobs) are scheduled. That way, the scripts are shared among all available analyzers, rather than all being scheduled initially on the first server that starts up.

  9. The following configuration must be done if you want to change the database used to store Hive script metadata. Modify the <BAM_HOME>/repository/conf/advanced/hive-site.xml file as follows; a line has been added to the hive.aux.jars.path property to include the MySQL connector JAR in the Hadoop job execution runtime. Windows users must use the <BAM_HOME>/repository/conf/advanced/hive-site-win.xml file instead.

    Additional details and recommendations

    By default, Hive metadata is stored in an H2 database; these steps store it in MySQL instead, as appropriate for this scenario. While this step is not mandatory, using a separate database instance such as MySQL or Oracle as the Hive metastore is recommended for production environments. See Configuring a Metadata Store for Hive for more information.


    <property>
       <name>hadoop.embedded.local.mode</name>
       <value>false</value>
    </property>
     
    <property>
       <name>hive.metastore.warehouse.dir</name>
       <value>/user/hive/warehouse</value>
       <description>location of default database for the warehouse</description>
    </property>
     
    <property>
       <name>fs.default.name</name>
       <value>hdfs://node1:9000</value>
    </property>
     
    <property>
       <name>mapred.job.tracker</name>
       <value>node1:9001</value>
    </property>
     
    <property>
       <name>hive.aux.jars.path</name>
       <value>file://${CARBON_HOME}/repository/components/plugins/apache-cassandra_1.2.13.wso2v2.jar,file://${CARBON_HOME}/repository/components/plugins/guava_12.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/json_2.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-dbcp_1.4.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-pool_1.5.6.wso2v1.jar,file://${CARBON_HOME}/repository/components/lib/mysql-connector-java-5.1.5-bin.jar</value>
    </property>
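    If you point the Hive metastore at MySQL as described above, the metastore connection itself is configured through Hive's standard javax.jdo.option.* properties in the same hive-site.xml file. The following is a sketch only; the host, port, database name, and credentials are placeholders that you must replace with values for your environment.

```xml
<!-- Hive metastore database connection (placeholder values) -->
<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://[host]:[port]/[metastore-db]?createDatabaseIfNotExist=true</value>
</property>

<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
</property>

<property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <!-- Placeholder credentials; replace with your metastore user -->
   <value>metastore_user</value>
</property>

<property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>password</value>
</property>
```

    See Configuring a Metadata Store for Hive for the full set of supported metastore properties.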


  10. Add the following configuration for WSO2BAM_DATASOURCE to the <BAM_HOME>/repository/conf/datasources/bam-datasources.xml file of both analyzer nodes. Be sure to change the database URL and credentials according to your environment. WSO2BAM_DATASOURCE is the default datasource available in BAM, and it should point to the database you are using. This example uses the bam-db database to store BAM summary data.

    Note that if you are using BAM 2.4.0 rather than BAM 2.4.1, this configuration goes in the <BAM_HOME>/repository/conf/datasources/master-datasources.xml file instead.

    <datasource>
        <name>WSO2BAM_DATASOURCE</name>
        <description>The datasource used for analyzer data</description>
        <definition type="RDBMS">
            <configuration>
                <url>jdbc:mysql://[host]:[port]/[bam-db]</url>
                <username>root</username>
                <password>admin</password>
                <driverClassName>com.mysql.jdbc.Driver</driverClassName>
                <maxActive>50</maxActive>
                <maxWait>60000</maxWait>
                <testOnBorrow>true</testOnBorrow>
                <validationQuery>SELECT 1</validationQuery>
                <validationInterval>30000</validationInterval>
            </configuration>
        </definition>
    </datasource>
  11. If you are using BAM 2.4.1, start the BAM server on both analyzer nodes, and use the Deployment Synchronizer to specify one node as a read/write node and the other as a read-only node.

    Tip: There is no worker/manager separation in a BAM cluster, although the topic on the SVN-based deployment synchronizer mentions worker and manager configurations. Read the manager and worker nodes mentioned there as node 1 and node 2.

    Additional instructions and points to note

    When starting the BAM instances, use the disable.cassandra.server.startup property to stop the Cassandra server that is bundled with BAM from running by default, since the nodes must point to the external Cassandra cluster.

    sh wso2server.sh -Ddisable.cassandra.server.startup=true

    For BAM 2.4.0, or for setups without SVN, remove the BAM Toolbox Deployer feature from node 2 using the Feature Manager. This feature is removed because having deployers on both analyzer BAM nodes interferes with proper Hive task failover. The BAM Toolbox Deployer feature is left on node 1 so that it can copy the relevant files to the target location and schedule the Hive script.

    In addition to this, BAM 2.4.1 gives you the option of disabling certain BAM components. See here for more information.

    You may also use the following to disable notifications.

    -Ddisable.notification.task