Fully-Distributed, High-Availability BAM Setup

Shown in the diagram below is an example deployment of a fully-distributed, high-availability BAM setup. In this diagram, several data agents pass data to the BAM setup for analysis and summarization. For demonstration purposes, we have used two BAM nodes in this setup but you can extend it to add as many nodes as needed.

WSO2 BAM has three main components: data gathering, analysis, and presentation. Each component is explained in detail in About BAM. We can take each of these components in the two BAM nodes and deploy them in separate clusters as shown in the diagram below:

Figure1: Fully-distributed BAM clustered setup

This setup persists and processes data in a distributed manner, achieving high scalability and high availability in the data collection, summarization and presentation layers. Given below is an explanation of each component in the above diagram.

Data receiver cluster

This includes data agents, such as WSO2 Application Server (or any other service-hosting product) and WSO2 ESB, that capture data and transfer it to subscribed storage units. The default storage unit is the Cassandra instance that comes with WSO2 BAM. You can set up a Cassandra cluster as well. You can also set up data agents to talk to multiple data receivers in a load-balanced manner to ensure high availability. This way, if one receiver node fails, data agents can still transfer data through the other nodes. The following diagram depicts a common data receiver clustering pattern:

Figure2: Data receiver cluster

Cassandra cluster

Data that comes to BAM through data receivers is usually stored in the default Cassandra database. Figure1 above shows how the Cassandra databases of both BAM nodes are deployed in a cluster. This ensures that even if one node fails, data can still be received and stored in the other databases in the cluster. It also ensures that the data remains highly available for the Hive scripts to run on.

Data analyzer cluster

The data analyzer component of each BAM node uses Hive query language scripts to retrieve data from the Cassandra cluster, process the data into meaningful information, and save the information in an RDBMS. In this example, we use MySQL as the RDBMS. You get an H2 database with WSO2 BAM by default, but it is not recommended in a high-volume production setting. The analyzer components in node1 and node2 are clustered in this setup, and the data processing itself is delegated to an external Apache Hadoop cluster.

Hadoop cluster

WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework. Hadoop facilitates scaling BAM to handle large data volumes, and BAM uses Apache Hive for creating and executing analytic jobs. By default, Hive submits analytic jobs to a Hadoop instance running in local mode. However, you can set up a multi-node Hadoop cluster externally and point Hive to it.

Although BAM uses Apache Hadoop's MapReduce technology underneath, you do not have to write complex Hadoop jobs to process data. BAM decouples you from these underlying complexities and enables you to write data processing queries in the SQL-like Hive query language. Hive provides the right level of abstraction from Hadoop while internally submitting the analytic jobs to Hadoop. It either spawns a Hadoop JVM internally or delegates to a Hadoop cluster.

Zookeeper cluster

Apache Zookeeper manages the coordination required by the nodes in the Analyzer cluster when running Hive scripts. Zookeeper can run separately or embedded within BAM. In this example setup, we have clustered three Zookeeper instances, one each on node3, node4 and node5.

Presentation cluster

The presentation layer of BAM consists of dashboards, reports, gadgets and other user interfaces. They can be set up in a distributed manner as depicted in the following figure:

Figure3: Scaled up dashboards

Follow the steps below to configure this deployment.

Before you start, note the following:

The steps below are for a Linux environment.
This example uses the hostnames node1, node2, node3, node4 and node5 for the nodes, corresponding to Figure1 above.
Be sure to configure the host entries in all nodes.
This example uses Hadoop version 1.0.4, Zookeeper 3.3.6 and MySQL 5.1.
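
The host entries mentioned above could look like the output of the following sketch. The IP addresses are assumptions for illustration only, chosen to agree with the 192.168.4.101 address used for node2 in the SSH example later in this guide. Replace them with the addresses of your own machines.

```shell
#!/bin/sh
# Print /etc/hosts entries for the five nodes in this example.
# ASSUMPTION: the 192.168.4.x addresses are illustrative; use your real IPs.
# print_host_entries is a hypothetical helper name.
print_host_entries() {
  base="192.168.4"
  i=1
  while [ "$i" -le 5 ]; do
    # node1 -> .100, node2 -> .101, ..., node5 -> .104
    echo "${base}.$((99 + i)) node${i}"
    i=$((i + 1))
  done
}

print_host_entries
```

To apply the entries, append the output to /etc/hosts as root on every node, for example with print_host_entries >> /etc/hosts.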

Configuring data receivers

Data receiver clustering is handled from the client side. For instructions on setting up data agents to talk to multiple data receivers, see Setting up Multi Receiver and Load Balancing Data Agent.

Configuring the Hadoop cluster

Let's see how to configure the Hadoop cluster. Execute the following steps in all nodes in the BAM deployment unless otherwise specified.

Install Java in a location that all the user groups can access. For example, /opt/java/jdk1.6.0_29.
Install rsync using apt-get in order to copy the Hadoop configurations across all nodes.
Create a user by the name hadoop with the command: useradd -m hadoop
Log in as user hadoop using the command: su - hadoop

Key exchange for Passphraseless SSH
We need passwordless (passphraseless) SSH to communicate with the other Hadoop nodes in the cluster. To establish an SSH connection with another node from node1, use the command: ssh hadoop@node2. To avoid this command requesting a password, set up an SSH key exchange among the Hadoop nodes.
Generate a key for the name node (node1) using the ssh-keygen command. It creates an .ssh directory inside the home directory of the user hadoop.
Inside the generated .ssh directory, there is a file named id_rsa.pub that holds the public key. Copy this file from node1 to the other Hadoop nodes (node2 to node5) by executing the following commands from the .ssh directory, so that the key can then be appended to the authorized_keys file on each node.

scp id_rsa.pub hadoop@node2:/home/hadoop
scp id_rsa.pub hadoop@node3:/home/hadoop
scp id_rsa.pub hadoop@node4:/home/hadoop
scp id_rsa.pub hadoop@node5:/home/hadoop
Log in to the hadoop user account of the second Hadoop node (node2) and establish an SSH connection to another node from it, using the command: ssh hadoop@node3. This creates the .ssh directory in the hadoop account.
Append the copied public key to the authorized_keys file in the hadoop account of node2 by executing the following commands.

cat /home/hadoop/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
chown hadoop:hadoop /home/hadoop/.ssh/authorized_keys
chmod 600 /home/hadoop/.ssh/authorized_keys
Now you can SSH to node2 from node1 without a password prompt. Log in to the master node (node1). From the hadoop account, log in to node2 using either of the commands:
ssh -i id_rsa hadoop@192.168.4.101
ssh hadoop@192.168.4.101
If you still cannot establish an SSH connection to node2 without a password, run the following commands on node2.
```
cd ~
cd .ssh
chmod og-rw authorized_keys
chmod a-x authorized_keys
cd ~
chmod 700 .ssh
cd /home
chmod go-wrx hadoop
```
Carry out steps 4 to 6 on all other nodes as well (node3 to node5).
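
The key exchange above can be repeated for every node with a small loop. The following is a minimal sketch run from the hadoop account on node1, assuming the default key file ~/.ssh/id_rsa.pub created by ssh-keygen. The names NODES, DRY_RUN, run and distribute_key are illustrative and not part of Hadoop or BAM. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Distribute node1's public key to the other Hadoop nodes.
# ASSUMPTIONS: NODES, DRY_RUN, run and distribute_key are illustrative names;
# the key is the default ~/.ssh/id_rsa.pub created by ssh-keygen.
NODES="node2 node3 node4 node5"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

distribute_key() {
  for node in $NODES; do
    # Copy the public key to the remote hadoop home directory.
    run scp "$HOME/.ssh/id_rsa.pub" "hadoop@${node}:/home/hadoop/"
    # Append (>>) rather than overwrite, then restrict permissions.
    run ssh "hadoop@${node}" 'mkdir -p ~/.ssh && cat ~/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
    # BatchMode makes ssh fail instead of prompting, so this verifies the setup.
    run ssh -o BatchMode=yes "hadoop@${node}" true
  done
}

distribute_key
```

Set DRY_RUN=0 to execute the commands. The scp and the first ssh call still prompt for the hadoop password of each node, because the key is not yet authorized at that point.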

Configuring the master node
Define JAVA_HOME in <HADOOP_HOME>/conf/hadoop-env.sh file: export JAVA_HOME=/opt/java/jdk1.6.0_29. <HADOOP_HOME> refers to the path to Hadoop installation directory throughout this guide.

Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:

<property>
   <name>fs.default.name</name>
   <value>hdfs://node1:9000</value>
</property>

<property>
   <name>fs.hdfs.impl</name>
   <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
   <description>The FileSystem for hdfs: uris.</description>
</property>

<property>
   <name>hadoop.tmp.dir</name>
   <value>/tmp/hadoop</value>
</property>

Edit the <HADOOP_HOME>/conf/hdfs-site.xml file as follows:

<configuration>
   <property>
       <name>dfs.replication</name>
       <value>1</value>
   </property>
   
   <property>
      <name>dfs.name.dir</name>
      <value><HADOOP_HOME>/dfs/name</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value><HADOOP_HOME>/dfs/data</value>
   </property>
</configuration>

Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:

<configuration>
   <property>
      <name>mapred.job.tracker</name>
      <value>node1:9001</value>
   </property>

   <property>
      <name>mapred.system.dir</name>
      <value><HADOOP_HOME>/mapred/system</value>
   </property>
</configuration>

Edit <HADOOP_HOME>/conf/masters as follows:
In node2, edit <HADOOP_HOME>/conf/hadoop-policy.xml as follows. It enables write access for hadoop user to Hadoop nodes.
```
<property>
   <name>security.job.submission.protocol.acl</name>
   <value>hadoop</value>
</property>
```
Be sure to change the <HADOOP_HOME> section that appears in the above configurations with the actual path to the Hadoop installation directory.
Edit <HADOOP_HOME>/conf/slaves as follows, with one hostname per line:
node3
node4
node5
Syncing Hadoop configurations across all nodes
Log in to the hadoop account of the master Hadoop node. From the Hadoop installation directory, execute the following command to propagate the Hadoop configurations and binaries to node2: rsync -a -e ssh . hadoop@node2:/home/hadoop/hadoop
Remove the entries from the <HADOOP_HOME>/conf/masters and <HADOOP_HOME>/conf/slaves files in node2.
Be sure to repeat steps 18 and 19 above in all other nodes (node3 to node5).
From the master node's Hadoop installation directory, execute the following command to format the namenode: bin/hadoop namenode -format
From <HADOOP_HOME>/bin in the master node, run sh start-all.sh to start the name node. All nodes should be started simultaneously.
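
The configuration files above contain the literal placeholder <HADOOP_HOME>, which has to be replaced with the real installation path before Hadoop is started. A minimal sketch of doing this in one pass is shown below. The helper name fill_hadoop_home is hypothetical, and the sketch assumes the installation path contains no | or & characters, which have a special meaning in the sed expression.

```shell
#!/bin/sh
# Replace the literal <HADOOP_HOME> placeholder in Hadoop's *-site.xml files.
# ASSUMPTION: fill_hadoop_home is an illustrative helper, not a Hadoop tool.
fill_hadoop_home() {
  hadoop_home="$1"   # absolute path to the Hadoop installation directory
  for f in "$hadoop_home"/conf/*-site.xml; do
    [ -f "$f" ] || continue
    # Use | as the sed delimiter because the path contains slashes.
    sed "s|<HADOOP_HOME>|${hadoop_home}|g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
}

# Example (path assumed from the rsync target used in this guide):
# fill_hadoop_home /home/hadoop/hadoop
```

Run the helper on the master node before propagating the configuration with rsync, so that every node receives the resolved paths.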

Configuring the Cassandra cluster

Before you start, increase the heap memory size of the BAM nodes to at least 2 GB and synchronize the system clocks of all nodes.

Add the following configurations to the <BAM_HOME>/repository/conf/etc/cassandra.yaml file in the nodes mentioned below.

To node3:

cluster_name: Test Cluster
initial_token: 0
seed_provider:
       - seeds: "node3"
listen_address: node3
rpc_address: node3
rpc_port: 9160

To node4:

cluster_name: Test Cluster
initial_token: 56713727820156410577229101238628035242
seed_provider:
       - seeds: "node3"
listen_address: node4
rpc_address: node4
rpc_port: 9160

To node5:

cluster_name: Test Cluster
initial_token: 113427455640312821154458202477256070485
seed_provider:
       - seeds: "node3"
listen_address: node5
rpc_address: node5
rpc_port: 9160

You can generate tokens for the nodes using the script available in http://www.datastax.com/docs/0.8/install/cluster_init#calculating-tokens-for-a-single-data-center.
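
The initial_token values used in this example are consistent with dividing the token range of Cassandra's RandomPartitioner evenly among the nodes. Assuming that partitioner, the token of node i (counting from 0) in a cluster of N nodes can be worked out as follows:

```latex
t_i = \left\lfloor \frac{i \cdot 2^{127}}{N} \right\rfloor, \qquad i = 0, 1, \dots, N-1

N = 3:\quad
t_0 = 0,\quad
t_1 = 56713727820156410577229101238628035242,\quad
t_2 = 113427455640312821154458202477256070485
```

If you add nodes later, recalculate the tokens for the new value of N so that each node again owns an equal share of the range.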

Data receiver configurations

Change the <BAM_HOME>/repository/conf/cassandra-component.xml file in node1 and node2 as follows. This connects the nodes to the Cassandra endpoints.

<Cassandra>
    <Cluster>
        <Name>Test Cluster</Name>
        <Nodes>node3:9160,node4:9160,node5:9160</Nodes>
        <DefaultPort>9160</DefaultPort>
        <AutoDiscovery disable="false" delay="1000"/>
    </Cluster>
</Cassandra>

Edit the <BAM_HOME>/repository/conf/advanced/streamdefn.xml file in node1 and node2 as follows. This sets the replication factor and the read/write consistency levels that the data receivers use when writing data to Cassandra.

<StreamDefinition>
    <ReplicationFactor>3</ReplicationFactor>
    <ReadConsistencyLevel>QUORUM</ReadConsistencyLevel>
    <WriteConsistencyLevel>QUORUM</WriteConsistencyLevel>
    <StrategyClass>org.apache.cassandra.locator.SimpleStrategy</StrategyClass>
</StreamDefinition>
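
With a replication factor of 3 and QUORUM for both reads and writes, each operation needs a response from a majority of the replicas. The short sketch below works through that arithmetic. The function name quorum_size is illustrative and is not a Cassandra or BAM command.

```shell
#!/bin/sh
# Quorum size for a given replication factor: floor(RF / 2) + 1.
# ASSUMPTION: quorum_size is an illustrative helper name.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

rf=3
q=$(quorum_size "$rf")
echo "replication factor $rf -> quorum $q, tolerates $((rf - q)) unavailable replica(s)"
```

For this setup the quorum is 2, so reads and writes can still succeed while one of the three Cassandra nodes is down.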

Configuring data analyzer cluster

The data analyzer cluster uses the Registry to store metadata related to Hive scripts and scheduled tasks. It uses Apache Zookeeper (version 3.3.6 in this example) to handle the coordination required by the nodes in the Analyzer cluster when running Hive scripts. These settings ensure high availability using a failover mechanism, so that if one node fails, the rest can take up its load and complete the task. The diagram below depicts this setup:

Figure4: BAM data analyzer cluster

The BAM nodes in the analyzer cluster are used for three main purposes:

Submit analytics queries to Hadoop cluster periodically as scheduled
Receive data from data agents and persist them to Cassandra cluster
Host end-user dashboards

Let's see how to configure the analyzer cluster with node1 and node2.

Download and extract WSO2 BAM to node1 and node2 and execute steps 2 to 4 in both nodes.

Configuring registry sharing
Place the MySQL connector JAR inside the <BAM_HOME>/repository/components/lib folder.

Add the following datasource configuration to the <BAM_HOME>/repository/conf/datasources/master-datasources.xml file. Be sure to change the database URL and credentials according to your environment. In this example, the shared registry uses the WSO2_REG_DB datasource.

<datasource>
      <name>WSO2_REG_DB</name>
      <description>The datasource used for config</description>
      <jndiConfig>
          <name>jdbc/WSO2RegDB</name>
      </jndiConfig>
      <definition type="RDBMS">
          <configuration>
                   <url>jdbc:mysql://[host]:[port]/[reg-db]</url>
                   <username>reg_user</username>
                   <password>password</password>
                   <driverClassName>com.mysql.jdbc.Driver</driverClassName>
                   <maxActive>50</maxActive>
                   <maxWait>60000</maxWait>
                   <testOnBorrow>true</testOnBorrow>
                   <validationQuery>SELECT 1</validationQuery>
                   <validationInterval>30000</validationInterval>
           </configuration>
      </definition>
</datasource>

Add the following to the <BAM_HOME>/repository/conf/registry.xml file.

<dbConfig name="wso2GReg">
     <dataSource>jdbc/WSO2RegDB</dataSource>
 </dbConfig>
 
 <remoteInstance url="https://localhost:9443/registry">
     <id>registryInstance</id>
     <dbConfig>wso2GReg</dbConfig>
     <readOnly>false</readOnly>
     <registryRoot>/</registryRoot>
 </remoteInstance>
 
 <mount path="/_system/config" overwrite="true">
     <instanceId>registryInstance</instanceId>
     <targetPath>/_system/config</targetPath>
 </mount>
 
 <mount path="/_system/governance" overwrite="true">
     <instanceId>registryInstance</instanceId>
     <targetPath>/_system/governance</targetPath>
 </mount>

Execute the <BAM_HOME>/dbscripts/mysql.sql script as reg_user in the MySQL reg-db database. It creates the Registry schema for you.
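
The schema script can be loaded with the standard mysql command-line client. The sketch below only builds and prints the command so that you can review it first. The helper name registry_schema_cmd, the host name dbhost and the path /opt/wso2bam are assumptions for illustration; substitute your own MySQL host and <BAM_HOME>.

```shell
#!/bin/sh
# Build the command that loads the Registry schema into MySQL.
# ASSUMPTIONS: registry_schema_cmd is an illustrative helper; "dbhost" and
# "/opt/wso2bam" below are placeholders for your MySQL host and <BAM_HOME>.
registry_schema_cmd() {
  bam_home="$1"; db_host="$2"; db_name="$3"; db_user="$4"
  # A bare -p makes the mysql client prompt for the password.
  echo "mysql -h ${db_host} -u ${db_user} -p ${db_name} < ${bam_home}/dbscripts/mysql.sql"
}

registry_schema_cmd /opt/wso2bam dbhost reg-db reg_user
```

The reg-db database must already exist and reg_user must have privileges on it before the printed command is run.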
Configuring Zookeeper ensemble
Download and extract Zookeeper to a preferred location in node3. This location is referred to as <ZOO_HOME> throughout this section. Download URL for version 3.3.6 is http://apache.osuosl.org/zookeeper/zookeeper-3.3.6/zookeeper-3.3.6.tar.gz.

Create a file named zoo.cfg inside <ZOO_HOME>/conf and add the three nodes as follows. Replace <ZOO_HOME> in dataDir with the actual path to the Zookeeper installation directory.

tickTime=2000
dataDir=<ZOO_HOME>/data
clientPort=2181
initLimit=5
syncLimit=2
server.3=node3:2888:3888
server.4=node4:2888:3888
server.5=node5:2888:3888

Create a new directory called data under <ZOO_HOME> to hold Zookeeper data.
Add a file named myid, containing the server number, to the dataDir directory. For example, for node3, put 3 inside the myid file.
Follow steps 6 to 9 for node4 and node5 as well.
Go to <ZOO_HOME>/bin and execute the following command to start the Zookeeper daemon in each of the nodes: sh zkServer.sh start
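
The Zookeeper steps above can be scripted per node. The following minimal sketch writes zoo.cfg and myid for one member of the ensemble. The helper name write_zk_config and the example installation path are assumptions. The sketch writes an absolute dataDir because Zookeeper reads zoo.cfg as a plain properties file and does not expand shell variables in it.

```shell
#!/bin/sh
# Write zoo.cfg and myid for one member of the three-node ensemble.
# ASSUMPTION: write_zk_config is an illustrative helper, not part of Zookeeper.
write_zk_config() {
  zoo_home="$1"   # absolute path to the Zookeeper installation
  my_id="$2"      # 3 on node3, 4 on node4, 5 on node5
  mkdir -p "$zoo_home/conf" "$zoo_home/data"
  cat > "$zoo_home/conf/zoo.cfg" <<EOF
tickTime=2000
dataDir=$zoo_home/data
clientPort=2181
initLimit=5
syncLimit=2
server.3=node3:2888:3888
server.4=node4:2888:3888
server.5=node5:2888:3888
EOF
  # myid must contain only the number used in this node's server.N line.
  echo "$my_id" > "$zoo_home/data/myid"
}

# Example for node3 (installation path assumed):
# write_zk_config /home/hadoop/zookeeper-3.3.6 3
```

Run the helper on each of the three nodes with that node's own number, then start the daemons as described above.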

Configuring BAM analyzer nodes

Add the following to the <BAM_HOME>/repository/conf/etc/coordination-client-config.xml file of node2. It specifies the three Zookeeper servers in this setup with their client ports.

<CoordinationClientConfiguration enabled="true">
        <Servers>
            <Server host="node3" port="2181"/>
            <Server host="node4" port="2181"/>
            <Server host="node5" port="2181"/>
        </Servers>
        <SessionTimeout>5000</SessionTimeout>
</CoordinationClientConfiguration>

Add the following to the <BAM_HOME>/repository/conf/etc/tasks-config.xml file in the Analyzer nodes.
```
<taskServerMode>CLUSTERED</taskServerMode>
<taskServerCount>1</taskServerCount>
```

Modify <BAM_HOME>/repository/conf/advanced/hive-site.xml as follows. It has a line added to hive.aux.jars.path property to include mysql connector JAR in Hadoop job execution runtime.

<property>
   <name>hadoop.embedded.local.mode</name>
   <value>false</value>
</property>
 
<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/user/hive/warehouse</value>
   <description>location of default database for the warehouse</description>
</property>
 
<property>
   <name>fs.default.name</name>
   <value>hdfs://node1:9000</value>
</property>
 
<property>
   <name>mapred.job.tracker</name>
   <value>node1:9001</value>
</property>
 
<property>
   <name>hive.aux.jars.path</name>
   <value>file://${CARBON_HOME}/repository/components/plugins/apache-cassandra_1.1.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/guava_12.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/json_2.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-dbcp_1.4.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-pool_1.5.6.wso2v1.jar,file://${CARBON_HOME}/repository/components/lib/mysql-connector-java-5.1.5-bin.jar
   </value>
</property>

Add the MySQL connector JAR to the <BAM_HOME>/repository/components/lib directory.

Add the following to WSO2BAM_DATASOURCE in the <BAM_HOME>/repository/conf/datasources/master-datasources.xml file of node2. Be sure to change the database URL and credentials according to your environment. WSO2BAM_DATASOURCE is the default data source available in BAM, and it should be connected to the database you are using. This example uses the bam-db database to store BAM summary data.

<datasource>
    <name>WSO2BAM_DATASOURCE</name>
    <description>The datasource used for registry and user manager</description>
    <jndiConfig>
         <name>jdbc/WSO2CarbonDB</name>
    </jndiConfig>
    <definition type="RDBMS">
         <configuration>
             <url>jdbc:mysql://[host]:[port]/[bam-db]</url>
             <username>bam_user</username>
             <password>password</password>
             <driverClassName>com.mysql.jdbc.Driver</driverClassName>
             <maxActive>50</maxActive>
             <maxWait>60000</maxWait>
             <testOnBorrow>true</testOnBorrow>
             <validationQuery>SELECT 1</validationQuery>
             <validationInterval>30000</validationInterval>
          </configuration>
    </definition>
 </datasource>

Repeat the BAM analyzer node configurations in node1 as well.
Start the BAM server in node2 and remove the BAM Tool Box Deployer feature using the Feature Manager. We remove the feature because having deployers in both Analyzer BAM nodes interferes with proper Hive task fail-over functionality.

When starting BAM instances, use the disable.cassandra.server.startup property to prevent the Cassandra server bundled with BAM from starting, because the nodes need to point to the external Cassandra cluster.
sh wso2server.sh -Ddisable.cassandra.server.startup=true