Fully-Distributed, High-Availability BAM Setup
Shown in the diagram below is an example deployment of a fully-distributed, high-availability BAM setup. In this diagram, several data agents pass data to the BAM setup for analysis and summarization. For demonstration purposes, we have used two BAM nodes in this setup but you can extend it to add as many nodes as needed.
WSO2 BAM has three main components as data gathering, analysis and presentation. Each component is explained in detail in About BAM. We can take each of these components in the two BAM nodes and deploy them in separate clusters as shown in the diagram below:
Figure1: Fully-distributed BAM clustered setup
This setup persists and processes data in a distributed manner achieving high scalability and high availability in data collection, summarization and presentation layers. G iven below is an explanation of each component in the above diagram.
Data receiver cluster
This includes data agents such as WSO2 Application Server (or any other service-hosting product) and WSO2 ESB capturing and transferring data to subscribed storage units. The default storage unit is Cassandra that comes with WSO2 BAM. You can set up a Cassandra cluster as well. You can also set up data agents to talk to multiple data receivers in a load-balanced manner to ensure high availability. This way, if one receiver node fails, data agents can still transfer data over the other nodes. The following diagram depicts a common data receiver clustering pattern:
Figure2: Data receiver cluster
Cassandra cluster
Data that comes to BAM through data receivers is usually stored in the default Cassandra database. Figure1 above shows how the Cassandra databases of all twoĀ BAM nodes are deployed in a cluster. This ensures that even if one node fails, data can be received and stored in other databases in the cluster, and also ensures high availability of data to run the Hive scripts on.
Data analyzer cluster
Ā Data analyzer component of each BAM node uses Hive query language scripts to retrieve data from the Cassandra cluster, process the data into meaningful information, and save the information in an RDBMS. In this example, we use MySQL as the RDBMS. You get an H2 database with WSO2 BAM by default but it is not recommended in a high volume, production setting. The analyzer components in node1 and node2 are clustered in this setup and it extends the data processing part to yet another external Apache Hadoop cluster.
Hadoop cluster
WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework. Hadoop facilitates scaling BAM to handle large data volumes and uses Apache Hive for creating and executing analytic jobs.Ā By default, Hive submits analytic jobs to a Hadoop instance running in local mode. But, you can set up a multi-node Hadoop cluster externally, and point to it.Ā
Although BAM uses Apache Hadoop's MapReduce technology underneath, you do not have to write complex Hadoop jobs to process data. BAM decouples you from this underlying complexities and enables you to write data processing queries in SQL-like Hive query language. Hive provides you the right level of abstraction from Hadoop while internally submitting the analytic jobs to Hadoop. It spawns a Hadoop JVM internally or delegates to a Hadoop cluster.
Zookeeper cluster
Apache Zookeeper manages coordination required by the nodes in the Analyzer cluster when running Hive scripts. Zookeeper can run separately or embedded within BAM. In this example setup, we have clustered three Zookeeper instances running on each node.
Presentation cluster
The presentation layer of BAM consists ofdashboards, reports, gadgets and other user interfaces. They can be set up in distributed manner as depicted in the following figure:
Figure3: Scaled up dashboards
Follow the steps below to configure this deployment.
Before you start, note the following:
- The steps below are for Linux environment.Ā
- This example uses the hostnames of the nodes as node1, node2, node3, node4 and node 5 corresponding to Figure1 above.
- Be sure to configure the host entries in all nodes.
- This example uses Hadoop version 1.0.4, Zookeeper 3.3.6 and MySQL 5.1.
Configuring data receivers
Data receiver clustering is handled from the client side. For instructions on setting up data agents to talk to multiple data receivers, seeĀ Setting up Multi Receiver and Load Balancing Data Agent.
Configuring the Hadoop cluster
Let's see how to configure the Hadoop cluster. Execute the following steps in Ā all nodes in the BAM deployment unless otherwise specified.
- Install Java in a location that all the user groups can access. For example, Ā
opt/java/jdk-1.6_29
. Ā - Install
rsync
usingapt-get
in order to copy the Hadoop configurations across all nodes. Ā - Create a user by the name hadoop with the command:
useradd -m hadoop Ā
Ā - Log in as user hadoop using the command:
su - hadoop
Key exchange for Passphraseless SSH Ā - We need to have password/passphraseless SSH to communicate with other Hadoop nodes in the cluster. To establish an SSH with another node from node1, use the command:
ssh hadoop@node2
. To avoid this command requesting a password, set up the SSH key exchange among the Hadoop nodes. Ā - Generate a key for the name node using the command:
ssh-keygen
.Ā It creates an .ssh directory inside the user account of the user hadoop.Ā Ā - Inside the generated .ssh Ā directory, there is a file with the key. Append this public key of node1 to the
authorized_keys
file in the other Ā Hadoop nodes (node2 to node5) by executing the following commands and copying theid_rsa.pub
Ā file into the other nodes.
scp id_rsa.pub hadoop@node2:/home/hadoop
scp id_rsa.pub hadoop@node3:/home/hadoop
scp id_rsa.pub hadoop@ node4:/home/hadoop
scp id_rsa.pub hadoop@node5:/home/hadoop - Log in to the second Hadoop node's hadoop user account and establish an SSH connection to another node from it . Use the command:
ssh hadoop@node3
. It creates the .ssh directory in the hadoop account. - Append the copied public key to the
authorized_key
file in the hadoop account of node2. Ā Execute the following commands.
cat /home/hadoop/id_rsa.pub Ā > /home/hadoop/.ssh/authorized_keys
Ā Ā Ā Ā
chown hadoop:hadoop authorized_keys
chmod 600 authorized_keys - Now you can ssh to node1 from node2 without a password prompt. Log in to the Master node. Ā From the hadoop account, log in to node2 using either of the commands:
ssh -i id_rsa hadoop@192.168.4.101
ssh hadoop@192.168.4.101
If yo still cannot establish an SSH connection to node2 without a password, run the following commands to node2. Ā
cd ~ cd .ssh chmod og-rw authorized_keys chmod a-x authorized_keys cd ~ chmod 700 .ssh cd /home chmod go-wrx hadoop
- Carry out steps 4 to 6 on all other nodes as well. (node3 to node 5).
Configuring the master node
- Define JAVA_HOME in <HADOOP_HOME>/conf/hadoop-env.sh file:
export JAVA_HOME=/opt/java/jdk1.6.0_29
. <HADOOP_HOME> refers to the path to Hadoop installation directory throughout this guide. Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:
Ā<property> <name>fs.default.name</name> <value>hdfs://node1:9000</value> </property> <property> <name>fs.hdfs.impl</name> <value>org.apache.hadoop.hdfs.DistributedFileSystem</value> <description>The FileSystem for hdfs: uris.</description> </property> <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop</value> </property>
Ā
E dit the <HADOOP_HOME>/conf/hdfs-site.xml as follows:
<configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.name.dir</name> <value><HADOOP_HOME>/dfs/name</value> </property> <property> <name>dfs.data.dir</name> <value><HADOOP_HOME>/dfs/data</value> </property> </configuration>
Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:
<configuration> <property> <name>mapred.job.tracker</name> <value>node1:9001</value> </property> <property> <name>mapred.system.dir</name> <value><HADOOP_HOME>/mapred/system</value> </property> </configuration>
- Edit <HADOOP_HOME>/conf/masters as follows:
In node2, edit <HADOOP_HOME>/conf/hadoop-policy.xml as follows. It enables write access for hadoop user to Hadoop nodes.
<property> <name>security.job.submission.protocol.acl</name> <value>hadoop</value> </property>
- Edit <HADOOP_HOME>/conf/slaves as
node3
node4
node5
Syncing Hadoop configurations across all nodesĀ
- Log in to the Master Hadoop node's hadoop account. From the Hadoop installation directory, execute the command below in order to propagate Hadoop configurations and binaries to node2:
rsync -a -e ssh . hadoop@node2:/home/hadoop/hadoop.
Ā - Remove records from <HADOOP_HOME>/conf/masters and slaves in node2 files.
- Be sure to repeat step 18 and 19 above in all other nodes (node 3 to node 5). Ā
- From the master node's Hadoop installation directory, execute the following command to format the Ā namenode:
bin/hadoop namenode -format
. Ā - Start the name node with the command:
sh start-all.sh
. All nodes should be started simultaneously .Ā
Configuring the Cassandra cluster
Before you start, increase the heap memory size of BAM nodes to at least 2 GB and sync times in all nodes.
Add the following configurations to
<
BAM_HOME>/repository/conf/etc/cassandra.yaml
file in the nodes mentioned below.
To node1:cluster_name: Test Cluster initial_token: 0 seed_provider: - seeds: "node3" listen_address: node3 rpc_address: node3 rpc_port: 9160
to node2:
cluster_name: Test Cluster initial_token: 56713727820156410577229101238628035242 seed_provider: - seeds: "node3" listen_address: node4 rpc_address: node4 rpc_port: 9160
to node3:
cluster_name: Test Cluster initial_token: 113427455640312821154458202477256070485 seed_provider: - seeds: "node3" listen_address: node5 rpc_address: node5 rpc_port: 9160
You can generate tokens for the nodes using the script available inĀ http://www.datastax.com/docs/0.8/install/cluster_init#calculating-tokens-for-a-single-data-center.
Data receiver configurations
Change the <
BAM_HOME>/repository/conf/cassandra-component.xml
file in node1 and node2 as follows. This connects the nodes to Cassandra endpoints.<Cassandra> <Cluster> <Name>Test Cluster</Name> <Nodes>node3:9160,node4:9160,node5:9160</Nodes> <DefaultPort>9160</DefaultPort> <AutoDiscovery disable="false" delay="1000"/> </Cluster> </Cassandra>
Edit the <BAM_HOME>/repository/conf/advanced/streamdefn.xml file in node1 and node2 as follows. This changes replication factor and read/write consistency levels using which data receivers write data to Cassandra.
<StreamDefinition> <ReplicationFactor>3</ReplicationFactor> <ReadConsistencyLevel>QUORUM</ReadConsistencyLevel> <WriteConsistencyLevel>QUORUM</WriteConsistencyLevel> <StrategyClass>org.apache.cassandra.locator.SimpleStrategy</StrategyClass> </StreamDefinition>
Configuring data analyzer cluster
The data analyzer cluster uses the Registry to store metadata related to Hive scripts and scheduled tasks. It uses Apache Zookeeper (version 3.3.6 in this example) to handle coordination required by the nodes in the Analyzer clusters when running Hive scripts. These settings ensure high availability using a failover mechanism so that if one node fails, the rest can take up its load and complete the task. T he diagram below depicts this setup:
Figure3: BAM data analyzer cluster
The BAM nodes in the analyzer cluster are used for three main purposes:
- Submit analytics queries to Hadoop cluster periodically as scheduled Ā
- Receive data from data agents and persist them to Cassandra cluster
- Host end-user dashboards
Let's see how to configure the analyzer cluster with node1 and node2.Ā
- Download and extract WSO2 BAM to node1 and node2 and execute steps 2 to 4 in all 2 nodes.
Configuring registry sharing
- Place mysql connector jar inside
<BAM_HOME>/repository/components/lib
folder. Add the followingĀ datasource configuration in
<BAM_HOME>/repository/conf/datasources/
master-datasources.xml
file. Be sure to change the database URL and credentials according to your environment. TheWSO2_REG_DB
database is used in this example by the shared registry.
<datasource> <name>WSO2_REG_DB</name> <description>The datasource used for config</description> <jndiConfig> <name>jdbc/WSO2RegDB</name> </jndiConfig> <definition type="RDBMS"> <configuration> <url>jdbc:mysql://[host]:[port]/[reg-db]</url> <username>reg_user</username> <password>password</password> <driverClassName>com.mysql.jdbc.Driver</driverClassName> <maxActive>50</maxActive> <maxWait>60000</maxWait> <testOnBorrow>true</testOnBorrow> <validationQuery>SELECT 1</validationQuery> <validationInterval>30000</validationInterval> </configuration> </definition> </datasource>
- Add the following to
<BAM_HOME>/repository/conf/registry.xml
file.Ā
<dbConfig name="wso2GReg"> <dataSource>jdbc/WSO2RegDB</dataSource> </dbConfig> <remoteInstance url="https://localhost:9443/registry"> <id>registryInstance</id> <dbConfig>wso2GReg</dbConfig> <readOnly>false</readOnly> <registryRoot>/</registryRoot> </remoteInstance> <mount path="/_system/config" overwrite="true"> <instanceId>registryInstance</instanceId> <targetPath>/_system/config</targetPath> </mount> <mount path="/_system/governance" overwrite="true"> <instanceId>registryInstance</instanceId> <targetPath>/_system/governance</targetPath> </mount>
- ExecuteĀ
<BAM_HOME>/dbscripts/
script asmysql.sql
reg_user
in MySQLreg-db
database. It creates the Registry schema for you.Configuring Zookeeper ensemble
- Download and extract Zookeeper to a preferred location in node3. This location is referred to as <ZOO_HOME> throughout this section. Download URL for version 3.3.6 is http://apache.osuosl.org/zookeeper/zookeeper-3.3.6/zookeeper-3.3.6.tar.gz.Ā Ā
Create a file named
zoo.cfg
inside<ZOO_HOME>/conf
and add the three nodes as follows.tickTime=2000 dataDir=$ZOO_HOME/data clientPort=2181 tickTime=2000 initLimit=5 syncLimit=2 server.3=node3:2888:3888 server.4=node4:2888:3888 server.5=node5:2888:3888
- Create a new directory called
data
under<ZOO_HOME>
to hold Zookeeper data. - Add a file with name
myid
containing the server number in dataDir. For example, for node3, put 3 insideĀmyid
file. - Follow steps 6 to 9 for node 4 and node 5 as well. Ā
- Go to
<ZOO_HOME>/bin
and execute the following command to start the Zookeeper daemon in each of the nodes:sh zkServer.sh start
Configuring BAM analyzer nodes Add the following to
<BAM_HOME>/repository/conf/etc/coordination-client-config.xml
file of node2. It specifies the two BAM analyzer nodes in this setup with their ports.
<CoordinationClientConfiguration enabled="true"> <Servers> <Server host="node3" port="2181"/> <Server host="node4" port="2181"/> <Server host="node5" port="2181"/> </Servers> <SessionTimeout>5000</SessionTimeout> </CoordinationClientConfiguration>
Add the following to
<B
AM_HOME>/repository/conf/etc/t
asks-config.xml
file in the Analyzer nodes. Ā<taskServerMode>CLUSTERED</taskServerMode> <taskServerCount>1</taskServerCount>
Modify <BAM_HOME>/repository/conf/advanced/hive-site.xml as Ā follows. It has a line added to
hive.aux.jars.path
property toĀ include mysql connector JAR in Hadoop job execution runtime.
<property> <name>hadoop.embedded.local.mode</name> <value>false</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/hive/warehouse</value> <description>location of default database for the warehouse</description> </property> <property> <name>fs.default.name</name> <value>hdfs://node1:9000</value> </property> <property> <name>mapred.job.tracker</name> <value>node1:9001</value> </property> <property> <name>hive.aux.jars.path</name> <value>file://${CARBON_HOME}/repository/components/plugins/apache- cassandra_1.1.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/guava_12.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/json_2.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-dbcp_1.4.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-pool_1.5.6.wso2v1.jar,file://${CARBON_HOME}/repository/components/lib/mysql-connector-java-5.1.5-bin.jar </value> </property>
- Add the mysql connector jar to
<BAM_HOME>/repository/components/lib
directory. Add the following to
WSO2BAM_DATASOURCE
in<
BAM_HOME>/repository/conf/datasources/
master-datasources.xml
file of node2. Be sure to change the database URL and Ā credentials according to your environment.WSO2BAM_DATASOURCE
is the default data source available in BAM and it should be connected with the database you are using. This example usesĀbam-db
database to store BAM summary data.
<datasource> <name>WSO2BAM_DATASOURCE</name> <description>The datasource used for registry and user manager</description> <jndiConfig> <name>jdbc/WSO2CarbonDB</name> </jndiConfig> <definition type="RDBMS"> <configuration> <url>jdbc:mysql://[host]:[port]/[bam-db]</url> <username>bam_user</username> <password>password</password> <driverClassName>com.mysql.jdbc.Driver</driverClassName> <maxActive>50</maxActive> <maxWait>60000</maxWait> <testOnBorrow>true</testOnBorrow> <validationQuery>SELECT 1</validationQuery> <validationInterval>30000</validationInterval> </configuration> </definition> </datasource>
- Repeat the BAM analyzer node configurations in node1 as well.
Start the BAM server in node2 and remove BAM Tool Box Deployer feature using feature manager. We remove the feature because having deployers in both Analyzer BAM nodes interferes with proper Hive taskĀ fail-over functionality.
When starting BAM instances, use
disable.cassandra.server.startup
property to stop starting Cassandra bundled with BAM by default. We need to point to the external Cassandra cluster.
sh wso2server.sh -Ddisable.cassandra.server.startup=true