Configuring Hadoop Cluster
WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework. Hadoop facilitates scaling BAM to handle large data volumes, and BAM uses Apache Hive for creating and executing analytic jobs. By default, Hive submits analytic jobs to a Hadoop instance running in local mode. However, you can set up a multi-node Hadoop cluster externally and point BAM to it.
Although BAM uses Apache Hadoop's MapReduce technology underneath, you do not have to write complex Hadoop jobs to process data. BAM decouples you from these underlying complexities and enables you to write data processing queries in the SQL-like Hive query language. Hive provides the right level of abstraction from Hadoop while internally submitting the analytic jobs to it, either by spawning a Hadoop JVM locally or by delegating to a Hadoop cluster.
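For example, a typical analytic job is simply a Hive query. The following is a minimal sketch; the table and column names are purely illustrative and do not come from any BAM toolbox:
-- Illustrative only: hypothetical table and column names
CREATE TABLE IF NOT EXISTS service_stats (service_name STRING, response_time INT);
SELECT service_name, AVG(response_time) AS avg_response_time
FROM service_stats
GROUP BY service_name;
Hive compiles such queries into MapReduce jobs and runs them on whichever Hadoop instance it is pointed at.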
Configuring the Hadoop cluster
Let's see how to configure the Hadoop cluster. Execute the following steps on all nodes in the BAM deployment unless otherwise specified (see here for more information on this).
- Install Java in a location that all the user groups can access. For example, /opt/java/jdk1.6.0_29.
- Install rsync using apt-get in order to copy the Hadoop configurations across all nodes.
- Create a user named hadoop with the command:
adduser hadoop
- Log in as user hadoop using the command:
su - hadoop
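Throughout this guide, the nodes are referred to by the hostnames node1 to node5 (node1 being the master). Make sure every node can resolve the others by these names, for example via DNS or /etc/hosts. The addresses below are purely illustrative; only 192.168.4.101 (node2) appears later in this page:
# Illustrative /etc/hosts entries; replace with the actual addresses of your nodes
192.168.4.100   node1
192.168.4.101   node2
192.168.4.102   node3
192.168.4.103   node4
192.168.4.104   node5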
Key exchange for passphraseless SSH
- Passphraseless SSH is needed to communicate with the other Hadoop nodes in the cluster. To establish an SSH connection to another node from node1, use the command:
ssh hadoop@node2
To avoid this command requesting a password, set up the SSH key exchange among the Hadoop nodes.
- Generate a key for the name node (node1) using the command:
ssh-keygen
It creates a .ssh directory inside the user account of the user hadoop.
- Inside the generated .ssh directory, there is an id_rsa.pub file containing the public key. Append this public key of node1 to the authorized_keys file in the other Hadoop nodes (node2 to node5) by copying the id_rsa.pub file into the other nodes with the following commands:
scp id_rsa.pub hadoop@node2:/home/hadoop
scp id_rsa.pub hadoop@node3:/home/hadoop
scp id_rsa.pub hadoop@node4:/home/hadoop
scp id_rsa.pub hadoop@node5:/home/hadoop
- Log in to the second Hadoop node's hadoop user account and establish an SSH connection to another node from it. Use the command:
ssh hadoop@node3
It creates the .ssh directory in the hadoop account.
- Append the copied public key to the authorized_keys file in the hadoop account of node2. Execute the following commands:
cat /home/hadoop/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
chown hadoop:hadoop /home/hadoop/.ssh/authorized_keys
chmod 600 /home/hadoop/.ssh/authorized_keys
- Now you can SSH to node2 from node1 without a password prompt. Log in to the master node. From its hadoop account, log in to node2 using either of the commands:
ssh -i ~/.ssh/id_rsa hadoop@192.168.4.101
ssh hadoop@192.168.4.101
If you still cannot establish an SSH connection to node2 without a password, run the following commands on node2.
cd ~
cd .ssh
chmod og-rw authorized_keys
chmod a-x authorized_keys
cd ~
chmod 700 .ssh
cd /home
chmod go-wrx hadoop
- Carry out steps 4 to 6 on all other nodes as well (node3 to node5).
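As an optional shortcut to the manual scp and cat steps above, the standard ssh-copy-id utility that ships with most OpenSSH client packages performs the same key exchange. A minimal sketch, run from node1's hadoop account:
# Appends the public key to the remote hadoop account's authorized_keys
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@node2
# Repeat for node3, node4 and node5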
Configuring the master node
- Define JAVA_HOME in the <HADOOP_HOME>/conf/hadoop-env.sh file:
export JAVA_HOME=/opt/java/jdk1.6.0_29
<HADOOP_HOME> refers to the path to the Hadoop installation directory throughout this guide.
- Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1:9000</value>
  </property>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
</configuration>
- Edit the <HADOOP_HOME>/conf/hdfs-site.xml file as follows:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value><HADOOP_HOME>/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value><HADOOP_HOME>/dfs/data</value>
  </property>
</configuration>
- Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node1:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value><HADOOP_HOME>/mapred/system</value>
  </property>
</configuration>
- Edit the <HADOOP_HOME>/conf/masters file and enter the following into it:
node2
- In node1 and node2, edit <HADOOP_HOME>/conf/hadoop-policy.xml as follows. It grants the hadoop user permission to submit jobs to the Hadoop nodes.
<property>
  <name>security.job.submission.protocol.acl</name>
  <value>hadoop</value>
</property>
- Edit <HADOOP_HOME>/conf/slaves as follows:
node3
node4
node5
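With the cluster configured, BAM's Hive engine must be pointed to it instead of the default local mode. A minimal sketch of the relevant properties is shown below; the configuration file that BAM reads them from varies by release (for example, some BAM 2.x releases use <BAM_HOME>/repository/conf/advanced/hive-site.xml), so treat the location as an assumption and check the documentation for your version:
<!-- Illustrative only: points Hive at the external cluster configured above -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://node1:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>node1:9001</value>
</property>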
Syncing Hadoop configurations across all nodes
- Log in to the master Hadoop node's hadoop account. From the Hadoop installation directory, execute the command below in order to propagate the Hadoop configurations and binaries to node2:
rsync -a -e ssh . hadoop@node2:/home/hadoop/hadoop
- Remove the records from the <HADOOP_HOME>/conf/masters and slaves files in node2.
- Be sure to repeat steps 18 and 19 above on all other nodes (node3 to node5).
- From the master node's Hadoop installation directory, execute the following command to format the namenode:
bin/hadoop namenode -format
Then start the cluster with the command:
bin/start-all.sh
All nodes should be started simultaneously. See here to view the most common issues that you would encounter when setting up a multi-node Hadoop cluster.
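As a quick sanity check after start-up (optional; the daemon names below assume the classic Hadoop 1.x layout used in this guide), you can list the running Java daemons on each node and query HDFS status:
# On the master node (node1): expect NameNode and JobTracker; node2 should show SecondaryNameNode
jps
# On the slave nodes (node3 to node5): expect DataNode and TaskTracker
jps
# From <HADOOP_HOME> on the master: summary of live data nodes and HDFS capacity
bin/hadoop dfsadmin -report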