
Configuring Hadoop Cluster

WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework. Hadoop enables BAM to scale and handle large data volumes, and Apache Hive is used for creating and executing the analytic jobs. By default, Hive submits analytic jobs to a Hadoop instance running in local mode, but you can set up an external, multi-node Hadoop cluster and point BAM to it.

Although BAM uses Apache Hadoop's MapReduce technology underneath, you do not have to write complex Hadoop jobs to process data. BAM decouples you from these underlying complexities and enables you to write data processing queries in the SQL-like Hive query language. Hive provides the right level of abstraction from Hadoop while internally submitting the analytic jobs to it, either by spawning a Hadoop JVM internally or by delegating to a Hadoop cluster.

Configuring the Hadoop cluster

Let's see how to configure the Hadoop cluster. Execute the following steps on all nodes in the BAM deployment unless otherwise specified (see here for more information).

  1. Install Java in a location that all user groups can access. For example, /opt/java/jdk1.6.0_29.
  2. Install rsync using apt-get in order to copy the Hadoop configurations across all nodes.  
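    For example, on a Debian/Ubuntu-based node (an assumption; use your distribution's package manager otherwise):

    sudo apt-get install rsync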
  3. Create a user by the name hadoop with the command: adduser hadoop     
  4. Log in as user hadoop using the command: su - hadoop 

    Key exchange for Passphraseless SSH   
  5. Password-less (passphrase-less) SSH is needed to communicate with the other Hadoop nodes in the cluster. To establish an SSH connection to another node from node1, use the command: ssh hadoop@node2. To avoid this command prompting for a password, set up SSH key exchange among the Hadoop nodes.  
  6. Generate a key pair for the name node using the command: ssh-keygen. It creates a .ssh directory inside the home directory of the hadoop user.   
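    For example, the following generates an RSA key pair with an empty passphrase (the -P "" flag is a convenience assumption so that no passphrase prompt appears later); accept the default file location, /home/hadoop/.ssh/id_rsa, when prompted:

    ssh-keygen -t rsa -P ""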
  7. Inside the generated .ssh directory, there is a file named id_rsa.pub that contains the public key. Append this public key of node1 to the authorized_keys file in the other Hadoop nodes (node2 to node5) by executing the following commands to copy the id_rsa.pub file into those nodes; the key is appended to authorized_keys in the steps that follow.

    scp id_rsa.pub hadoop@node2:/home/hadoop
    scp id_rsa.pub hadoop@node3:/home/hadoop
    scp id_rsa.pub hadoop@node4:/home/hadoop
    scp id_rsa.pub hadoop@node5:/home/hadoop
  8. Log in to the second Hadoop node's hadoop user account and establish an SSH connection to another node from it. Use the command: ssh hadoop@node3. It creates the .ssh directory in the hadoop account.
  9. Append the copied public key to the authorized_keys file in the hadoop account of node2. Execute the following commands.
    cat /home/hadoop/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys  
    chown hadoop:hadoop /home/hadoop/.ssh/authorized_keys 
    chmod 600 /home/hadoop/.ssh/authorized_keys  
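    Alternatively, where the standard OpenSSH ssh-copy-id utility is available (an assumption about your installation), the copy-and-append steps above can be done with a single command run from node1:

    ssh-copy-id hadoop@node2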
           
  10. Now you can SSH from node1 to node2 without a password prompt. Log in to the master node (node1). From the hadoop account, log in to node2 using either of the following commands:
    ssh -i id_rsa hadoop@192.168.4.101  
    ssh hadoop@192.168.4.101

    If you still cannot establish an SSH connection to node2 without a password, run the following commands on node2.  

    cd ~
    cd .ssh
    # authorized_keys must not be readable or writable by group/others
    chmod og-rw authorized_keys
    # and must not be executable by anyone
    chmod a-x authorized_keys
    cd ~
    # the .ssh directory must be accessible only by its owner
    chmod 700 .ssh
    cd /home
    # the home directory must not be accessible by group/others
    chmod go-wrx hadoop
  11. Carry out steps 4 to 6 on all the other nodes (node3 to node5) as well.

    Configuring the master node 

  12. Define JAVA_HOME in the <HADOOP_HOME>/conf/hadoop-env.sh file: export JAVA_HOME=/opt/java/jdk1.6.0_29. <HADOOP_HOME> refers to the path to the Hadoop installation directory throughout this guide.
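    For example, the line could be appended to the file as follows (a sketch, assuming the Hadoop installation lives at /home/hadoop/hadoop, the path used later in this guide):

    echo 'export JAVA_HOME=/opt/java/jdk1.6.0_29' >> /home/hadoop/hadoop/conf/hadoop-env.sh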
  13. Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:
     

     
    <configuration>
    	<property>
       		<name>fs.default.name</name>
       		<value>hdfs://node1:9000</value>
    	</property>
    
    	<property>
       		<name>fs.hdfs.impl</name>
       		<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
       		<description>The FileSystem for hdfs: uris.</description>
    	</property>
    
    	<property>
       		<name>hadoop.tmp.dir</name>
       		<value>/tmp/hadoop</value>
    	</property>
    </configuration>

     

  14. Edit the <HADOOP_HOME>/conf/hdfs-site.xml as follows:

    <configuration>
       <property>
           <name>dfs.replication</name>
           <value>1</value>
       </property>
       
       <property>
          <name>dfs.name.dir</name>
          <value><HADOOP_HOME>/dfs/name</value>
       </property>
    
       <property>
          <name>dfs.data.dir</name>
          <value><HADOOP_HOME>/dfs/data</value>
       </property>
    </configuration>
  15. Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:

    <configuration>
       <property>
          <name>mapred.job.tracker</name>
          <value>node1:9001</value>
       </property>
    
       <property>
      	  <name>mapred.system.dir</name>
      	  <value><HADOOP_HOME>/mapred/system</value>
       </property>
    </configuration>
  16. Edit the <HADOOP_HOME>/conf/masters file and enter the following into it:

    node2
  17. In node1 and node2, edit <HADOOP_HOME>/conf/hadoop-policy.xml as follows. This grants the hadoop user access to submit jobs to the Hadoop nodes.

    <property>
       <name>security.job.submission.protocol.acl</name>
       <value>hadoop</value>
    </property>
    Be sure to replace the <HADOOP_HOME> placeholder that appears in the above configurations with the actual path to the Hadoop installation directory.
  18. Edit <HADOOP_HOME>/conf/slaves as follows:
    node3
    node4
    node5

    Syncing Hadoop configurations across all nodes   

  19. Log in to the master Hadoop node's hadoop account. From the Hadoop installation directory, execute the command below in order to propagate the Hadoop configurations and binaries to node2: rsync -a -e ssh . hadoop@node2:/home/hadoop/hadoop.   
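    To propagate the same directory to all the other nodes in one go, a loop similar to the following could be used (a sketch, assuming the /home/hadoop/hadoop target path on every node; adjust the node names to match your cluster):

    for node in node2 node3 node4 node5; do
        rsync -a -e ssh . hadoop@${node}:/home/hadoop/hadoop
    done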
  20. Remove the entries from the <HADOOP_HOME>/conf/masters and <HADOOP_HOME>/conf/slaves files in node2.  
  21. Be sure to repeat steps 18 and 19 above on all the other nodes (node3 to node5).  
  22. From the master node's Hadoop installation directory, execute the following command to format the namenode: bin/hadoop namenode -format.  
  23. Start the cluster with the command: sh start-all.sh, executed from the <HADOOP_HOME>/bin directory of the master node. This starts the Hadoop daemons on all the nodes simultaneously.  
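    To verify that the cluster came up, you could run jps on each node (assuming a JDK is on the path); the master should list processes such as NameNode and JobTracker, and the slave nodes should list DataNode and TaskTracker. An HDFS status report, run from the master node's Hadoop installation directory, is another quick check:

    jps
    bin/hadoop dfsadmin -report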

    See here to view the most common issues that you might encounter when setting up a multi-node Hadoop cluster.