
Configuring Hadoop Cluster

WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework. Hadoop enables BAM to scale and handle large data volumes, and Apache Hive is used for creating and executing the analytic jobs. By default, Hive submits analytic jobs to a Hadoop instance running in local mode, but you can set up an external, multi-node Hadoop cluster and point BAM to it.

Although BAM uses Apache Hadoop's MapReduce technology underneath, you do not have to write complex Hadoop jobs to process data. BAM decouples you from these underlying complexities and enables you to write data processing queries in the SQL-like Hive query language. Hive provides the right level of abstraction from Hadoop while internally submitting the analytic jobs to it, either by spawning a Hadoop JVM internally or by delegating to a Hadoop cluster.

Configuring the Hadoop cluster

Let's see how to configure the Hadoop cluster. Execute the following steps on all nodes in the BAM deployment unless otherwise specified (see here for more information).

  1. Install Java in a location that all user groups can access. For example, /opt/java/jdk1.6.0_29.
  2. Install rsync using apt-get in order to copy the Hadoop configurations across all nodes.  
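    For example, on a Debian/Ubuntu-based node (an assumption; use your distribution's package manager otherwise):

    sudo apt-get install rsync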
  3. Create a user by the name hadoop with the command: adduser hadoop     
  4. Log in as user hadoop using the command: su - hadoop 

    Key exchange for Passphraseless SSH   
  5. Password-less (passphrase-less) SSH is needed to communicate with the other Hadoop nodes in the cluster. To establish an SSH connection to another node from node1, use the command: ssh hadoop@node2. To avoid this command prompting for a password, set up SSH key exchange among the Hadoop nodes.  
  6. Generate a key pair for the name node using the command: ssh-keygen. It creates a .ssh directory inside the home directory of the hadoop user.   
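    For example, the following generates an RSA key pair with an empty passphrase (the -P "" flag is a convenience assumption so that no passphrase prompt appears later); accept the default file location, /home/hadoop/.ssh/id_rsa, when prompted:

    ssh-keygen -t rsa -P ""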
  7. Inside the generated .ssh directory, there is a file named id_rsa.pub that contains the public key. Append this public key of node1 to the authorized_keys file in the other Hadoop nodes (node2 to node5) by executing the following commands to copy the id_rsa.pub file into those nodes; the key is appended to authorized_keys in the steps that follow.

    scp id_rsa.pub hadoop@node2:/home/hadoop
    scp id_rsa.pub hadoop@node3:/home/hadoop
    scp id_rsa.pub hadoop@node4:/home/hadoop
    scp id_rsa.pub hadoop@node5:/home/hadoop
  8. Log in to the second Hadoop node's hadoop user account and establish an SSH connection to another node from it. Use the command: ssh hadoop@node3. It creates the .ssh directory in the hadoop account.
  9. Append the copied public key to the authorized_keys file in the hadoop account of node2. Execute the following commands.
    cat /home/hadoop/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys  
    chown hadoop:hadoop /home/hadoop/.ssh/authorized_keys 
    chmod 600 /home/hadoop/.ssh/authorized_keys  
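    Alternatively, where the standard OpenSSH ssh-copy-id utility is available (an assumption about your installation), the copy-and-append steps above can be done with a single command run from node1:

    ssh-copy-id hadoop@node2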
           
  10. Now you can SSH from node1 to node2 without a password prompt. Log in to the master node (node1). From the hadoop account, log in to node2 using either of the following commands:
    ssh -i id_rsa hadoop@192.168.4.101  
    ssh hadoop@192.168.4.101

    If you still cannot establish an SSH connection to node2 without a password, run the following commands on node2.  

    cd ~
    cd .ssh
    # authorized_keys must not be readable or writable by group/others
    chmod og-rw authorized_keys
    # and must not be executable by anyone
    chmod a-x authorized_keys
    cd ~
    # the .ssh directory must be accessible only by its owner
    chmod 700 .ssh
    cd /home
    # the home directory must not be accessible by group/others
    chmod go-wrx hadoop
  11. Carry out steps 4 to 6 on all the other nodes (node3 to node5) as well.

    Configuring the master node 

  12. Define JAVA_HOME in the <HADOOP_HOME>/conf/hadoop-env.sh file: export JAVA_HOME=/opt/java/jdk1.6.0_29. <HADOOP_HOME> refers to the path to the Hadoop installation directory throughout this guide.
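    For example, the line could be appended to the file as follows (a sketch, assuming the Hadoop installation lives at /home/hadoop/hadoop, the path used later in this guide):

    echo 'export JAVA_HOME=/opt/java/jdk1.6.0_29' >> /home/hadoop/hadoop/conf/hadoop-env.sh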
  13. Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:
     

     
    <configuration>
    	<property>
       		<name>fs.default.name</name>
       		<value>hdfs://node1:9000</value>
    	</property>
    
    	<property>
       		<name>fs.hdfs.impl</name>
       		<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
       		<description>The FileSystem for hdfs: uris.</description>
    	</property>
    
    	<property>
       		<name>hadoop.tmp.dir</name>
       		<value>/tmp/hadoop</value>
    	</property>
    </configuration>

     

  14. Edit the <HADOOP_HOME>/conf/hdfs-site.xml as follows:

    <configuration>
       <property>
           <name>dfs.replication</name>
           <value>1</value>
       </property>
       
       <property>
          <name>dfs.name.dir</name>
          <value><HADOOP_HOME>/dfs/name</value>
       </property>
    
       <property>
          <name>dfs.data.dir</name>
          <value><HADOOP_HOME>/dfs/data</value>
       </property>
    </configuration>
  15. Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:

    <configuration>
       <property>
          <name>mapred.job.tracker</name>
          <value>node1:9001</value>
       </property>
    
       <property>
      	  <name>mapred.system.dir</name>
      	  <value><HADOOP_HOME>/mapred/system</value>
       </property>
    </configuration>
  16. Edit the <HADOOP_HOME>/conf/masters file and enter the following into it:

    node2
  17. In node1 and node2, edit <HADOOP_HOME>/conf/hadoop-policy.xml as follows. This grants the hadoop user access to submit jobs to the Hadoop nodes.

    <property>
       <name>security.job.submission.protocol.acl</name>
       <value>hadoop</value>
    </property>
    Be sure to replace the <HADOOP_HOME> placeholder that appears in the above configurations with the actual path to the Hadoop installation directory.
  18. Edit <HADOOP_HOME>/conf/slaves as follows:
    node3
    node4
    node5

    Syncing Hadoop configurations across all nodes   

  19. Log in to the master Hadoop node's hadoop account. From the Hadoop installation directory, execute the command below in order to propagate the Hadoop configurations and binaries to node2: rsync -a -e ssh . hadoop@node2:/home/hadoop/hadoop.   
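    To propagate the same directory to all the other nodes in one go, a loop similar to the following could be used (a sketch, assuming the /home/hadoop/hadoop target path on every node; adjust the node names to match your cluster):

    for node in node2 node3 node4 node5; do
        rsync -a -e ssh . hadoop@${node}:/home/hadoop/hadoop
    done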
  20. Remove the entries from the <HADOOP_HOME>/conf/masters and <HADOOP_HOME>/conf/slaves files in node2.  
  21. Be sure to repeat steps 18 and 19 above on all the other nodes (node3 to node5).  
  22. From the master node's Hadoop installation directory, execute the following command to format the namenode: bin/hadoop namenode -format.  
  23. Start the cluster with the command: sh start-all.sh, executed from the <HADOOP_HOME>/bin directory of the master node. This starts the Hadoop daemons on all the nodes simultaneously.  
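    To verify that the cluster came up, you could run jps on each node (assuming a JDK is on the path); the master should list processes such as NameNode and JobTracker, and the slave nodes should list DataNode and TaskTracker. An HDFS status report, run from the master node's Hadoop installation directory, is another quick check:

    jps
    bin/hadoop dfsadmin -report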

    See here to view the most common issues that you might encounter when setting up a multi-node Hadoop cluster.