Wednesday, June 25, 2014

Hadoop 2.2.0 (YARN) Single Node Cluster (Pseudo-Distributed Operation) setup

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
For more information on Hadoop, see the Apache Hadoop website (https://hadoop.apache.org/).

In this tutorial, we will set up a single-node Hadoop cluster in pseudo-distributed mode, backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux.

Section 1: Prerequisites


  1. Installing Java v1.8
  2. Adding a dedicated Hadoop system user
  3. Configuring SSH access
  4. Disabling IPv6

A. Installing Java v1.8
  1. Completely remove the OpenJDK/JRE from your system and create a directory to hold your Oracle Java JDK/JRE binaries.
    • Check if you have Java installed on your system.
      java -version


    • If you have OpenJDK installed on your system, completely remove it using this:
      sudo apt-get purge openjdk-\*

    • If you have Oracle Java already installed on your system, you may skip the removal step.

    • Create Directory where you will install Oracle Java.
      sudo mkdir -p /usr/local/java

  2. To check whether your Ubuntu Linux operating system is 32-bit or 64-bit, open a terminal and run the following command:
    file /sbin/init

    • Note whether the output reports a 32-bit or 64-bit architecture, and download the matching JDK from the Oracle Java download page.
    • Important: 64-bit Oracle Java binaries do not work on a 32-bit Ubuntu Linux operating system; you will receive multiple system error messages if you attempt to install them.

  3. Copy the Oracle Java binaries into the /usr/local/java directory.
    cd /home/"your_user_name"/Downloads
    sudo cp -r jdk-8u5-linux-x64.tar.gz /usr/local/java/

  4. Unpack the compressed Java binaries in the directory /usr/local/java. Use the archive matching your architecture; the 64-bit file copied above is shown here:
    cd /usr/local/java
    sudo tar xvzf jdk-8u5-linux-x64.tar.gz
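
    To confirm the unpack succeeded, list the directory; with the 8u5 archive it should now contain a jdk1.8.0_05 folder:
    ls /usr/local/java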

  5. Edit the system PATH file /etc/profile and add the following system variables to your system path.
    sudo gedit /etc/profile

    Type/Copy/Paste this in the file:

    JAVA_HOME=/usr/local/java/jdk1.8.0_05
    PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
    export JAVA_HOME
    export PATH

    Now save and exit.

  6. (a). Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located.
    • Type/Copy/Paste:
      sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.8.0_05/bin/java" 1

      This command notifies the system that Oracle Java JRE is available for use.
    • Type/Copy/Paste:
      sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.8.0_05/bin/javac" 1

      This command notifies the system that Oracle Java JDK is available for use.
    • Type/Copy/Paste:
      sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/local/java/jdk1.8.0_05/bin/javaws" 1

      This command notifies the system that Oracle Java Web start is available for use.
    (b). Inform your Ubuntu Linux system that Oracle Java JDK/JRE must be the default Java.

    • Type/Copy/Paste:
      sudo update-alternatives --set java /usr/local/java/jdk1.8.0_05/bin/java

      This command will set the java runtime environment for the system.
    • Type/Copy/Paste:
      sudo update-alternatives --set javac /usr/local/java/jdk1.8.0_05/bin/javac

      This command will set the javac compiler for the system.
    • Type/Copy/Paste:
      sudo update-alternatives --set javaws /usr/local/java/jdk1.8.0_05/bin/javaws

      This command will set Java Web start for the system.
    • Reload your system-wide PATH /etc/profile by typing the following command:
      source /etc/profile
    • Note that /etc/profile is also reloaded automatically after a reboot of your Ubuntu Linux system.
    >> To check if the installation was successful, Type/Copy/Paste:
    java -version

    >> This command displays the version of java running on your system. You should receive a message which displays:
    java version "1.8.0_05"
    Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

    >> Now Type/Copy/Paste:
    javac -version

    This command lets you know that you are now able to compile Java programs from the terminal. You should receive a message which displays:
    javac 1.8.0_05
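
    >> As an optional sanity check, you can compile and run a trivial Java program. The HelloWorld file name below is only an illustration:
    gedit HelloWorld.java

    Type/Copy/Paste this in the file:

    // Minimal class used only to verify that javac and java work end to end
    public class HelloWorld {
        public static void main(String[] args) {
            System.out.println("Hello, Hadoop!");
        }
    }

    Save and exit, then compile and run it:

    javac HelloWorld.java
    java HelloWorld

    It should print: Hello, Hadoop!
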
B. Adding a dedicated Hadoop system user
  1. Adding the group:
    sudo addgroup hadoop
  2. Creating a user and adding the user to the group:
    sudo adduser --ingroup hadoop hduser
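
    To verify that the user and group were created, you can run this optional check; it should show hduser with hadoop among its groups:
    id hduser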

C. Configuring SSH access

  1. Before installing any new software, make sure the list of packages from all repositories and PPAs is up to date:
    sudo apt-get update

  2. Install the OpenSSH server. Type/Copy/Paste:
    sudo apt-get install openssh-server

  3. SSH key-based authentication is required so that the master node can log in to slave nodes (and the secondary node) to start and stop them, and also to the local machine itself. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

    Before this step, make sure that SSH is up and running on your machine and is configured to allow SSH public-key authentication.

    Generating an SSH key for the hduser user.

    a. Log in as hduser (for example: su - hduser)

    b. Run this Key generation command:
    ssh-keygen -t rsa -P ""

    c. It will ask for the file name in which to save the key; just press Enter so that the key is generated at /home/hduser/.ssh/

    d. Enable SSH access to your local machine with this newly created key.
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    e. The final step is to test the SSH setup by connecting to your local machine with the hduser user.
    ssh hduser@localhost

    The first connection will ask you to confirm the authenticity of localhost; answer yes, and localhost is added permanently to the list of known hosts.


D. Disabling IPv6

We need to disable IPv6 because Hadoop binds its services to 0.0.0.0, and on an IPv6-enabled Ubuntu system this can resolve to IPv6 addresses that Hadoop does not handle well. Run the following command with root privileges:

sudo gedit /etc/sysctl.conf

Add the following lines to the end of the file and reboot the machine to apply the configuration.

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
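
You can also apply the new settings without a reboot and then verify that IPv6 is disabled; a value of 1 from the check below means it is off:

sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6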

Section 2: Hadoop Installation


Download the Hadoop 2.2.0 stable release from the Apache Download Mirrors.
Move it to /usr/local, unpack it, and change the owner to hduser:
sudo mv /home/"your_user_name"/Downloads/hadoop-2.2.0.tar.gz /usr/local/
cd /usr/local
sudo tar xzf hadoop-2.2.0.tar.gz
sudo mv hadoop-2.2.0 hadoop
sudo chown -R hduser:hadoop hadoop
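
As a quick check that the files are in place, print the Hadoop version:
/usr/local/hadoop/bin/hadoop version

It should report Hadoop 2.2.0.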

  1. Configuring Hadoop:

    The following files need to be configured to get the single-node Hadoop cluster working:
    1. yarn-site.xml
    2. core-site.xml
    3. mapred-site.xml
    4. hdfs-site.xml


    A. YARN-SITE.XML:
    Open this file using the command below:

    sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

    Copy and paste the following between the <configuration> ... </configuration> tags:
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>


    B. CORE-SITE.XML
    Open this file using the command below:

    sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

    Copy and paste the following between the <configuration> ... </configuration> tags (fs.default.name is the deprecated alias of fs.defaultFS, so only the current name is set here):
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>
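
    Once the environment variables from the .bashrc step below are set, you can optionally verify which filesystem URI is in effect:
    /usr/local/hadoop/bin/hdfs getconf -confKey fs.defaultFS

    It should print hdfs://localhost:9000.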


    C. MAPRED-SITE.XML
    If this file does not exist, copy mapred-site.xml.template to mapred-site.xml. Open the file using the command below:

    sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml

    Copy and paste the following between the <configuration> ... </configuration> tags (under YARN, mapreduce.framework.name replaces the old Hadoop 1 mapred.job.tracker setting):
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>


    D. HDFS-SITE.XML
    Create two directories to be used by the namenode and the datanode, and make hduser their owner so the daemons can write to them:
    sudo mkdir -p /app/hadoop/tmp/dfs/name
    sudo mkdir -p /app/hadoop/tmp/dfs/data
    sudo chown -R hduser:hadoop /app/hadoop/tmp

    Open the hdfs-site.xml file using the command below:

    sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

    Copy and paste the following between the <configuration> ... </configuration> tags:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/app/hadoop/tmp/dfs/name</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/app/hadoop/tmp/dfs/data</value>
    </property>


  2. Update $HOME/.bashrc

    su - hduser
    cd $HOME
    ls -la ~/ | more
    gedit ~/.bashrc


    Copy and paste the following at the end of the file:

    # Set Hadoop-related environment variables
    export HADOOP_PREFIX=/usr/local/hadoop
    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_MAPRED_HOME=${HADOOP_HOME}
    export HADOOP_COMMON_HOME=${HADOOP_HOME}
    export HADOOP_HDFS_HOME=${HADOOP_HOME}
    export YARN_HOME=${HADOOP_HOME}
    export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
    # Native Path
    export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
    #Java path
    export JAVA_HOME=/usr/local/java/jdk1.8.0_05
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
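
    Reload .bashrc so that the new variables take effect in your current shell, and verify:

    source ~/.bashrc
    echo $HADOOP_HOME
    hadoop version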


  3. Formatting and Starting the HDFS filesystem via the NameNode:

    The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster. You need to do this only the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all data currently stored in HDFS.
    /usr/local/hadoop/bin/hadoop namenode -format
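
    On success, the output should end with a message along the lines of "Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted."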


    Start Hadoop Daemons by running the following commands:
    /usr/local/hadoop/sbin/start-all.sh

    OR you can start the required services separately, as below:

    Name node:
    /usr/local/hadoop/sbin/hadoop-daemon.sh start namenode

    Data node:
    /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode

    Resource Manager:
    /usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager

    Node Manager:
    /usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager

    Job History Server:
    /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver

    When you start the Hadoop services, each daemon prints a startup message to the terminal and writes its logs under /usr/local/hadoop/logs.

    You can check the daemon process IDs by running jps:
    jps
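
    If all the daemons started correctly, the output should look something like this (the process IDs will differ on your machine):

    22110 NameNode
    22455 DataNode
    22874 ResourceManager
    23120 NodeManager
    23401 JobHistoryServer
    23512 Jps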
    OR you can check TCP and port details by using,
    sudo netstat -plten | grep java
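
    Each daemon should appear as a java process listening on its configured port, for example 9000 for the namenode RPC address and 50070 for the namenode web interface.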

  4. Hadoop Web Interfaces:
    Hadoop comes with several web interfaces which are by default available at these locations:

    HDFS NameNode status and health: http://localhost:50070
    HDFS Secondary NameNode status: http://localhost:50090
    YARN ResourceManager (cluster and application status): http://localhost:8088
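
    If you prefer the terminal, you can confirm the NameNode web UI is reachable (assuming curl is installed):
    curl -s http://localhost:50070 | head -n 5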

  5. Stopping Hadoop

    Stop Hadoop by running the following command:
    /usr/local/hadoop/sbin/stop-all.sh

    OR
    Stop individual services with the following commands:
    /usr/local/hadoop/sbin/stop-dfs.sh
    /usr/local/hadoop/sbin/stop-yarn.sh
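
    After stopping, jps should list only the Jps process itself:
    jps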