Hadoop: Hadoop 2.2.0 (YARN) Single Node Cluster (Pseudo-Distributed Operation) setup

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.

In this tutorial, we will set up Hadoop in pseudo-distributed mode: a single-node cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux.

Section 1: Prerequisites:

Installing Java v1.8
Adding dedicated Hadoop system user.
Configuring SSH access.
Disabling IPv6.

A. Installing Java v1.8

Completely remove the OpenJDK/JRE from your system and create a directory to hold your Oracle Java JDK/JRE binaries.
- Check if you have Java installed on your system.
  
  java -version
- If OpenJDK is installed on your system, remove it completely with:
  
  sudo apt-get purge openjdk-\*
- If Oracle Java is already installed on your system, you may skip the rest of this Java installation part.
- Create the directory where you will install Oracle Java.
  
  sudo mkdir -p /usr/local/java
To check whether your Ubuntu Linux operating system is 32-bit or 64-bit, open a terminal and run the following command.

file /sbin/init
- Note whether the output reports a 32-bit or a 64-bit architecture, then download the matching Oracle JDK archive (jdk-8u5-linux-i586.tar.gz for 32-bit, jdk-8u5-linux-x64.tar.gz for 64-bit) from the Oracle Java download page.
- Important: 64-bit Oracle Java binaries do not work on 32-bit Ubuntu Linux operating systems. If you attempt to install 64-bit Oracle Java on 32-bit Ubuntu Linux, you will receive multiple system error messages.
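
The architecture check above can also be scripted. The sketch below maps a machine type string to the suffix used in the JDK archive names from this tutorial. The function name jdk_arch_suffix is hypothetical, and using uname -m in place of file /sbin/init is an assumption made here because its output is easier to parse.

```shell
# Map a machine type (as printed by `uname -m`) to the suffix used in
# the Oracle JDK archive names from this tutorial.
jdk_arch_suffix() {
  case "$1" in
    x86_64|amd64) echo "x64" ;;
    i386|i486|i586|i686) echo "i586" ;;
    *) echo "unknown" ;;
  esac
}

# Print the archive name that matches this machine.
echo "jdk-8u5-linux-$(jdk_arch_suffix "$(uname -m)").tar.gz"
```

Note that uname -m reports the kernel architecture, so on unusual setups (a 32-bit userland on a 64-bit kernel) the output of file /sbin/init remains the safer guide.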

Copy the Oracle Java binaries into the /usr/local/java directory.

cd /home/"your_user_name"/Downloads
sudo cp -r jdk-8u5-linux-x64.tar.gz /usr/local/java/

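Unpack the compressed Java binaries in the /usr/local/java directory. The commands here use the 64-bit archive; on a 32-bit system, use jdk-8u5-linux-i586.tar.gz in both the copy and the unpack command.

cd /usr/local/java
sudo tar xvzf jdk-8u5-linux-x64.tar.gz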

Edit the system PATH file /etc/profile and add the following system variables to your system path.

sudo gedit /etc/profile

Type/Copy/Paste this in the file:

JAVA_HOME=/usr/local/java/jdk1.8.0_05
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH

Now save and exit.

(a). Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located.
- Type/Copy/Paste:
  
  sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.8.0_05/bin/java" 1
  
  This command notifies the system that Oracle Java JRE is available for use.
- Type/Copy/Paste:
  
  sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.8.0_05/bin/javac" 1
  
  This command notifies the system that Oracle Java JDK is available for use.
- Type/Copy/Paste:
  
  sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/local/java/jdk1.8.0_05/bin/javaws" 1
  
  This command notifies the system that Oracle Java Web start is available for use.
(b). Inform your Ubuntu Linux system that Oracle Java JDK/JRE must be the default Java.
- Type/Copy/Paste:
  
  sudo update-alternatives --set java /usr/local/java/jdk1.8.0_05/bin/java
  
  This command will set the java runtime environment for the system.
- Type/Copy/Paste:
  
  sudo update-alternatives --set javac /usr/local/java/jdk1.8.0_05/bin/javac
  
  This command will set the javac compiler for the system.
- Type/Copy/Paste:
  
  sudo update-alternatives --set javaws /usr/local/java/jdk1.8.0_05/bin/javaws
  
  This command will set Java Web start for the system.
- Reload the system-wide /etc/profile file. Type/Copy/Paste:
  
  source /etc/profile
- Note that the system-wide /etc/profile file is also reloaded automatically when you reboot your Ubuntu Linux system.
>> To check if installation is successful, Type/Copy/Paste:

java -version

>> This command displays the version of java running on your system. You should receive a message which displays:

java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

>> Now Type/Copy/Paste:

javac -version

This command lets you know that you are now able to compile Java programs from the terminal. You should receive a message which displays:

javac 1.8.0_05
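
Besides the two version checks, it can help to confirm that JAVA_HOME really points at a JDK, since Hadoop reads this variable later. The following is a minimal sketch; the function name valid_jdk_home is hypothetical and the check only looks for executable bin/java and bin/javac files.

```shell
# Succeeds when the given directory looks like a JDK home, that is,
# it contains executable bin/java and bin/javac files.
valid_jdk_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ] && [ -x "$1/bin/javac" ]
}

if valid_jdk_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks valid: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or does not point to a JDK"
fi
```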

B. Adding dedicated Hadoop system user

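Add the group. Use the lowercase name hadoop, because group names are case-sensitive and the later chown command refers to hduser:hadoop:

sudo addgroup hadoop

Create the user and add it to the group:

sudo adduser --ingroup hadoop hduser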
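
To confirm that the user and group were created as intended, a small check like the one below can be used. It is only a sketch: has_group is a hypothetical helper that reads the group list printed by id -nG, and the names hduser and hadoop are the ones used in this tutorial.

```shell
# Read a space-separated list of group names on stdin (the format
# printed by `id -nG`) and succeed if the named group is in it.
has_group() {
  tr ' ' '\n' | grep -qxF "$1"
}

if id hduser >/dev/null 2>&1; then
  if id -nG hduser | has_group hadoop; then
    echo "hduser is a member of hadoop"
  else
    echo "hduser exists but is not in the hadoop group"
  fi
else
  echo "hduser has not been created yet"
fi
```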

C. Configuring SSH access

Before installing any software, make sure your list of packages from all repositories and PPAs is up to date:

sudo apt-get update

Then install the OpenSSH server. Type/Copy/Paste:

sudo apt-get install openssh-server

Hadoop requires SSH key-based authentication so that the master node can log in to the slave nodes (and the secondary node) to start and stop them, and also to the local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

Before this step, make sure that SSH is up and running on your machine and that it is configured to allow SSH public key authentication.

Generating an SSH key for the hduser user.

a. Log in as hduser, for example with sudo su - hduser

b. Run this Key generation command:

ssh-keygen -t rsa -P ""

c. It will ask for the file name in which to save the key. Just press Enter so that it generates the key in '/home/hduser/.ssh'

d. Enable SSH access to your local machine with this newly created key.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

e. The final step is to test the SSH setup by connecting to your local machine with the hduser user.

ssh hduser@localhost

The first time you connect, SSH asks you to confirm the host key. Answering yes adds localhost permanently to the list of known hosts.
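
If the key-based login still prompts for a password, overly loose permissions on the .ssh directory are a common cause, because sshd by default ignores keys in directories that other users can write to. The sketch below tightens them; secure_ssh_dir is a hypothetical helper name, and the modes 700 and 600 are the conventional values rather than something this tutorial prescribes.

```shell
# Tighten permissions on an SSH directory and its authorized_keys file.
secure_ssh_dir() {
  mkdir -p "$1" && chmod 700 "$1" || return 1
  if [ -f "$1/authorized_keys" ]; then
    chmod 600 "$1/authorized_keys"
  fi
}

# Example, run as hduser after the first interactive login:
#   secure_ssh_dir "$HOME/.ssh"
#   ssh -o BatchMode=yes hduser@localhost true && echo "key login works"
```

BatchMode=yes makes ssh fail instead of prompting, so the second command succeeds only when the key is actually accepted.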

D. Disabling IPv6

We need to disable IPv6 because several Hadoop configuration options use the address 0.0.0.0, and on Ubuntu this can make Hadoop bind to the machine's IPv6 addresses. You need root privileges for the following command:

sudo gedit /etc/sysctl.conf

Add the following lines to the end of the file, then reboot the machine so that the new settings take effect.

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
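
After the reboot (or after reloading the file with sudo sysctl -p), you can verify that the setting took effect. The sketch below reads the kernel flag directly; ipv6_disabled is a hypothetical helper, and its optional argument exists only so that the check can be pointed at a different file.

```shell
# Succeed if the kernel reports IPv6 as disabled on all interfaces.
# $1 optionally overrides the flag file to read.
ipv6_disabled() {
  flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
  [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if ipv6_disabled; then
  echo "IPv6 is disabled"
else
  echo "IPv6 is still enabled (or the flag file is missing)"
fi
```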

Section 2: Hadoop Installation:

Download the Hadoop 2.2.0 release (hadoop-2.2.0.tar.gz) from the Apache Download Mirrors.
Move the archive to /usr/local, extract it there, and change the owner to hduser.

cd /usr/local
sudo tar xzf hadoop-2.2.0.tar.gz
sudo mv hadoop-2.2.0 hadoop
sudo chown -R hduser:hadoop hadoop

Configuring Hadoop:

The following files, all located in /usr/local/hadoop/etc/hadoop, must be edited to configure the single-node Hadoop cluster.

yarn-site.xml
core-site.xml
mapred-site.xml
hdfs-site.xml

A. YARN-SITE.XML
Open this file with the following command:

sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

Copy and paste the following between the <configuration> and </configuration> tags. Note that the class property name must contain the same service name, mapreduce_shuffle, that is set in the first property:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

B. CORE-SITE.XML
Open this file with the following command:

sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

Copy and paste the following between the <configuration> and </configuration> tags. Only fs.defaultFS is needed: fs.default.name is its deprecated name, and setting the two to different ports makes it unclear which address the NameNode uses.

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

C. MAPRED-SITE.XML
If this file does not exist, copy mapred-site.xml.template to mapred-site.xml. Open this file with the following command:

sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml

Copy and paste the following between the <configuration> and </configuration> tags. With YARN, MapReduce jobs are submitted to the ResourceManager, so the framework name must be set to yarn; the MRv1 property mapred.job.tracker is not used.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

D. HDFS-SITE.XML
Create two directories to be used by the NameNode and the DataNode, and make hduser their owner so that the Hadoop daemons can write to them.

sudo mkdir -p /app/hadoop/tmp/dfs/name
sudo mkdir -p /app/hadoop/tmp/dfs/data
sudo chown -R hduser:hadoop /app/hadoop/tmp

Open the hdfs-site.xml file with the following command:

sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Copy and paste the following between the <configuration> and </configuration> tags:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/app/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/app/hadoop/tmp/dfs/data</value>
</property>
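
Once the four files are edited, it is easy to lose track of which value ended up where. The sketch below reads a property value back out of a Hadoop configuration file. The helper name hadoop_conf_value is hypothetical, and the parsing assumes the simple layout used in this tutorial, with each name and value tag on a single line; it is not a general XML parser.

```shell
# Print the value of property $2 from the Hadoop XML config file $1.
# Assumes each <name> and <value> element sits on a single line.
hadoop_conf_value() {
  awk -v key="$2" '
    /<name>/ {
      name = $0
      sub(/.*<name>[ \t]*/, "", name)
      sub(/[ \t]*<\/name>.*/, "", name)
    }
    /<value>/ && name == key {
      value = $0
      sub(/.*<value>[ \t]*/, "", value)
      sub(/[ \t]*<\/value>.*/, "", value)
      print value
      exit
    }
  ' "$1"
}

CONF_DIR=/usr/local/hadoop/etc/hadoop
if [ -f "$CONF_DIR/hdfs-site.xml" ]; then
  echo "dfs.replication = $(hadoop_conf_value "$CONF_DIR/hdfs-site.xml" dfs.replication)"
fi
```

An empty result means that the property is not set in that file, which is a quick way to spot a typo in a property name.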

Update $HOME/.bashrc

su - hduser
cd $HOME
ls -la ~/ | more
gedit ~/.bashrc

Copy and paste the following at the end of this file. You are editing hduser's own file here, so sudo is not needed.

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
#Java path
export JAVA_HOME=/usr/local/java/jdk1.8.0_05
# Add Hadoop bin/ and sbin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
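
After saving the file and opening a new shell as hduser (or running source ~/.bashrc), you can check that the Hadoop directories really are on the PATH. This is a small sketch; path_has is a hypothetical helper, and its second argument defaults to the current PATH.

```shell
# Succeed if directory $1 is an entry of the colon-separated list $2
# (defaults to the current PATH).
path_has() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

for dir in /usr/local/hadoop/bin /usr/local/hadoop/sbin; do
  if path_has "$dir"; then
    echo "on PATH: $dir"
  else
    echo "missing from PATH: $dir"
  fi
done
```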

Formatting and Starting the HDFS filesystem via the NameNode:

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your cluster. You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS).

/usr/local/hadoop/bin/hdfs namenode -format
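
Because formatting destroys existing HDFS metadata, it can be worth checking first whether the NameNode directory has already been formatted. The sketch below relies on the fact that a formatted NameNode directory contains a current/VERSION file; namenode_formatted is a hypothetical helper, and NAME_DIR is the dfs.namenode.name.dir path used in this tutorial.

```shell
# Succeed if the NameNode directory $1 already holds a formatted
# filesystem image (marked by a current/VERSION file).
namenode_formatted() {
  [ -f "$1/current/VERSION" ]
}

NAME_DIR=/app/hadoop/tmp/dfs/name
if namenode_formatted "$NAME_DIR"; then
  echo "NameNode already formatted; skip the format step"
else
  echo "NameNode not formatted yet; it is safe to format"
fi
```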

Start Hadoop Daemons by running the following commands:

/usr/local/hadoop/sbin/start-all.sh

OR you can separately start required services as below:

Name node:

/usr/local/hadoop/sbin/hadoop-daemon.sh start namenode

Data node:

/usr/local/hadoop/sbin/hadoop-daemon.sh start datanode

Resource Manager:

/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager

Node Manager:

/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager

Job History Server:

/usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver

When you start the Hadoop services, each script reports the daemons it is starting and the log files they write to.

You can check the process IDs of the running daemons by using jps:

jps

OR you can check TCP and port details by using,

sudo netstat -plten | grep java
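
To turn the jps check into a quick pass or fail test, the sketch below lists any expected daemon that is not running. The helper name missing_daemons is hypothetical, and the list of daemons is an assumption about what this single-node setup starts (the Job History Server is left out because it is started separately).

```shell
# Read `jps` output on stdin ("<pid> <name>" per line) and print the
# name of every expected Hadoop daemon that is not listed.
missing_daemons() {
  running=$(awk '{print $2}')
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    printf '%s\n' "$running" | grep -qxF "$d" || echo "$d"
  done
}

if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
else
  echo "jps not found; is the JDK bin directory on your PATH?"
fi
```

No output from missing_daemons means that all five daemons are up.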

Hadoop Web Interfaces:
Hadoop comes with several web interfaces, which are available by default at these locations:

HDFS NameNode status and health: http://localhost:50070
HDFS Secondary NameNode status: http://localhost:50090
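
The web interfaces can also be probed from the terminal. The sketch below is one way to do it, assuming the Hadoop 2.x default ports: 50070 for the NameNode, 50090 for the Secondary NameNode, 8088 for the ResourceManager, and 19888 for the Job History Server. The helper name ui_url is hypothetical, and the ports will differ if you changed them in the configuration files.

```shell
# Print the default web UI address of a Hadoop 2.x daemon.
ui_url() {
  case "$1" in
    namenode) echo "http://localhost:50070" ;;
    secondarynamenode) echo "http://localhost:50090" ;;
    resourcemanager) echo "http://localhost:8088" ;;
    jobhistory) echo "http://localhost:19888" ;;
    *) return 1 ;;
  esac
}

# Probe each UI, if curl is installed.
if command -v curl >/dev/null 2>&1; then
  for daemon in namenode secondarynamenode resourcemanager jobhistory; do
    url=$(ui_url "$daemon")
    if curl -s -o /dev/null --max-time 3 "$url"; then
      echo "$daemon UI is up: $url"
    else
      echo "$daemon UI is not reachable: $url"
    fi
  done
fi
```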

Stopping Hadoop

Stop Hadoop by running the following command:

/usr/local/hadoop/sbin/stop-all.sh

OR
Stop HDFS and YARN separately with the following commands:

/usr/local/hadoop/sbin/stop-dfs.sh
/usr/local/hadoop/sbin/stop-yarn.sh

Wednesday, June 25, 2014
