Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
For more info on Hadoop, Click here
In this tutorial, we will set up Hadoop in pseudo-distributed mode: a single-node cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux.
Section 1: Prerequisites:
- Installing Java v1.8
- Adding dedicated Hadoop system user.
- Configuring SSH access.
- Disabling IPv6.
A. Installing Java v1.8
- Completely remove OpenJDK/JRE from your system and create a directory to hold your Oracle Java JDK/JRE binaries.
- Check whether Java is already installed on your system:
java -version
- If OpenJDK is installed on your system, completely remove it using this:
sudo apt-get purge openjdk-\*
If Oracle Java is already installed on your system, you may skip this first part.
- Create the directory where you will install Oracle Java:
sudo mkdir -p /usr/local/java
- Check whether your Ubuntu Linux operating system architecture is 32-bit or 64-bit. Open a terminal and run the following command:
file /sbin/init
Note the bit version that the command reports (32-bit or 64-bit).
- Download the matching 32-bit or 64-bit JDK from here.
- Important Information: 64-bit Oracle Java binaries do not work on 32-bit Ubuntu Linux operating systems. You will receive multiple system error messages if you attempt to install 64-bit Oracle Java on 32-bit Ubuntu Linux.
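As a rough sketch of the architecture check above, the following script maps the machine architecture to the suffix used in the JDK archive names in this tutorial (x64 or i586). The helper name jdk_suffix is hypothetical, and using uname -m instead of file /sbin/init is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Map a machine architecture string (as printed by `uname -m`) to the
# suffix used in JDK archive names such as jdk-8u5-linux-x64.tar.gz.
# jdk_suffix is a hypothetical helper, not part of any tool.
jdk_suffix() {
  case "$1" in
    x86_64|amd64)        echo "x64" ;;
    i386|i486|i586|i686) echo "i586" ;;
    *)                   echo "unknown" ;;
  esac
}

arch="$(uname -m)"
echo "Architecture: ${arch}; JDK archive suffix: $(jdk_suffix "${arch}")"
```

If the script prints "unknown", the machine is neither 32-bit nor 64-bit x86, and the archive names used in this tutorial do not apply.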
- Copy the Oracle Java binaries into the /usr/local/java directory. The commands below use the 64-bit archive; if you downloaded the 32-bit JDK, use jdk-8u5-linux-i586.tar.gz instead.
cd /home/"your_user_name"/Downloads
sudo cp -r jdk-8u5-linux-x64.tar.gz /usr/local/java/
- Unpack the compressed Java binaries in the directory /usr/local/java.
cd /usr/local/java
sudo tar xvzf jdk-8u5-linux-x64.tar.gz
- Edit the system PATH file /etc/profile and add the following system variables to your system path.
sudo gedit /etc/profile
Type/Copy/Paste this in the file:
JAVA_HOME=/usr/local/java/jdk1.8.0_05
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
Now save and exit.
- Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located.
- Type/Copy/Paste:
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.8.0_05/bin/java" 1
This command notifies the system that Oracle Java JRE is available for use.
- Type/Copy/Paste:
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.8.0_05/bin/javac" 1
This command notifies the system that Oracle Java JDK is available for use.
- Type/Copy/Paste:
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/local/java/jdk1.8.0_05/bin/javaws" 1
This command notifies the system that Oracle Java Web Start is available for use.
- Type/Copy/Paste:
sudo update-alternatives --set java /usr/local/java/jdk1.8.0_05/bin/java
This command will set the Java runtime environment for the system.
- Type/Copy/Paste:
sudo update-alternatives --set javac /usr/local/java/jdk1.8.0_05/bin/javac
This command will set the javac compiler for the system.
- Type/Copy/Paste:
sudo update-alternatives --set javaws /usr/local/java/jdk1.8.0_05/bin/javaws
This command will set Java Web Start for the system.
- Reload your system-wide PATH file /etc/profile by typing the following command:
source /etc/profile
Note that /etc/profile is also reloaded automatically after a reboot of your Ubuntu Linux system.
- Type/Copy/Paste:
java -version
This command displays the version of Java running on your system. You should receive a message similar to:
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) Server VM (build 25.5-b02, mixed mode)
- Type/Copy/Paste:
javac -version
This command lets you know that you are now able to compile Java programs from the terminal. You should receive a message which displays:
javac 1.8.0_05
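The version check can also be scripted. The sketch below assumes the expected version string is 1.8.0_05, as used throughout this tutorial; the helper name parse_java_version is hypothetical. Note that java -version writes its output to standard error, so it has to be redirected before it can be parsed.

```shell
#!/usr/bin/env bash
# Extract the quoted version string from `java -version` style output
# read on stdin, e.g.  java version "1.8.0_05"  ->  1.8.0_05
# parse_java_version is a hypothetical helper name.
parse_java_version() {
  sed -n 's/^[a-z]* version "\([^"]*\)".*/\1/p' | head -n 1
}

expected="1.8.0_05"   # version installed in this tutorial (assumption)

if command -v java >/dev/null 2>&1; then
  # java -version prints to stderr, hence the 2>&1
  actual="$(java -version 2>&1 | parse_java_version)"
  if [ "$actual" = "$expected" ]; then
    echo "Java $actual is active"
  else
    echo "Expected Java $expected but found ${actual:-nothing}"
  fi
else
  echo "java was not found on the PATH"
fi
```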
B. Adding dedicated Hadoop system user
- Adding a group:
sudo addgroup hadoop
- Creating a user and adding the user to the group:
sudo adduser --ingroup hadoop hduser
C. Configuring SSH access
- Before installing any applications or software, make sure that your list of packages from all repositories and PPAs is up to date. If it is not, update it by using this command:
sudo apt-get update
- Type/Copy/Paste:
sudo apt-get install openssh-server
- SSH key-based authentication is required so that the master node can log in to the slave nodes (and the secondary node) to start and stop them, and also to the local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
Before this step, make sure that SSH is up and running on your machine and that it is configured to allow SSH public key authentication.
Generating an SSH key for the hduser user:
a. Log in as hduser (for example, with sudo su - hduser).
b. Run this key generation command:
ssh-keygen -t rsa -P ""
c. When asked for the file name in which to save the key, just press Enter so that the key is generated in /home/hduser/.ssh.
d. Enable SSH access to your local machine with this newly created key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
e. The final step is to test the SSH setup by connecting to your local machine with the hduser user.
ssh hduser@localhost
This will add localhost permanently to the list of known hosts.
You should get a result similar to the one in the picture below.
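Running the cat command from step d more than once appends the same key again. A minimal sketch of an idempotent variant is shown below; authorize_key is a hypothetical helper name, and the 700/600 permissions are the ones OpenSSH conventionally expects on the .ssh directory and the authorized_keys file.

```shell
#!/usr/bin/env bash
# authorize_key PUBKEY_FILE AUTH_FILE
# Append the public key to AUTH_FILE only if that exact line is not
# already present, then restrict permissions on the file and its directory.
# authorize_key is a hypothetical helper name.
authorize_key() {
  local pub="$1" auth="$2"
  mkdir -p "$(dirname "$auth")"
  touch "$auth"
  # -x: whole-line match, -F: fixed string, -f: read the pattern from the key file
  if ! grep -qxF -f "$pub" "$auth"; then
    cat "$pub" >> "$auth"
  fi
  chmod 700 "$(dirname "$auth")"
  chmod 600 "$auth"
}

# Usage as hduser (not executed here):
#   authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
```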
D. Disabling IPv6
We need to disable IPv6 because, on Ubuntu, using the 0.0.0.0 address in various Hadoop networking configurations can result in Hadoop binding to IPv6 addresses. Open the following file using a root account:
sudo gedit /etc/sysctl.conf
Add the following lines to the end of the file and reboot the machine so that the configuration takes effect.
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
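After the reboot, the setting can be verified by reading the kernel flag that corresponds to net.ipv6.conf.all.disable_ipv6. The sketch below is only an illustration; ipv6_status is a hypothetical helper name, and the optional argument exists so the function can be pointed at a different file.

```shell
#!/usr/bin/env bash
# ipv6_status [FLAG_FILE]
# Print "disabled" if the flag file contains 1, "enabled" if it contains
# anything else, and "unknown" if the file cannot be read.
# ipv6_status is a hypothetical helper name.
ipv6_status() {
  local flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
  if [ ! -r "$flag_file" ]; then
    echo "unknown"
  elif [ "$(cat "$flag_file")" = "1" ]; then
    echo "disabled"
  else
    echo "enabled"
  fi
}

echo "IPv6 is $(ipv6_status)"
```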
Section 2: Hadoop Installation:
Download the stable version of Hadoop from the Apache Download Mirrors.
Move it to /usr/local, unpack it, and change the owner to hduser.
cd /usr/local
sudo tar xzf hadoop-2.2.0.tar.gz
sudo mv hadoop-2.2.0 hadoop
sudo chown -R hduser:hadoop hadoop
- Configuring Hadoop:
The following files are required to configure the single-node Hadoop cluster:
- yarn-site.xml
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
A. YARN-SITE.XML:
Open this file using the command below:
sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml
Copy and paste the code below between the <configuration> ... </configuration> tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
B. CORE-SITE.XML
Open this file using the command below:
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
Copy and paste the code below between the <configuration> ... </configuration> tags. Note that fs.default.name is the deprecated name for fs.defaultFS, so only one of them needs to be set:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
C. MAPRED-SITE.XML
If this file does not exist, copy mapred-site.xml.template as mapred-site.xml. Open this file using the command below:
sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml
Copy and paste the code below between the <configuration> ... </configuration> tags:
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
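The "copy the template if the file does not exist" step for mapred-site.xml can be sketched as a small function. ensure_mapred_site is a hypothetical helper name; for the root-owned /usr/local/hadoop/etc/hadoop directory used in this tutorial, the function would have to be run with sufficient permissions.

```shell
#!/usr/bin/env bash
# ensure_mapred_site CONF_DIR
# Create mapred-site.xml from mapred-site.xml.template, but never
# overwrite an existing mapred-site.xml.
# ensure_mapred_site is a hypothetical helper name.
ensure_mapred_site() {
  local conf_dir="$1"
  if [ ! -f "$conf_dir/mapred-site.xml" ]; then
    cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
  fi
}

# Usage (not executed here):
#   ensure_mapred_site /usr/local/hadoop/etc/hadoop
```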
D. HDFS-SITE.XML
Create two directories to be used by the NameNode and DataNode, and make hduser their owner so that the Hadoop daemons can write to them:
sudo mkdir -p /app/hadoop/tmp/dfs/name
sudo mkdir -p /app/hadoop/tmp/dfs/data
sudo chown -R hduser:hadoop /app/hadoop/tmp
Open the hdfs-site.xml file using the command below:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Copy and paste the code below between the <configuration> ... </configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/app/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/app/hadoop/tmp/dfs/data</value>
</property>
- Update $HOME/.bashrc
su - hduser
cd $HOME
ls -la ~/ | more
sudo gedit ~/.bashrc
Copy and paste the code below at the end of this file:
# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
# Java path
export JAVA_HOME='/usr/local/java/jdk1.8.0_05'
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
- Formatting and Starting the HDFS filesystem via the NameNode:
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your cluster. You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS).
/usr/local/hadoop/bin/hadoop namenode -format
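Because formatting destroys existing HDFS data, it can help to check first whether the NameNode directory has already been formatted. The sketch below relies on the fact that a formatted NameNode directory contains a current/VERSION file; is_formatted is a hypothetical helper name, and the path is the dfs.namenode.name.dir value used in this tutorial.

```shell
#!/usr/bin/env bash
# is_formatted NAME_DIR
# Succeeds if NAME_DIR looks like a formatted NameNode directory,
# i.e. it contains current/VERSION. is_formatted is a hypothetical helper name.
is_formatted() {
  [ -f "$1/current/VERSION" ]
}

name_dir="/app/hadoop/tmp/dfs/name"   # dfs.namenode.name.dir from hdfs-site.xml

if is_formatted "$name_dir"; then
  echo "NameNode directory is already formatted; skip the format step to keep existing data."
else
  echo "NameNode directory is not formatted yet; the format command can be run."
fi
```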
Start the Hadoop daemons by running the following command:
/usr/local/hadoop/sbin/start-all.sh
OR you can separately start required services as below:
Name node:
/usr/local/hadoop/sbin/hadoop-daemon.sh start namenode
Data node:
/usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
Resource Manager:
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
Node Manager:
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
Job History Server:
/usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
When you start the Hadoop services, you will see something like this:
You can check the process IDs by using jps as shown below:
OR you can check TCP and port details by using:
sudo netstat -plten | grep java
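The jps check can be automated with a small filter that reports which daemons are missing. This is only a sketch: missing_daemons is a hypothetical helper name, and the list of expected daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) is an assumption about what start-all.sh starts on this single-node setup.

```shell
#!/usr/bin/env bash
# missing_daemons
# Read `jps` output ("<pid> <name>" per line) on stdin and print the name
# of every expected Hadoop daemon that does not appear in it.
# missing_daemons is a hypothetical helper name; the expected list is an assumption.
missing_daemons() {
  local output expected name
  output="$(cat)"
  expected="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"
  for name in $expected; do
    # Compare the second column exactly, so that "NameNode" is not
    # satisfied by a "SecondaryNameNode" line.
    if ! printf '%s\n' "$output" |
         awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      echo "$name"
    fi
  done
}

# Usage (not executed here):
#   jps | missing_daemons
```

An empty result means that all expected daemons are running.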
- Hadoop Web Interfaces:
Hadoop comes with several web interfaces which are by default available at these locations:
HDFS NameNode status and health: http://localhost:50070
HDFS Secondary NameNode status: http://localhost:50090
- Stopping Hadoop
Stop Hadoop by running the following command:
/usr/local/hadoop/sbin/stop-all.sh
OR
Stop individual services by using the following commands:
/usr/local/hadoop/sbin/stop-dfs.sh
/usr/local/hadoop/sbin/stop-yarn.sh