Setting up Apache Hadoop on RHEL6/CentOS 6

Setting up Apache Hadoop on RHEL6/CentOS 6 is simple: the recent availability of RPMs for Apache Hadoop makes it much easier to set up a basic Hadoop cluster. This lets you focus on how to use the features instead of having to learn how they were implemented.

These instructions DO NOT tune Hadoop settings to make Hadoop fast, but they will get you running a Hadoop cluster fast. We will leave Hadoop optimization for another day.

Download the software and check that you have the Java JDK installed

Check to see if you have a Java JDK installed. Use the command: javac -version. You should see something like this:

   javac 1.7.0

If you don't have a Java JDK installed, see my article here.
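As a sketch, installing a downloaded Oracle JDK RPM looks something like this (the file name is a placeholder for whichever JDK version you downloaded; the Oracle RPMs install under /usr/java):

% rpm -i jdk-<version>-linux-i586.rpm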

Next, download the Apache Hadoop 0.20.204.0 RPM from the Apache mirrors. You should also download the Apache project's public KEYS file and the RPM's .asc signature file from the mirror, then issue the following commands to verify the keys and the signature of the downloaded RPM.
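The downloads might look something like this (the archive URL below is an assumption; pick a mirror from the Apache download page):

% wget http://archive.apache.org/dist/hadoop/common/KEYS

% wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.204.0/hadoop-0.20.204.0-1.i386.rpm

% wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.204.0/hadoop-0.20.204.0-1.i386.rpm.asc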

1) Download the Apache project's Hadoop KEYS file and verify the .asc file that you downloaded.

% gpg --import KEYS

% gpg --verify hadoop-0.20.204.0-1.i386.rpm.asc

Import the Apache project's public PGP keys for rpm.

% rpm --import KEYS

Make sure Java is installed and JAVA_HOME is set to /usr/java/default.
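For example, assuming the JDK RPM installed under /usr/java as noted above:

% export JAVA_HOME=/usr/java/default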

2) Install Apache Hadoop 0.20.204.

% rpm -i hadoop-0.20.204.0-1.i386.rpm
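You can confirm that the package installed by querying rpm:

% rpm -q hadoop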

3) Set up the Apache Hadoop configuration and start the Hadoop processes.

% /usr/sbin/hadoop-setup-single-node.sh

The setup wizard will guide you through a list of questions to set up Hadoop. Hadoop should be running after you answer 'Y' to all of the questions.
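You can check that the daemons came up with the JDK's jps tool; on a single-node setup you should see processes such as NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode (the exact list depends on what the wizard started):

% jps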

4) Create a user account on HDFS for yourself.

% /usr/sbin/hadoop-create-user.sh -u $USER
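Assuming the script creates your home directory under /user (the usual HDFS convention), you can verify it with:

% hadoop fs -ls /user/$USER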


Multi-node setup

1) Install both the JDK and Hadoop 0.20.204.0 RPMs on all nodes.

2) Generate the Hadoop configuration on all nodes:

% /usr/sbin/hadoop-setup-conf.sh \
--namenode-url=hdfs://${namenode}:9000/ \
--jobtracker-url=${jobtracker}:9001 \
--conf-dir=/etc/hadoop \
--hdfs-dir=/var/lib/hadoop/hdfs \
--namenode-dir=/var/lib/hadoop/hdfs/namenode \
--mapred-dir=/var/lib/hadoop/mapred \
--datanode-dir=/var/lib/hadoop/hdfs/data \
--log-dir=/var/log/hadoop \
--auto

Here ${namenode} and ${jobtracker} should be replaced with the hostnames of the namenode and the jobtracker.
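Since the command above uses shell variable syntax, one way to fill in the values is to export them before running it (master1.example.com and master2.example.com are hypothetical hostnames):

% export namenode=master1.example.com

% export jobtracker=master2.example.com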

3) Format the namenode and set up the default HDFS layout.

% /usr/sbin/hadoop-setup-hdfs.sh
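If the namenode daemon is not already running on the namenode host after this step, start it with its init script (the script name below is an assumption, following the same pattern as the init scripts used in the next steps):

% /etc/init.d/hadoop-namenode start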

4) Start all data nodes.

% /etc/init.d/hadoop-datanode start

5) Start job tracker node.

% /etc/init.d/hadoop-jobtracker start

6) Start task tracker nodes.

% /etc/init.d/hadoop-tasktracker start
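Once the daemons are up, you can check cluster health from the namenode; the datanodes you started should appear in the report:

% hadoop dfsadmin -report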

7) Create a user account on HDFS for yourself.

% /usr/sbin/hadoop-create-user.sh -u $USER


Verify Hadoop's Operation

Run the word count example to confirm that HDFS and MapReduce are working end to end.
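A minimal sketch, assuming the RPM placed the examples jar under /usr/share/hadoop (the jar path is an assumption; adjust it to wherever hadoop-examples-*.jar landed on your system):

% hadoop fs -mkdir input

% hadoop fs -put /etc/hadoop/*.xml input

% hadoop jar /usr/share/hadoop/hadoop-examples-*.jar wordcount input output

% hadoop fs -cat 'output/part-*'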