How to Set Up Hadoop 2.4.0 on Mac OS X 10.9 Mavericks
Brief
Hadoop is an open-source Apache project that enables processing of extremely large datasets in a distributed computing environment. It can be run in three different modes:
Standalone Mode
Hadoop runs everything in a single JVM with no daemons. This mode is only suitable for testing and debugging MapReduce programs during development.
Pseudodistributed Mode
Hadoop daemons run on the local machine, simulating a small cluster.
Fully Distributed Mode
Hadoop daemons run on a cluster of machines.
This tutorial covers setting up Hadoop 2.4.0 stable in a Pseudodistributed Mode. Before getting started with the installation and configuration of Hadoop, there are some prerequisites.
Requirements
Java version 1.6.* or higher is required for Hadoop. Running the following command will prompt you for installation if you don’t already have Java installed:
KIWI-GDEF@:~ sungsoo$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
Homebrew. Though we can go without it, Homebrew will make installing Hadoop on a Mac significantly easier:
KIWI-GDEF@:~ sungsoo$ ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"
Remote Login Setting
First, enable SSH by checking “Remote Login” under System Preferences -> Sharing.
SSH keys. If you already have SSH keys set up, simply ssh into localhost to confirm they work. If you don’t, set those bad boys up:
KIWI-GDEF@:~ sungsoo$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
To authorize your public key and avoid being asked for a password every time you ssh into localhost:
KIWI-GDEF@:~ sungsoo$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now ssh into localhost and accept the host authorization prompt:
KIWI-GDEF@:~ sungsoo$ ssh localhost
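If you want a quick sanity check that key-based login really works, ssh's BatchMode option disables password prompts, so the command below only succeeds when the keys are in place (the echoed message is just an illustration):
KIWI-GDEF@:~ sungsoo$ ssh -o BatchMode=yes localhost echo "passwordless SSH is working"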
Installation
First, download the Hadoop 2.4.0 distribution binary from http://hadoop.apache.org/releases.html.
The source is available from the same page, and if you want a different version of Hadoop you can pick any release listed there. Unpack the .tar to the location of your choice and assign ownership to the user setting up Hadoop.
Then, copy the unpacked files into /usr/local/hadoop; a sketch of the unpack-and-copy step follows the commands below.
KIWI-GDEF@:~ sungsoo$ cd /usr/local
KIWI-GDEF@:~ sungsoo$ sudo mkdir hadoop
KIWI-GDEF@:~ sungsoo$ sudo chown sungsoo hadoop
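As a concrete sketch of the unpack-and-copy step, assuming the tarball was saved to ~/Downloads as hadoop-2.4.0.tar.gz (adjust the paths to match wherever you downloaded it):
KIWI-GDEF@:~ sungsoo$ cd ~/Downloads
KIWI-GDEF@:~ sungsoo$ tar xzf hadoop-2.4.0.tar.gz
KIWI-GDEF@:~ sungsoo$ cp -R hadoop-2.4.0/* /usr/local/hadoop/
Since /usr/local/hadoop was chowned to your user above, the copy does not need sudo.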
Settings in .bash_profile
- export the HADOOP_HOME environment variable in your .bash_profile file.
export HADOOP_HOME=/usr/local/hadoop
- set the executable path for Hadoop (bin and sbin), for example:
export PATH=./:/opt/local/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH
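After saving .bash_profile, reload it so the current shell picks up HADOOP_HOME and the updated PATH:
KIWI-GDEF@:~ sungsoo$ source ~/.bash_profile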
Configuration
Every component of Hadoop is configured using an XML file located in $HADOOP_HOME/etc/hadoop. MapReduce properties go in mapred-site.xml, HDFS properties in hdfs-site.xml, and common properties in core-site.xml. The general Hadoop environment properties are found in hadoop-env.sh.
hadoop-env.sh
Assuming Homebrew was used to install Hadoop, add the following line to hadoop-env.sh after the line “# export HADOOP_OPTS=-server”:
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
If Homebrew was not used, you have to add the following line as well:
export JAVA_HOME=/Library/Java/Home
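If /Library/Java/Home does not point at your JDK on Mavericks, the stock /usr/libexec/java_home helper resolves the current JDK path for you, so a version-agnostic alternative is:
export JAVA_HOME=$(/usr/libexec/java_home)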
core-site.xml
Note: the fs.default.name value is currently set to localhost for development purposes. If you’re setting up multiple nodes on your network, you will have to set the value to an hdfs:// URI pointing at the NameNode host.
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
- create the /usr/local/hadoop/tmp folder:
KIWI-GDEF@:~ sungsoo$ cd /usr/local/hadoop
KIWI-GDEF@:~ sungsoo$ sudo mkdir tmp
KIWI-GDEF@:~ sungsoo$ sudo chown sungsoo tmp
hdfs-site.xml
The Hadoop Distributed File System properties go in this config file. Since we are only setting up one node, we set the value of dfs.replication to 1.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
The MapReduce config below sets the JobTracker connection port.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
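Note that the Hadoop 2.4.0 tarball ships only a template for this file; if $HADOOP_HOME/etc/hadoop/mapred-site.xml does not exist yet, create it from the template before adding the property above:
KIWI-GDEF@:~ sungsoo$ cd /usr/local/hadoop/etc/hadoop
KIWI-GDEF@:~ sungsoo$ cp mapred-site.xml.template mapred-site.xml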
Almost Ready!
We must format the newly installed HDFS before we can start running the daemons. Formatting creates an empty filesystem by creating the storage directories and the initial metadata.
$ hadoop namenode -format
In Hadoop 2.x this command is deprecated in favor of hdfs namenode -format; either form works in 2.4.0.
Unleash the Daemons
Make sure you are still ssh’d into localhost. You can start HDFS by:
$ start-dfs.sh
and start YARN by:
$ start-yarn.sh
or, alternatively, start everything at once (start-all.sh is deprecated in 2.x but still works):
$ start-all.sh
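To verify that everything actually came up, the JDK's jps tool lists the running Java processes; on a healthy pseudodistributed node you should see something like the following (the process IDs here are illustrative):
KIWI-GDEF@:~ sungsoo$ jps
11501 NameNode
11592 DataNode
11691 SecondaryNameNode
11773 ResourceManager
11862 NodeManager
11951 Jps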
You now have Hadoop installed! Try running an example!
In the 2.4.0 binary distribution the bundled examples jar lives under share/hadoop/mapreduce, so:
KIWI-GDEF@:~ sungsoo$ cd /usr/local/hadoop
KIWI-GDEF@:~ sungsoo$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar pi 10 100
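If you would also like to confirm that HDFS itself accepts files, a quick round trip such as the one below works; the directory and the file chosen here are just illustrative:
KIWI-GDEF@:~ sungsoo$ hadoop fs -mkdir -p /user/sungsoo
KIWI-GDEF@:~ sungsoo$ hadoop fs -put /usr/local/hadoop/etc/hadoop/core-site.xml /user/sungsoo/
KIWI-GDEF@:~ sungsoo$ hadoop fs -ls /user/sungsoo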
Monitoring
You can monitor your NameNode and the YARN ResourceManager through their web UIs:
- NameNode information : http://localhost:50070
- YARN ResourceManager : http://localhost:8088
Finally, run the following commands to stop all daemons:
$ stop-dfs.sh
$ stop-yarn.sh