Introduction

Hadoop is a cluster compute engine, written in Java, that uses the map-reduce strategy to divide workloads over multiple nodes. In this blog post I will describe my experiences with installing a minimal (non-distributed) system and running simple jobs on it.

Hadoop execution modes

Hadoop has several execution modes; we will only be working with the first two listed below. Local (Standalone) mode is meant mainly for debugging. Pseudo-Distributed mode also runs only on the local machine, but should, in theory, behave the same as Fully-Distributed mode, apart from the number of nodes running the job. Full documentation can be found in the Apache reference manual.

  • Local (Standalone) mode
  • Pseudo-Distributed mode
  • Fully-Distributed mode

Standalone mode example on .xml data

Standalone mode prints all operations the Hadoop system performs to stdout. To see it in action, run the following commands from the root of a default Hadoop installation:

mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

An excerpt of the output this should yield:

17/03/23 14:41:27 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/03/23 14:41:27 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
17/03/23 14:41:27 INFO input.FileInputFormat: Total input paths to process : 8
17/03/23 14:41:27 INFO mapreduce.JobSubmitter: number of splits:8
17/03/23 14:41:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local2046457461_0001
17/03/23 14:41:27 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/03/23 14:41:27 INFO mapreduce.Job: Running job: job_local2046457461_0001
17/03/23 14:41:27 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/03/23 14:41:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/03/23 14:41:27 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/03/23 14:41:27 INFO mapred.LocalJobRunner: Waiting for map tasks
17/03/23 14:41:27 INFO mapred.LocalJobRunner: Starting task: attempt_local2046457461_0001_m_000000_0
17/03/23 14:41:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/03/23 14:41:28 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/03/23 14:41:28 INFO mapred.MapTask: Processing split: file:/opt/docker/hadoop-2.7.3/input/hadoop-policy.xml:0+9683

The complete output is here.

Some things you see in action (sketched in code after this list):

  • Splitter (divides the data into parts)
  • Mappers
  • Combiners
  • Reducers
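
For concreteness, here is a minimal sketch, my own and not the actual example's source, of what such a mapper/reducer pair looks like in the Hadoop Java API. The mapper emits (match, 1) for every hit of the dfs[a-z.]+ pattern, and the reducer sums the counts per match:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GrepSketch {

    // Mapper: emit (match, 1) for every occurrence of the pattern in a line.
    public static class MatchMapper extends Mapper<Object, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Pattern pattern = Pattern.compile("dfs[a-z.]+");

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = pattern.matcher(value.toString());
            while (m.find()) {
                context.write(new Text(m.group()), ONE);
            }
        }
    }

    // Reducer (also usable as a combiner): sum the counts emitted per match.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }
}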

Word-count example in Pseudo-Distributed mode

To switch modes, we need to edit etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml.

core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Format the HDFS filesystem if you haven't already, start the daemons, and create a home directory:

bin/hdfs namenode -format
sbin/start-dfs.sh
bin/hdfs dfs -mkdir -p /user/root

This should yield something like:

Starting namenodes on [localhost]
localhost: starting namenode, logging to /opt/docker/hadoop-2.7.3/logs/hadoop-root-namenode-b78cab376c69.out
localhost: starting datanode, logging to /opt/docker/hadoop-2.7.3/logs/hadoop-root-datanode-b78cab376c69.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/docker/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-b78cab376c69.out

Note: at some point I messed up the HDFS filesystem, which appeared to be locked by a non-existent thread (the logs are in the $HADOOP_ROOT$/logs directory). The solution was:

  • Stop all data nodes: sbin/stop-dfs.sh
  • Remove all data in /tmp: rm -rf /tmp/
  • Re-start name/data nodes: sbin/start-dfs.sh

Get the input data and store it on the data node

wget http://www.gutenberg.org/ebooks/100.txt.utf-8
wget https://gist.githubusercontent.com/WKuipers/87a1439b09d5477d21119abefdb84db0/raw/c327b9f74d30684b1ad2a0087a6de805503379d3/WordCount.java
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put 100.txt.utf-8 input

Compile the .jar file

export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class

Run it

bin/hadoop jar wc.jar WordCount input output
bin/hdfs dfs -get output/part-r-00000
bin/hdfs dfs -rm -r output
cat part-r-00000

Sample (partial) output:

"Air,"	1
"Alas,	1
"Amen"	2
"Amen"?	1
"Amen,"	1
"And	1
"Aroint	1
"B	1
"Black	1
"Break	1
"Brutus"	1
"Brutus,	2
"C	1
"Caesar"?	1

The complete part-r-00000 file is here.

MapReduce for line-count or character-count:

Modify the WordCount.java script so that the mapper emits a 1 for each character or line, using the same key for every instance. The summation (in the reducer) stays the same.
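
A minimal sketch of such a line-count mapper, assuming the same job setup as WordCount (the class name LineCountMapper is my own):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Drop-in replacement for the WordCount mapper: every input line emits the
// constant key "lines" with value 1, so the unchanged summing reducer
// produces the total line count under that single key.
public class LineCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final Text LINES = new Text("lines");

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(LINES, ONE);
        // For a character count, emit the byte length of the line instead:
        // context.write(new Text("chars"), new IntWritable(value.getLength()));
    }
}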

Averaging requires a more elaborate reducer, one which keeps track of both the running sum and the number of values submitted for each key.
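
A sketch of such an averaging reducer, assuming the mappers emit IntWritable values (the class name AverageReducer is my own):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Keep a running sum and a count of the values seen per key, then emit
// their quotient as the average.
public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (IntWritable v : values) {
            sum += v.get();
            count++;
        }
        context.write(key, new DoubleWritable((double) sum / count));
    }
}

Note that unlike the summing reducer, this one cannot double as a combiner: an average of partial averages is not the overall average.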

Appearance of Romeo or Juliet

Romeo and Juliet get mentioned a lot by other characters as well, which does not necessarily correlate with their actual appearances. It is probably better to filter on only the first word of each sentence.
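
A sketch of a mapper along those lines (splitting sentences on [.!?] is a crude assumption, and the class name SpeakerMapper is my own):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Only count "Romeo" or "Juliet" when the name is the first word of a
// sentence. Splitting on [.!?] within a single line is crude: a sentence
// that spans multiple input lines is treated as starting anew on each line.
public class SpeakerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String sentence : value.toString().split("[.!?]")) {
            String[] words = sentence.trim().split("\\s+");
            if (words.length > 0
                    && (words[0].equals("Romeo") || words[0].equals("Juliet"))) {
                context.write(new Text(words[0]), ONE);
            }
        }
    }
}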