First experiences with Hadoop and the Hadoop filesystem
Introduction
Hadoop is a cluster compute engine written in Java that uses the MapReduce strategy to divide workloads over multiple nodes. In this blog post I will describe my experiences installing a minimal (non-distributed) system and running simple jobs on it.
Hadoop execution modes
Hadoop has three execution modes; we will only be working with the first two. Local (Standalone) mode is meant mainly as a debugging mode. Pseudo-distributed mode still runs only on the local machine, but should, in theory, behave the same as fully-distributed mode, apart from the number of nodes running the job of course. Full documentation can be found in the Apache reference manual.
- Local (Standalone) mode
- Pseudo-Distributed mode
- Fully-Distributed mode
Standalone mode example on .xml data
The standalone mode prints all operations the Hadoop system performs to stdout. To see it in action, run the following commands on a default installation of Hadoop:
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
An extract of the output this should yield:
17/03/23 14:41:27 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/03/23 14:41:27 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
17/03/23 14:41:27 INFO input.FileInputFormat: Total input paths to process : 8
17/03/23 14:41:27 INFO mapreduce.JobSubmitter: number of splits:8
17/03/23 14:41:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local2046457461_0001
17/03/23 14:41:27 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/03/23 14:41:27 INFO mapreduce.Job: Running job: job_local2046457461_0001
17/03/23 14:41:27 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/03/23 14:41:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/03/23 14:41:27 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/03/23 14:41:27 INFO mapred.LocalJobRunner: Waiting for map tasks
17/03/23 14:41:27 INFO mapred.LocalJobRunner: Starting task: attempt_local2046457461_0001_m_000000_0
17/03/23 14:41:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/03/23 14:41:28 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
17/03/23 14:41:28 INFO mapred.MapTask: Processing split: file:/opt/docker/hadoop-2.7.3/input/hadoop-policy.xml:0+9683
The complete output is here
Some things you see in action:
- Splitter (divides the input data into parts, one per map task)
- Mappers (turn each input record into key/value pairs)
- Combiners (pre-aggregate map output locally, before the shuffle)
- Reducers (aggregate all values per key into the final output)
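To make these roles concrete, below is a minimal word-count-style driver sketch, written in the Hadoop 2.x Java API. This is my own illustration, not the code from the examples jar; the class names PhasesDemo, DemoMapper and DemoReducer are hypothetical.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PhasesDemo {

  // Mapper: called once per input record (with the default
  // TextInputFormat, one record is one line of text).
  public static class DemoMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String word : line.toString().split("\\s+")) {
        if (!word.isEmpty()) ctx.write(new Text(word), ONE);
      }
    }
  }

  // Reducer: receives all values for one key; reused as the combiner.
  public static class DemoReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "phases demo");
    job.setJarByClass(PhasesDemo.class);
    // FileInputFormat computes the input splits (the "number of
    // splits:8" in the log above); one map task runs per split.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(DemoMapper.class);
    job.setCombinerClass(DemoReducer.class); // local pre-aggregation per map task
    job.setReducerClass(DemoReducer.class);  // final aggregation per key
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The summing reducer can safely double as the combiner here because partial sums can be summed again; that is not true for every reduction.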
Word-count example in Pseudo-Distributed mode
To switch modes we need to edit etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml: fs.defaultFS points clients at the local HDFS namenode, and dfs.replication is set to 1 because a single-node cluster has only one datanode to replicate to.
core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Format the Hadoop filesystem if you haven’t already, start the HDFS daemons, and create a home directory:
bin/hdfs namenode -format
sbin/start-dfs.sh
bin/hdfs dfs -mkdir -p /user/root
This should yield something like:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /opt/docker/hadoop-2.7.3/logs/hadoop-root-namenode-b78cab376c69.out
localhost: starting datanode, logging to /opt/docker/hadoop-2.7.3/logs/hadoop-root-datanode-b78cab376c69.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/docker/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-b78cab376c69.out
Note: at some point I messed up the HDFS filesystem, which appeared to be locked by a thread that no longer existed (the logs are in the $HADOOP_ROOT$/logs directory). The solution was:
- Stop all data nodes:
sbin/stop-dfs.sh
- Remove all data in /tmp (by default HDFS stores its data under /tmp, so this wipes the filesystem contents):
rm -rf /tmp/
- Re-start the name/data nodes (after wiping /tmp you may need to run bin/hdfs namenode -format again before the namenode will start):
sbin/start-dfs.sh
Get the input data (Project Gutenberg ebook 100, the complete works of Shakespeare, plus the WordCount source) and store it in HDFS
wget http://www.gutenberg.org/ebooks/100.txt.utf-8
wget https://gist.githubusercontent.com/WKuipers/87a1439b09d5477d21119abefdb84db0/raw/c327b9f74d30684b1ad2a0087a6de805503379d3/WordCount.java
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put 100.txt.utf-8 input
Compile WordCount.java and package the .jar file
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
Run it
bin/hadoop jar wc.jar WordCount input output
bin/hdfs dfs -get output/part-r-00000
bin/hdfs dfs -rm -r output
cat part-r-00000
Sample (partial) output:
"Air," 1
"Alas, 1
"Amen" 2
"Amen"? 1
"Amen," 1
"And 1
"Aroint 1
"B 1
"Black 1
"Break 1
"Brutus" 1
"Brutus, 2
"C 1
"Caesar"? 1
The complete part-r-00000 file
MapReduce for line-count or character-count:
Modify the WordCount.java source so that the mapper emits a 1 for each character or line, using the same key for every instance, as in the sketch below. The summation (in the reducer) is the same.
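A minimal sketch of such a mapper, assuming the class layout and imports of the WordCount.java linked above (the name LineCountMapper is my own): every input line emits a 1 under the single key "lines", so the unmodified summing reducer produces the total line count.

  public static class LineCountMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final Text LINES = new Text("lines");
    @Override
    public void map(Object offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // One record per input line, all under the same key.
      ctx.write(LINES, ONE);
      // For a character count, emit the line length instead:
      // ctx.write(new Text("chars"), new IntWritable(line.toString().length()));
    }
  }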
Averaging requires a more elaborate reducer, which has to keep track of both the running total and the number of key/value pairs submitted; see the sketch below.
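A sketch of what such a reducer could look like (AverageReducer is a hypothetical name; DoubleWritable comes from org.apache.hadoop.io):

  public static class AverageReducer
      extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      long count = 0;
      for (IntWritable v : values) { // track both total and count
        sum += v.get();
        count++;
      }
      ctx.write(key, new DoubleWritable((double) sum / count));
    }
  }

Unlike the summing reducer, this one cannot be reused as a combiner: an average of partial averages is not the overall average. If you need a combiner, emit (sum, count) pairs from it and only divide in the reducer.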
Appearance of Romeo or Juliet
Romeo and Juliet get mentioned a lot by other characters as well, which does not necessarily correlate with an actual appearance. It is probably better to filter on only the first word of each sentence, as sketched below.
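A rough sketch of such a mapper (FirstWordMapper is my own name). It checks the first word of each line rather than each sentence, since in the Gutenberg text a character's lines open with a speaker tag; the exact tag format depends on the edition, so the prefix check is only a first approximation.

  public static class FirstWordMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    public void map(Object offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] words = line.toString().trim().split("\\s+");
      // Only count a name when it opens the line, as a rough proxy
      // for a speaker tag rather than a mention by another character.
      if (words.length > 0
          && (words[0].startsWith("Rom") || words[0].startsWith("Jul"))) {
        ctx.write(new Text(words[0]), ONE);
      }
    }
  }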