We have adapted the Pseudo-Distributed Cluster example from the Hadoop documentation and made it easier to execute. Glance over the Hadoop documentation, with the lectures in the back of your mind, but then follow the instructions below.
Create, start and attach to a container:
docker create --name hello-hadoop -it -p 8088:8088 -p 9870:9870 rubigdata/course:a2
docker start hello-hadoop
docker attach hello-hadoop
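You can verify from another terminal on the host that the container exists and is running; a minimal check, assuming the hello-hadoop name used above:

# list the hello-hadoop container (-a also shows it when created but not yet started)
docker ps -a --filter name=hello-hadoop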
Where am I, what is going on?
whoami
pwd
ls
ps -ef
printenv PATH
In other words, the container starts as user hadoop in the directory where Hadoop has been installed, and an sshd (ssh server) has been started. The Hadoop directories are included on the path, so unlike in the online tutorial text, you do not have to prefix commands with sbin/; the shell will find them.
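As a quick sanity check, you can ask the shell where it finds the cluster scripts, and list the running Java daemons; a small sketch, assuming jps (which ships with the JDK) is available in the image:

# confirm the cluster scripts resolve without a sbin/ prefix
command -v start-dfs.sh start-yarn.sh
# list Java processes (NameNode, DataNode, ... once the cluster is up)
jps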
Create a distributed file system and start a cluster:
hdfs namenode -format
start-dfs.sh
If you later reuse this container (the container named hello-hadoop, not the image!), you do not need to create the file system again; you would start up the cluster right away.
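The reuse flow for a later session could then look like this (a sketch, assuming you kept the hello-hadoop container around):

# reuse the existing container: no 'hdfs namenode -format' this time
docker start hello-hadoop
docker attach hello-hadoop
start-dfs.sh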
Create your home directory (we are logged in as user hadoop), and upload some files from the local filesystem (in the container) to the distributed filesystem (in this case, also in the container - but keep in mind that these are very different filesystems: one uses a default Linux or Windows filesystem, like ext4 or NTFS, locally, and the other one runs HDFS on multiple machines; check again the processes that run, ps -ef, to see how many datanodes we have).
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -mkdir input
hdfs dfs -put etc/hadoop/*.xml input
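To convince yourself that the upload worked, and to see how many datanodes report to the namenode, the standard hdfs subcommands below are helpful (a sketch of optional checks):

# list the files you just put on the distributed filesystem
hdfs dfs -ls input
# cluster summary, including the number of live datanodes (one, in this pseudo-distributed setup)
hdfs dfsadmin -report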
Provided that you used the container creation command above that
exposes port 9870 using
-p 9870:9870, you can access the namenode
Web UI in your browser.
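If you first want to check from the command line that the UI responds, something like the following should work from a terminal on the host, assuming curl is installed there and the default port mapping above is in place:

# probe the namenode Web UI on the host; expect a few lines of HTML back
curl -s http://localhost:9870 | head -n 5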
On production clusters, you never run jobs by hand, but use a special service that manages the cluster resources. The main motivation here is to enable resource sharing among a team of data scientists who all want to claim disk and memory for their own analyses.
The default choice since Hadoop 2 is yarn, yet another resource negotiator. You start and stop yarn using start-yarn.sh and stop-yarn.sh, and issue Map Reduce jobs with yarn jar.
So, start the resource manager:
start-yarn.sh
You submit a Map Reduce job to yarn as follows:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar grep input output 'dfs[a-z.]+'
To really understand what is happening, you will have to look into
the Map Reduce program that is executed, but let us first complete this
walk-through before we look into the details. For now, it is
sufficient to realize that
grep applies a regular expression over
the input data, and returns the entries that match the regular expression (dfs[a-z.]+), a task that you can also achieve with the probably familiar UNIX command-line utility grep.
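As a rough local equivalent, plain UNIX grep on the same configuration files gives a feel for what the job computes; a sketch only, since the MapReduce example also counts the matches and sorts them by frequency, which the pipeline below merely approximates:

# extract all matches of the regular expression from the local copies of the input files and count them
grep -ohE 'dfs[a-z.]+' etc/hadoop/*.xml | sort | uniq -c | sort -rn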
When the program has completed, you have to move the results from the distributed filesystem onto your local filesystem before you can access them for further analysis:
hdfs dfs -get output output
cat output/*
Always be careful when copying results from
HDFS as they can be
much larger than your local filesystem! In this specific case, where
you know you have small input and therefore also small output sizes
(in my case, the log states
Bytes Written=29), it is safe to do so;
you could even usefully apply
hdfs dfs -cat output/* instead.
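One way to stay on the safe side is to check the size of the job output on HDFS before copying it; the human-readable disk-usage flag is convenient here:

# show the size of the output directory on HDFS before fetching it
hdfs dfs -du -h output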
You may want to run more Hadoop examples; I like the one that estimates the value of 𝛑 with a (quasi-) Monte Carlo simulation:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 16 10000
If you increase the number of map tasks (from 16) and/or their sample size (from 10000), you will improve the estimate, but the job runs longer. This does give a good opportunity to inspect the running job
with the resource manager Web UI, exposed at port 8088. If you
created the container with the
-p 8088:8088 parameter, you can
access the Web UI in your browser. The
resource manager tells you the cluster state, the applications that
run and have run, etc.
Note: because we do not have a real cluster (only a pseudo-distributed one), links to the application logs will not work correctly.
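Besides the Web UI, the yarn command-line client can list applications and their state, which is handy precisely when links in the UI (to application logs, for example) do not work:

# list applications known to the resource manager (add -appStates ALL to include finished ones)
yarn application -list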
Now read up on the other command options to hdfs dfs, online at the Hadoop documentation site, or in your container (press 'q' to exit less):
hdfs dfs -help | less
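A few hdfs dfs options you will probably use a lot; all of these mirror their UNIX namesakes:

hdfs dfs -ls /user/hadoop   # list a directory
hdfs dfs -du -h input       # disk usage, human readable
hdfs dfs -rm -r output      # remove a directory recursively, e.g. before re-running a job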
When you have seen enough, stop yarn and the HDFS cluster:
stop-yarn.sh
stop-dfs.sh
You may then detach from the container by pressing ^d.
When you restart the container later, you may have to restart the SSH daemon with sudo /opt/ssh/sbin/sshd; check ps -ef to find out if you have to. Next, you restart the cluster using start-dfs.sh and start-yarn.sh.
If you want to close the terminal but keep the container running in the background until you re-attach, you can also use the special key sequence ^p^q (but do re-attach and stop yarn and HDFS gracefully before you shut down the computer, or the container may end up in an inconsistent state).
Should you really want to install additional software on the container, the basis is a ubi8/ubi-minimal container where you install software with microdnf. Because you run as user hadoop, you need to use sudo and enter rubigdata2021 as the password.
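For example (a hypothetical package choice, just to show the mechanics), installing an extra editor would look like this:

# install an additional package inside the container; sudo will ask for the password mentioned above
sudo microdnf install nano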
Never hesitate to ask for help in the Matrix room if you get stuck!