Tutorial
Background
We have adapted the Pseudo-Distributed Cluster example from the Hadoop documentation and made it easier to execute. Glance over that part of the Hadoop documentation, with the lectures in the back of your mind, but then follow the instructions below.
Create, start and attach to a container:
docker create --name hello-hadoop -it -p 8088:8088 -p 9870:9870 rubigdata/course:a2
docker start hello-hadoop
docker attach hello-hadoop
Where am I, what is going on?
whoami
pwd
ls
ps -ef
printenv PATH
In other words, the container starts as user hadoop in the directory where Hadoop has been installed, and an sshd (SSH server) has been started. The Hadoop directories are included on the PATH, so unlike in the online tutorial text, you do not have to prefix commands with bin/ or sbin/; the shell will find them.
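A quick way to convince yourself of this is to ask the shell where it picks up the Hadoop commands (a small extra check, not part of the original walk-through):
type hdfs start-dfs.sh
hdfs version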
Tutorial
HDFS
Create a distributed file system and start a cluster:
hdfs namenode -format
start-dfs.sh
If you later reuse this container (the container named hello-hadoop, not the image!), you do not need to create the file system again; you can start the cluster right away.
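For example, a later session on that same container might look roughly like this (a sketch; note that formatting the namenode is skipped):
docker start hello-hadoop
docker attach hello-hadoop
# inside the container: no 'hdfs namenode -format' this time
start-dfs.sh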
Create your home directory (we are logged in as user hadoop), and upload some files from the local filesystem (in the container) to the distributed filesystem (in this case, also in the container - but keep in mind that they are very different filesystems: one is a default Linux or Windows filesystem like ext4 or NTFS locally, the other runs HDFS on multiple machines; check again the processes that run, ps -ef, to see how many datanodes we have).
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -mkdir input
hdfs dfs -put etc/hadoop/*.xml input
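To check how many datanodes are alive, and that the upload actually landed in HDFS, you could for instance combine the process listing with a cluster report and a directory listing (a sketch):
ps -ef | grep '[D]ataNode'   # the [D] trick keeps grep from matching itself
hdfs dfsadmin -report        # reports live datanodes (a single one in this pseudo-distributed setup)
hdfs dfs -ls input           # the .xml files you just uploaded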
Provided that you used the container creation command above that
exposes port 9870 using -p 9870:9870
, you can access the namenode
Web UI in your browser.
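With that mapping in place, the namenode Web UI is reachable at http://localhost:9870 on the host. If you happen to have curl on the host, a minimal reachability check could look like this (purely a convenience, run it outside the container):
curl -sI http://localhost:9870/ | head -n 1   # expect an HTTP status line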
Yarn
On production clusters, you never run jobs by hand, but use a special service that manages the cluster resources. The main motivation here is to enable resource sharing among a team of data scientists who all want to claim disk and memory for their own analyses.
The default choice since Hadoop 2 is yarn, Yet Another Resource Negotiator. You start and stop yarn using start-yarn.sh and stop-yarn.sh, and issue Map Reduce jobs with yarn jar.
So, start the resource manager:
start-yarn.sh
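Once it is up, you can ask yarn which nodes it manages; in this pseudo-distributed setup you should see a single node manager:
yarn node -list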
Map Reduce
You submit a Map Reduce job to yarn
as follows:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar grep input output 'dfs[a-z.]+'
To really understand what is happening, you will have to look into the Map Reduce program that is executed, but let us first complete this walk-through before we look into the details. For now, it is sufficient to realize that grep applies a regular expression over the input data and returns the entries that match the regular expression (dfs[a-z.]+), a task that you can also achieve with the UNIX command-line utility grep, which is probably familiar to you.
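Just to illustrate the analogy (not how you would process big data, of course), roughly the same counts can be computed over the local copies of the configuration files with plain UNIX tools; a sketch, assuming GNU grep:
grep -Eoh 'dfs[a-z.]+' etc/hadoop/*.xml | sort | uniq -c | sort -rn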
When the program has completed, you have to move the results from the distributed filesystem onto your local filesystem before you can access them for further analysis:
hdfs dfs -get output output
cat output/*
Always be careful when copying results from HDFS, as they can be much larger than your local filesystem can hold! In this specific case, where you know you have small input and therefore also small output (in my case, the log states Bytes Written=29), it is safe to do so; you could even usefully apply hdfs dfs -cat output/* instead.
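If you are unsure about the size of a result, ask HDFS before copying anything; for example:
hdfs dfs -du -h output   # human-readable size of the job output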
You may want to run more Hadoop examples; I like the one that estimates the value of 𝛑 with a (quasi-) Monte Carlo simulation:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 16 10000
If you increase the number of map tasks (from 16) and/or their sample size (from 10000), you will improve the estimate, but the code will run longer. This gives a good opportunity to inspect the running job in the resource manager Web UI, exposed at port 8088. If you created the container with the -p 8088:8088 parameter, you can access this Web UI in your browser. The resource manager tells you the cluster state, the applications that run and have run, etc.
Note: because we do not have a real cluster (only a pseudo-distributed one), links to the application logs will not work correctly.
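If you want to experiment, a larger run might look like the sketch below (32 map tasks of 100000 samples each, arbitrary numbers); you can also list applications from the command line instead of the Web UI:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 32 100000
yarn application -list -appStates FINISHED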
Wrap up
Now read up on the other command options to hdfs dfs
, online at the
Hadoop documentation site, or in your container (press ‘q’ to exit
less
):
hdfs dfs -help | less
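A few options that tend to be useful while exploring (just a sample):
hdfs dfs -ls -R -h /user/hadoop   # recursive listing with human-readable sizes
hdfs dfs -df -h                   # capacity and usage of the distributed filesystem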
When you have seen enough, stop yarn
and the HDFS cluster:
stop-yarn.sh
stop-dfs.sh
You may then detach from the container by pressing ^D.
When you restart the container later, you may have to restart the SSH daemon using sudo /opt/ssh/sbin/sshd; check ps -ef to find out whether you have to. Next, you restart the cluster using start-dfs.sh and start-yarn.sh.
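Put together, a restart after a break might look roughly like this (a sketch; start sshd only if ps shows it is not running):
ps -ef | grep '[s]shd' || sudo /opt/ssh/sbin/sshd
start-dfs.sh
start-yarn.sh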
If you want to close the terminal but keep the container running in the background until you re-attach, you can also use the special key sequence ^p^q (but do re-attach and stop yarn and HDFS gracefully before you shut down the computer, or the container may end up in an undefined state).
Should you really want to install additional software in the container, the basis is a ubi8/ubi-minimal container where you install software with microdnf. Because you run as user hadoop, you need to use sudo and enter rubigdata2021 as the password.
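For example, installing an editor could look like this (nano is just an arbitrary example package):
sudo microdnf install -y nano   # enter rubigdata2021 when sudo asks for the password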
Never hesitate to ask for help in the Matrix room if you get stuck!
Back to Assignment Two.