This page is superseded by the new Hadoop Tutorial page!

Map Reduce on HDFS

We have prepared an extended Spark-Notebook Docker container, configured to run Hadoop on our own “pseudo-cluster”. You will use the docker exec command to start a shell inside the Docker container, and work from that shell.

Setup

First time setup

Run the course’s Docker container and execute a shell in it; in that shell we still have to start the ssh service (a shortcoming of our current Docker image).

## Create the container and start it
DID=$(docker create -p 50070:50070 -p 50075-50076:50075-50076 -p 9000:9000 rubigdata/hadoop)
docker start $DID

## Execute a shell in the running container
docker exec -it $DID /bin/bash

Note: The above command fails under PowerShell (Windows) or, on a Linux laptop, if you did not set up the docker group correctly. On Windows you can simply skip the DID environment variable; on Linux, add sudo (inside the $(), though, not in front of DID!).
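For example, the two workarounds would look roughly like this (a sketch; the image name and port mappings are the same as above):

## Linux without the docker group: the sudo goes inside the $(), not in front of DID=
DID=$(sudo docker create -p 50070:50070 -p 50075-50076:50075-50076 -p 9000:9000 rubigdata/hadoop)
sudo docker start $DID
sudo docker exec -it $DID /bin/bash

## PowerShell: skip the variable and reuse the container id printed by docker create
docker create -p 50070:50070 -p 50075-50076:50075-50076 -p 9000:9000 rubigdata/hadoop
docker start <container-id>
docker exec -it <container-id> /bin/bash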

The port mappings (-p flags on container creation) allow access to services that run inside the container from a web browser that runs on the host. Port 50070 is the namenode WebUI; ports 50075 and 50076 provide datanode WebUIs. Port 9001 is configured as the IPC port for the namenode and does not need to be exposed. Port 9000 provides access to Spark Notebook, which we use in later sessions of the course; I included the mapping so you can reuse the same instructions later.
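If you want to verify which ports are exposed, docker port lists the mappings of a running container; a quick check from the host:

## List the port mappings of the container
docker port $DID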

Returning student

If you worked on a container before, you can continue with that same container instead of starting a new one. Docker containers can be started and stopped whenever you want!

Use docker start with the right hash; use docker ps -a to find it.
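In shell terms that boils down to something like the following (substitute the hash or name that docker ps -a reports for your own container):

## Find your container from last time, start it, and open a shell in it
docker ps -a
docker start <container-id>
docker exec -it <container-id> /bin/bash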

Of course, this assumes you work on the same terminal PC as in the previous week. Docker images and containers are stored locally on each machine, not in your homedir under NFS. (PS: If another student works on that PC under Linux, you can simply ssh into it!)

Intermezzo

If you are exploring the source of the Hadoop examples later on in the lab session, I recommend doing these steps on your host machine and not inside the Docker container; it is much easier to work with your favourite GUI, editors, copy-paste support, and so on.

You can exchange files between the host and the container in three different ways (a sketch of options 1 and 3 follows after the list):

  1. Use docker cp to copy files between a container and the local filesystem.

  2. Use scp to copy files via lilo (the FNWI LInux LOgin server); your homedir in the terminal room is also mounted through NFS on lilo.

  3. Third option, not possible in the Huygens terminal rooms: Create a directory ${HOME}/bigdata in your homedir that you share with the Docker container using the -v flag to the docker run command.
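A rough sketch of options 1 and 3; the container path and the /bigdata mount point are examples of my own, adapt them to your setup:

## Option 1: copy files between host and container with docker cp
docker cp $DID:/root/hadoop-2.9.2/etc/hadoop/hdfs-site.xml .
docker cp hdfs-site.xml $DID:/root/hadoop-2.9.2/etc/hadoop/

## Option 3: share ${HOME}/bigdata with the container (add -v when creating or running it)
mkdir -p ${HOME}/bigdata
DID=$(docker create -v ${HOME}/bigdata:/bigdata -p 50070:50070 -p 50075-50076:50075-50076 -p 9000:9000 rubigdata/hadoop)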

Pseudo Distributed

## Inside the container: change to the Hadoop installation directory
cd hadoop-2.9.2

We will now set up a “real” cluster, even though we will only emulate it on our machine (inside the Docker container, actually).

You find the configuration files prepared as etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml; inspect them and you will notice that replication has been set to one instead of the default of three (Q: why does that make sense in the pseudo-cluster scenario?).
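For reference, the replication setting in etc/hadoop/hdfs-site.xml looks roughly as follows (dfs.replication is the standard Hadoop property; the prepared file may contain more settings):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>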

HDFS

Prepare (format) the distributed filesystem:

bin/hdfs namenode -format

Start HDFS and create the user directory; here, I assume you simply work as user root in the Docker container. (If you skipped “Setup passphraseless ssh” you will have to type yes a few times, once for every node in the pseudo-cluster that we create.)
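start-dfs.sh logs into the nodes of the (pseudo-)cluster over ssh, so the ssh service has to be running inside the container (see the note in the Setup section), and key-based login saves you the yes-typing. A sketch of both steps; the exact service command depends on the image, and the key-generation lines follow the Hadoop single-node documentation:

## Start the ssh daemon inside the container (the command may differ per image)
service ssh start

## Passphraseless ssh to localhost (skip if already configured in the image)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys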

sbin/start-dfs.sh

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/root

To run map-reduce jobs on a cluster, you first have to copy the data to the HDFS filesystem.

bin/hdfs dfs -put etc/hadoop input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -get output output

bin/hdfs dfs -ls hdfs://localhost:9001/user/root/input

Try to understand exactly the effect of each of these commands: which files are you operating on, and what does each operation do? Extra hint: try grep dfs.class etc/hadoop/* and compare its results with the output you created above.
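To see what the grep job produced, inspect both the copy on HDFS and the local copy retrieved with -get (standard hdfs dfs and shell commands):

## Job output as stored on HDFS, and the local copy made by -get
bin/hdfs dfs -cat output/*
cat output/*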

You can view the web interface for the namenode by opening localhost:50070/ in a browser on your host, because the -p option on the docker create command above maps port 50070 inside the container (the namenode) onto the exact same port at the host (that you access in the browser). Note that the localhost:9001 in the hdfs dfs command above refers to port 9001 inside the container!

When trying to figure out what happens exactly, it is good to know that the source of the example programs is included in the release, e.g.,

jar tvf share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.9.2-sources.jar

(You can unpack .jar files, short for “Java archive”, using jar xf.)
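For example, to read the source of the Grep program used above, something along these lines should work; the path inside the jar is what I expect from the package name, and the jar tvf listing shows the exact location:

## Extract the example sources into a scratch directory and open the Grep program
mkdir /tmp/examples-src
cp share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.9.2-sources.jar /tmp/examples-src
cd /tmp/examples-src && jar xf hadoop-mapreduce-examples-2.9.2-sources.jar
less org/apache/hadoop/examples/Grep.java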

Stop the filesystem gracefully when you are done:

sbin/stop-dfs.sh

See also the Hadoop documentation for running a Pseudo-Distributed Cluster

Adding a second datanode

If you complete the assignments easily, it can be interesting to see how to grow our (pseudo-distributed) cluster by adding a (virtual) datanode. This part of the tutorial is optional; if you struggled through the previous parts, feel free to skip it.

Basically, follow the instructions given in this mail from the Hadoop mailing list, updated for the current Hadoop version:

  1. Copy etc/hadoop to etc/hadoop2 and create a new data directory in /tmp, e.g., mkdir /tmp/hadoop-root/dfs/data_02
  2. Edit etc/hadoop2/hadoop-env.sh to define a new datanode name, export HADOOP_IDENT_STRING=${USER}_02
  3. Update etc/hadoop2/hdfs-site.xml with the additional parameters (see the sketch after this list). You can also download my version and copy it into the container.
  4. Finally, we can start the second datanode: bin/hdfs --config etc/hadoop2 datanode
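As an indication of what the additional parameters in etc/hadoop2/hdfs-site.xml look like, a rough sketch follows; the property names are standard Hadoop, but the port numbers are merely choices that avoid clashing with the first datanode (50076 matches the second port mapping of the container):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/tmp/hadoop-root/dfs/data_02</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50011</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50076</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:50021</value>
</property>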

You can see the second datanode appear in the WebUI of the namenode, localhost:50070. Also, if you kill the datanode by pressing ^C, the namenode will eventually notice and mark the second datanode red. Restart it, and it is green again. You may want to play with configuring the cluster differently, for example by modifying the replication factor, to understand HDFS behaviour in more detail.
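For example, with the second datanode running you can raise the replication factor of the input files and watch the effect in the namenode WebUI (hdfs dfs -setrep is a standard command; -w waits until replication completes):

## Raise the replication factor of everything under input to two
bin/hdfs dfs -setrep -w 2 input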

See also

Real clusters use a cluster management system like Yarn to start and stop services and manage map-reduce jobs.

If you are interested in seeing how that works (voluntary, not required for the course), you could try the additional steps from the Hadoop documentation to run Yarn on a single node, keeping in mind that we use port 9001 where the documentation uses 9000.
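A very rough sketch of those steps, following the Hadoop single-node documentation: set mapreduce.framework.name to yarn in etc/hadoop/mapred-site.xml and yarn.nodemanager.aux-services to mapreduce_shuffle in etc/hadoop/yarn-site.xml, and then start the daemons. The ResourceManager WebUI listens on port 8088, which is not mapped by the docker create command above, so add a -p mapping if you want to reach it from the host.

## Start the Yarn daemons (ResourceManager and NodeManager)
sbin/start-yarn.sh

## ... run map-reduce jobs as before ...

## Stop the Yarn daemons when done
sbin/stop-yarn.sh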

Back to Map-Reduce assignment