Tutorial
Background
We have adapted the Pseudo-Distributed Cluster example from the Hadoop documentation and made it easier to execute. Glance over that part of the Hadoop documentation, with the lectures in the back of your mind, but then follow the instructions below.
Create, start and attach to a container:
docker create --name hello-hadoop -it -p 8088:8088 -p 9870:9870 rubigdata/course:a2
docker start hello-hadoop
docker attach hello-hadoop
Where am I, what is going on?
whoami
pwd
ls
ps -ef
printenv PATH
In other words, the container starts as user hadoop in the directory where Hadoop has been installed, and an sshd (SSH server) has been started. The Hadoop directories are included on the PATH, so unlike in the online tutorial text, you do not have to prefix commands with bin/ or sbin/; the shell will find them.
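A quick way to convince yourself of this is to ask the shell where it picks up the Hadoop commands (a small extra check, not part of the original walk-through):
type hdfs start-dfs.sh
hdfs version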
Tutorial
HDFS
Create a distributed file system and start a cluster:
hdfs namenode -format
start-dfs.sh
If you later reuse this container (the container named hello-hadoop, not the image!), you do not need to create the file system again; you can start the cluster right away.
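For example, a later session on that same container might look roughly like this (a sketch; note that formatting the namenode is skipped):
docker start hello-hadoop
docker attach hello-hadoop
# inside the container: no 'hdfs namenode -format' this time
start-dfs.sh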
Create your home directory (we are logged in as user hadoop), and upload some files from the local filesystem (in the container) to the distributed filesystem (in this case, also in the container - but keep in mind that they are very different filesystems: one is a default Linux or Windows filesystem like ext4 or NTFS locally, the other runs HDFS on multiple machines; check again the processes that run, ps -ef, to see how many datanodes we have).
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -mkdir input
hdfs dfs -put etc/hadoop/*.xml input
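To check how many datanodes are alive, and that the upload actually landed in HDFS, you could for instance combine the process listing with a cluster report and a directory listing (a sketch):
ps -ef | grep '[D]ataNode'   # the [D] trick keeps grep from matching itself
hdfs dfsadmin -report        # reports live datanodes (a single one in this pseudo-distributed setup)
hdfs dfs -ls input           # the .xml files you just uploaded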
Provided that you used the container creation command above that
exposes port 9870 using -p 9870:9870
, you can access the namenode
Web UI in your browser.
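With that mapping in place, the namenode Web UI is reachable at http://localhost:9870 on the host. If you happen to have curl on the host, a minimal reachability check could look like this (purely a convenience, run it outside the container):
curl -sI http://localhost:9870/ | head -n 1   # expect an HTTP status line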
Yarn
On production clusters, you never run jobs by hand, but use a special service that manages the cluster resources. The main motivation here is to enable resource sharing among a team of data scientists who all want to claim disk and memory for their own analyses.
The default choice since Hadoop 2 is yarn, Yet Another Resource Negotiator. You start and stop yarn using start-yarn.sh and stop-yarn.sh, and issue Map Reduce jobs with yarn jar.
So, start the resource manager:
start-yarn.sh
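Once it is up, you can ask yarn which nodes it manages; in this pseudo-distributed setup you should see a single node manager:
yarn node -list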
Map Reduce
You submit a Map Reduce job to yarn
as follows:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar grep input output 'dfs[a-z.]+'
To really understand what is happening, you will have to look into the Map Reduce program that is executed, but let us first complete this walk-through before we look into the details. For now, it is sufficient to realize that grep applies a regular expression over the input data and returns the entries that match the regular expression (dfs[a-z.]+), a task that you can also achieve with the UNIX command-line utility grep, which is probably familiar to you.
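Just to illustrate the analogy (not how you would process big data, of course), roughly the same counts can be computed over the local copies of the configuration files with plain UNIX tools; a sketch, assuming GNU grep:
grep -Eoh 'dfs[a-z.]+' etc/hadoop/*.xml | sort | uniq -c | sort -rn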
When the program has completed, you have to move the results from the distributed filesystem onto your local filesystem before you can access them for further analysis:
hdfs dfs -get output output
cat output/*
Always be careful when copying results from HDFS, as they can be much larger than your local filesystem can hold! In this specific case, where you know you have small input and therefore also small output (in my case, the log states Bytes Written=29), it is safe to do so; you could even usefully apply hdfs dfs -cat output/* instead.
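If you are unsure about the size of a result, ask HDFS before copying anything; for example:
hdfs dfs -du -h output   # human-readable size of the job output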
You may want to run more Hadoop examples; I like the one that estimates the value of 𝛑 with a (quasi-) Monte Carlo simulation:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 16 10000
If you increase the number of map tasks (from 16) and/or their sample size (from 10000), you will improve the estimate, but the code will run longer. This gives a good opportunity to inspect the running job in the resource manager Web UI, exposed at port 8088. If you created the container with the -p 8088:8088 parameter, you can access this Web UI in your browser. The resource manager tells you the cluster state, the applications that run and have run, etc.
Note: because we do not have a real cluster (only a pseudo-distributed one), links to the application logs will not work correctly.
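If you want to experiment, a larger run might look like the sketch below (32 map tasks of 100000 samples each, arbitrary numbers); you can also list applications from the command line instead of the Web UI:
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 32 100000
yarn application -list -appStates FINISHED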
Wrap up
Now read up on the other command options to hdfs dfs
, online at the
Hadoop documentation site, or in your container (press ‘q’ to exit
less
):
hdfs dfs -help | less
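A few options that tend to be useful while exploring (just a sample):
hdfs dfs -ls -R -h /user/hadoop   # recursive listing with human-readable sizes
hdfs dfs -df -h                   # capacity and usage of the distributed filesystem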
When you have seen enough, stop yarn
and the HDFS cluster:
stop-yarn.sh
stop-dfs.sh
You may then detach from the container by pressing ^D.
When you restart the container later, you may have to restart the SSH daemon using sudo /opt/ssh/sbin/sshd; check ps -ef to find out whether you have to. Next, you restart the cluster using start-dfs.sh and start-yarn.sh.
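Put together, a restart after a break might look roughly like this (a sketch; start sshd only if ps shows it is not running):
ps -ef | grep '[s]shd' || sudo /opt/ssh/sbin/sshd
start-dfs.sh
start-yarn.sh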
If you want to close the terminal but keep the container running in the background until you re-attach, you can also use the special key sequence ^p^q (but do re-attach and stop yarn and HDFS gracefully before you shut down the computer, or the container may end up in an undefined state).
Should you really want to install additional software in the container, the basis is a ubi8/ubi-minimal container where you install software with microdnf. Because you run as user hadoop, you need to use sudo and enter rubigdata2021 as the password.
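For example, installing an editor could look like this (nano is just an arbitrary example package):
sudo microdnf install -y nano   # enter rubigdata2021 when sudo asks for the password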
Never hesitate to ask for help in the Matrix room if you get stuck!
Back to Assignment Two.