Hadoop is an open-source project that implements the Big Data frameworks discussed in lectures 2-4 (distributed filesystems and Map-Reduce).

In this assignment, we install Hadoop on our own “pseudo-cluster”, and use Map-Reduce to do some basic count operations over texts from Shakespeare.

You accept the assignment via GitHub for Education, using this invitation link (do this only once per assignment, and select your student number from the list).

Objectives:

  1. Set up a pseudo-distributed cluster running HDFS
  2. Learn to use the commands to work with HDFS
  3. Implement and run your own map-reduce jobs

A suggested timeline would be to tackle the first two objectives in week one and the final one in week two; if you can go at a faster pace, that is fine.

Make sure you understand how the concepts in the main lectures relate to the steps in the assignment, and do not just rush through without thinking.

Setup

First, set up the HDFS distributed filesystem and the Map-Reduce tools using our Hadoop tutorial. Work through the tutorial to the end, until you can start and stop an HDFS cluster on your own machine (inside the course Docker container).

Background information on HDFS is useful to further your understanding of what you have been doing in the tutorial.

Make sure that you understand conceptually what is going on (refer back to the lecture material where needed) and that you know how to use the filesystem: create files, read them, delete files and directories, etc.
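
For example, a few representative commands (the file and directory names here are just stand-ins; adapt them to your own setup):

echo "hello HDFS" > notes.txt
hdfs dfs -mkdir -p playground
hdfs dfs -put notes.txt playground/
hdfs dfs -ls playground
hdfs dfs -cat playground/notes.txt
hdfs dfs -rm -r playground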

Run the grep examples with a different pattern on a different set of files to validate that you know what you are doing. You have then achieved the first two objectives of assignment two, and are ready for the real work!
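
As a concrete illustration of such a rerun (the jar version depends on your Hadoop install, and the books directory and pattern are only examples), from /opt/hadoop:

hdfs dfs -put mytexts books
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep books grep-out 'lov[a-z]+'
hdfs dfs -cat grep-out/part-r-00000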

Your own Map-Reduce jobs

These instructions assume you have completed the previous steps: you have created a pseudo-cluster that runs HDFS.

Download the Complete Shakespeare from GitHub and save it to HDFS:

cd /opt/hadoop
wget https://raw.githubusercontent.com/rubigdata-dockerhub/hadoop-dockerfile/master/100.txt
hdfs dfs -put 100.txt input

When you clone your assignment repository (from GitHub Classroom), you find an example WordCount.java. You can also extract it from the examples archive (use jar tf and jar xf), or simply copy-paste the source code from the relevant Hadoop documentation.

Whichever way you choose to obtain the template code for the example word-counting job, ensure that WordCount.java ends up in the /opt/hadoop directory.
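
For reference, the template is essentially the WordCount v1.0 example from the Hadoop Map-Reduce tutorial, which looks like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum all counts collected for a word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // safe here: summing is associative
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}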

Compile the WordCount code and package the classes into a jar:

hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
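
If the first command instead complains that it cannot find the Java compiler, the official Hadoop tutorial sets up the environment along these lines (the JAVA_HOME path is an assumption; check what applies inside your container):

export JAVA_HOME=/usr/java/default
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar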

You can ignore the two bad substitution errors:

/opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_COM.SUN.TOOLS.JAVAC.MAIN_USER: bad substitution
/opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_COM.SUN.TOOLS.JAVAC.MAIN_OPTS: bad substitution

Now, you are ready to run the WordCount code:

yarn jar wc.jar WordCount input output

The output of your job is now located in the output directory on HDFS and can be inspected as you did in the Hadoop tutorial.
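
For example (the part file name can differ if you run with more than one reducer):

hdfs dfs -ls output
hdfs dfs -cat output/part-r-00000 | head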

If you previously ran the tutorial in the same container, you already created a directory output, which now causes an error along the lines of org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists. Remove the old output directory (from HDFS) and try again! If you see “weird” output, do check what input you actually ran your program on…

Now, adapt the code to count something else: the number of lines, words, or characters, or something more interesting; it is really up to you to decide what to count!
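
As an illustration only, here is a hypothetical mapper (StatsMapper is my name, not part of the provided code) that counts lines, words and characters in a single job; paste it inside WordCount.java (which already imports StringTokenizer) and pair it with the existing IntSumReducer:

  // Hypothetical mapper: for every input line, emit ("lines", 1),
  // ("words", n) and ("chars", length); IntSumReducer adds them up.
  public static class StatsMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final Text metric = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      metric.set("lines");
      context.write(metric, new IntWritable(1));
      metric.set("words");
      context.write(metric, new IntWritable(new StringTokenizer(line).countTokens()));
      metric.set("chars");
      context.write(metric, new IntWritable(line.length()));
    }
  }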

Refer to the Map-Reduce documentation, in particular the WordCount v1.0 walkthrough in the tutorial, for detailed information.

PS: given that we have 150+ students, I would be really happy to read one or two blogs on counting co-occurrences!
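
If you try that, one natural starting point is the “pairs” pattern from the lectures; a rough, hypothetical sketch (the class name, the per-line co-occurrence window, and the tokenization are all choices you should revisit):

  // Hypothetical "pairs" mapper: emits (wordA,wordB) -> 1 for each pair of
  // words co-occurring on the same line; sum with the usual IntSumReducer.
  public static class CooccurrenceMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().toLowerCase().split("\\W+");
      for (int i = 0; i < tokens.length; i++) {
        for (int j = i + 1; j < tokens.length; j++) {
          if (tokens[i].isEmpty() || tokens[j].isEmpty()) continue;
          pair.set(tokens[i] + "," + tokens[j]);
          context.write(pair, ONE);
        }
      }
    }
  }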

Blog post

The assignment is to write a blog post about your experience with HDFS and Map-Reduce. Assume the reader knows what a distributed filesystem is and why you would use one. You should walk your readers through a simple counting example using Map-Reduce, illustrated by your own Java code (no matter how small the change to the provided example).

Address at least the following questions in your post:

  • What happens when you run the Hadoop commands (hdfs dfs etc.) in the first part of the tutorial?
  • How do you use Map-Reduce to count the number of lines/words/characters/… in the Complete Shakespeare?
  • Does Romeo or Juliet appear more often in the plays? Bonus (++): can you answer this question making only one pass over the corpus? One possible direction is sketched below.
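
For the bonus question, a deliberately minimal, hypothetical sketch of the single-pass idea: have the mapper emit counters only for the names of interest, so one job yields both totals to compare (handling of e.g. “Romeo’s” is left to you):

  // Hypothetical single-pass mapper: emits ("romeo", 1) or ("juliet", 1)
  // whenever either name occurs; one run produces both totals to compare.
  public static class NameCountMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text name = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().toLowerCase().split("\\W+")) {
        if (token.equals("romeo") || token.equals("juliet")) {
          name.set(token);
          context.write(name, ONE);
        }
      }
    }
  }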

You might want to make use of screenshots taken from the NameNode’s web UI and/or YARN’s ResourceManager UI to make clear what is going on “inside”.

If things go smoothly, try to compute the average number of words or characters per line using the patterns we discussed in the second Map-Reduce lecture. If things go really smoothly, try to use a combiner and discuss the improvement achieved and/or the problems encountered.
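
A hypothetical sketch of the straightforward, combiner-unfriendly variant: the mapper emits every line’s word count under one key, and the reducer averages. Note that this reducer cannot double as a combiner, because an average of partial averages is not the overall average; the (sum, count) pair pattern from the lecture is the usual fix.

  // Hypothetical job parts for average words per line. Add
  // import org.apache.hadoop.io.DoubleWritable; and, in main(), call
  // job.setMapOutputValueClass(IntWritable.class) and
  // job.setOutputValueClass(DoubleWritable.class); set NO combiner.
  public static class LineLengthMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private static final Text KEY = new Text("avg-words-per-line");

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      int words = new StringTokenizer(value.toString()).countTokens();
      context.write(KEY, new IntWritable(words));
    }
  }

  public static class AverageReducer
       extends Reducer<Text, IntWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (IntWritable v : values) { sum += v.get(); count++; }
      // Averaging is fine in the single reducer, but NOT in a combiner.
      context.write(key, new DoubleWritable((double) sum / count));
    }
  }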

Done

When you have completed the assignment, push your blog post to the first assignment’s repository and include a link to the published blog post in the README of this assignment’s repository. Commit and push the updated README and your code to the assignment repository.

In other words:

Instructions to submit your completed work (replace USERNAME by your github account):

  • Write your blog in the blogpost repository you made for assignment 1. This repository is located at https://github.com/rubigdata/big-data-blog-2021-USERNAME
  • Make sure your blog is published and the post is accessible from https://rubigdata.github.io/bigdata-blog-2021-USERNAME
  • Place a link to the published blogpost (for example https://rubigdata.github.io/bigdata-blog-2021-USERNAME/assignment2) in the README.md of your assignment 2 repository, which is located at https://github.com/rubigdata/hello-hadoop-2021-USERNAME
  • Add your code, commit your modifications, and push both repositories (the blog repo and the assignment repo)

When you’re done, submit the URL of your blog post to PeerGrade (through Brightspace).

Help?!

Feel free to ask for help, but please also use the GitHub issue tracker on the forum; see the first issue for an example of how to proceed. Every student may help out: please contribute and share your knowledge!

Back to assignment overview.
