Commoncrawl

Assignment 4 is an open assignment, where we will work with data on the national supercomputer infrastructure managed by SurfSara.

Follow this link to the Classroom for GitHub Commoncrawl assignment, log in with your GitHub account, and accept the assignment.

Disclaimer (a.k.a. “no worries”)

We have yet to see how far we will get; the main objective is not so much to prepare a winning submission for the next Norvig Award, but simply to gain hands-on experience in running and debugging jobs on a shared, managed cluster instead of on our own laptops or desktops (which at best emulate a real cluster, and in the worst case mislead you into underestimating the problems of working on actual Big Data).

Preparations

We work on the national Hadoop cluster. Instead of following the Obtaining an account instructions, you will receive your account info from me.

After receiving your account info, please change your password as soon as possible.

Initial setup

Let us work through the basics discussed on the Hadoop cluster usage page.

Start by pulling and running the provided Docker image:

docker pull surfsara/hathi-client

docker run -it surfsara/hathi-client

Kerberos authentication

Because the national cluster serves many users who do not all have access to the same data, authentication is rather strict; it is provided through MIT Kerberos.

Inside a Docker container, this works smoothly: use your credentials (obtained from me) to get a Kerberos ticket by following the SurfSara instructions. The capitals in CUA.SURFSARA.NL are an important detail!
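For reference, obtaining and checking a ticket boils down to the following two commands (the username is made up; substitute your own):

kinit jdoe@CUA.SURFSARA.NL
klist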

To authenticate through Kerberos on a Linux or Windows machine outside the Docker image (which will be necessary to use the ResourceManager from your web browser), follow these steps (provided without warranty, by me).
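As a rough sketch of what those steps amount to (the package name and the trusted domain are assumptions; consult the linked instructions for the authoritative version): install the Kerberos client tools, obtain a ticket for the CUA.SURFSARA.NL realm, and allow Firefox to use it for negotiate authentication.

sudo apt-get install krb5-user    # Debian/Ubuntu; use your distribution's equivalent
# you may need to add the CUA.SURFSARA.NL realm to /etc/krb5.conf
kinit jdoe@CUA.SURFSARA.NL
# in Firefox, open about:config and add the cluster domain (assumed here)
# to network.negotiate-auth.trusted-uris, e.g. .surfsara.nl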

Setting up Spark environment

Spark Notebook is not supported on the national Hadoop cluster, but Spark is.

Next, let us go through the basics of running a Spark job on the cluster. Quickly scan the SurfSara-specific instructions; to install Spark in the Docker image, however, proceed with the specific instructions given here:

cd hathi-client
bin/get.sh spark

From now on, initialize the right environment for working with Spark by issuing the following command:

eval $(/hathi-client/bin/env.sh)

(You may want to add this to ${HOME}/.bashrc so it gets executed next time you run bash using docker exec.)
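For example, inside the container (the eval line is taken verbatim from above):

echo 'eval $(/hathi-client/bin/env.sh)' >> ${HOME}/.bashrc

To re-enter a running container from the host later, look up its id with docker ps and attach a new shell with docker exec -it <container-id> bash.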

First steps on Hathi

Now try a simple directory listing of the most recent crawl:

hdfs dfs -ls /data/public/common-crawl/crawl-data/CC-MAIN-2016-07
hdfs dfs -ls -h /data/public/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc

Then try out the standard example that estimates Pi by random sampling. (The SurfSara cluster requires jobs to be submitted through YARN; just ignore the warning about yarn-cluster being deprecated.)

cd spark
MASTER=yarn-cluster bin/run-example SparkPi

If you managed to get a Kerberos ticket and configured your Firefox correctly, you can view the application state in the Resource Manager, in a view very similar to this screenshot: ResourceManager
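If you cannot reach the web interface, the command line works too: yarn application -list shows your applications, and once an application has finished you can retrieve its logs (substitute the application id that was reported when the job was submitted):

yarn application -list
yarn logs -applicationId <application-id>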

Using Spark on the cluster

If this succeeded, we are ready to use the cluster for real!

Self-contained Spark applications

Before moving on to the final project work, carry out the course-provided instructions on executing self-contained Spark applications. These help you run a simple Spark application on the cluster from inside your SurfSara Docker container.
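Once your own application is packaged into a jar, submitting it to the cluster will look roughly like this, run from the spark directory like the SparkPi example above (the class and jar names are placeholders for your own project; the linked instructions give the exact invocation):

bin/spark-submit --master yarn-cluster --class org.example.MyApp myapp.jar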

Assignment

Now that you have gotten this far, it is time to get creative and adapt the provided setup and files to the needs of your own project.

The assignment is very open-ended: “do something with the Commoncrawl data” and write up your experience in a final blog post.

Do not forget to check out the Commoncrawl foundation’s excellent Get Started guide and other tutorials.

To identify test data of interest, I recommend using the public CDX Index service to locate a few WARC files to download for analysis and code development on your local machine: for example, find the BBC captures in the crawl, and check out the corresponding WARC files from the crawl on Hathi, e.g.,

hdfs dfs -ls /data/public/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701165302.57/warc/CC-MAIN-20160205193925-00246-ip-10-236-182-209.ec2.internal.warc.gz
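For example, a CDX query along these lines (treat the exact parameters as a sketch; the CDX index documentation has the details) lists the BBC captures in this crawl, and hdfs dfs -get copies the corresponding WARC file from the listing above to your local machine for development:

curl 'http://index.commoncrawl.org/CC-MAIN-2016-07-index?url=bbc.co.uk/*&output=json' | head
hdfs dfs -get /data/public/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701165302.57/warc/CC-MAIN-20160205193925-00246-ip-10-236-182-209.ec2.internal.warc.gz .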

You can initially work on a small WARC file that you create yourself, e.g., using webrecorder.io or a recent wget with WARC output (included in the SurfSara docker image):

wget -r -l 3 "http://rubigdata.github.io/course/" --warc-file="course"
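This should leave you with a gzip-compressed file course.warc.gz; a quick way to sanity-check its contents is to peek at the record headers:

zcat course.warc.gz | head -n 30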

WARC for Spark

The WARC for Spark notebook will be helpful for playing with WARC files in a Spark Notebook; it provides a more flexible and efficient way to develop and debug new code, independent of the large cluster.

You can view the notebook (rendered almost correctly) with the IPython notebook viewer.

Note: the notebook relies on a collection of jar files for its external dependencies.

Other references

SurfSara and the Commoncrawl foundation have provided useful utility code in the Norvig Award GitHub repository (including classes for a WARCInputFormat). Their code is based on the jwat libraries; inspecting that code may be useful to understand the WarcRecord classes.
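When your own code builds on these classes, the corresponding jars have to be shipped along with your job; extending the spark-submit invocation sketched earlier, that would look something like the following (the jar file names are placeholders for the versions you actually build against):

bin/spark-submit --master yarn-cluster --class org.example.MyApp --jars jwat-common.jar,jwat-warc.jar,warcutils.jar myapp.jar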

The International Internet Preservation Consortium (IIPC) provides utility code for OpenWayback, an open version of the Internet Archive’s Wayback Machine. Other related pointers to help you get going include Jimmy Lin’s Warcbase project and L3S’s recent ArchiveSpark. Finally, before developing your own code for specific tasks, check the Spark Packages community index to see whether your problem has already been (partially) solved before.

Blog post

Imagine a reader who has followed your previous experiences with Spark. The goal of this post is to share your work on the Commoncrawl data with them, emphasizing the differences between running a prepared notebook locally and carrying out a real project in a less familiar remote environment.

Readers will be curious to learn basic statistics about the crawl, but also how long it takes to make a pass over the data, both on average and under peak load.

Ideally, your analysis teaches us something about the Web we did not yet know; but that is not a requirement for completing the assignment.

Wishing you good big data vibes!

Final words

When you have completed the assignment:

  1. Push your blog post to the first assignment’s repository (in the gh-pages branch or it will not render).
  2. Include a link to the published blog post in the README of the A4 assignment repository.
  3. Push the updated README as well as your own code or notebook to the A4 assignment repository.

Need help?!

Feel free to ask for help, but please do so using the GitHub issue tracker on the forum; see the welcome issue for an example of how to proceed. Every student may help out, so please contribute and share your knowledge!

By the way, before I forget: students from a previous version of this course did win the Norvig Award in 2014! If you plan on winning the next edition (I am trying to convince the organizers to run it again), I would love to hear from you!

Back to assignments overview