Commoncrawl - Part I

The final project is an open assignment, where we will work with data from a large web crawl. In Part I we start small though!

Preliminaries

Follow this link to the Classroom for GitHub Commoncrawl assignment, log in with your GitHub account, and accept the assignment.

We created a new Docker image whose software versions should match those of the cluster we are setting up, aiming to simplify the transition into Part II of the project. Create the container like you did in assignment 3, only changing the Docker image’s tag to course:project:

docker pull rubigdata/course:project
docker create --name cc -it -p 8080:8080 -p 9001:9001 -p 4040:4040 rubigdata/course:project
docker start cc

Navigate in your browser to localhost:9001 to import the Zeppelin notebook distributed in the assignment repository (named WARC for Spark (2021).zpln).

Like before, you may also interact with the container itself: docker attach cc to view the logging output of the Zeppelin process and docker exec -it cc /bin/bash to get a bash shell inside the container.

Organization

The assignment consists of three components:

  1. Familiarize yourself with WARC files and their handling in Spark, using the WARC for Spark (2021) notebook.
  2. Learn how to create standalone Spark applications (part II).
  3. Develop your own project (see Assignment below) and deploy the project on a real cluster (more information to follow when the cluster is up and running).

The “No Worries” Disclaimer

The main objective is not so much to prepare a winning submission for the next Norvig Award (although you might, should SurfSara get the hang of organizing such a challenge once more - two of the four listed teams were students in a predecessor of this course, long ago when I was affiliated with Delft University).

Rather, we want to gain hands-on experience in running and debugging jobs that work on large amounts of unstructured data.

The course so far provided almost all the code to work with - for the final project, it is time for you to get creative and modify the setup and files provided for your own project’s sake. You will find that the gap between theory & scripted assignments and the reality of coding and real data is rather large; but also that you have acquired the competences to bridge that gap.

Assignment

The assignment is open-ended: “do something with the Commoncrawl data”, and write up your experience in a final blog post.

Overall objective

Possible projects can take a different focus:

  • Try to access a large part of the crawl with a simple analysis;
  • Carry out a more complicated multi-step analysis on a smaller sample of crawl data;
  • Study behaviour of the same program (your solution) under different workloads, or using different implementation variants;
  • Understand exactly how Spark optimizes your workload (see the sketch below).
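
For that last direction in particular, Spark’s own query plans are a good starting point. Below is a minimal sketch, not taken from the assignment notebook: it builds a small aggregation and asks Spark to print the plans it derived. The data and column names are made up purely for illustration, and spark is assumed to be the session that Zeppelin predefines.

// Minimal sketch (Scala, in a Zeppelin paragraph where `spark` is predefined):
// build a small aggregation and print the parsed, analysed, optimised and
// physical plans Spark derives for it. Data and column names are illustrative.
import spark.implicits._

val df = spark.range(0, 1000000).toDF("id")
val counted = df
  .filter($"id" % 2 === 0)
  .groupBy(($"id" % 10).as("bucket"))
  .count()

counted.explain(true)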

Try to write up something interesting for the blog, whether it be the idea you originally had (even if you had to scale it down), or an in-depth analysis of why things did not work out the way you wanted - if time’s up, time’s up, and I know how much time evaporates (especially when you run into seemingly unsolvable dependency problems or serialization errors).

Step 1

First, check out the Commoncrawl Foundation’s excellent Get Started guide and other tutorials.

Start your project using a small WARC file that you create yourself, e.g., using webrecorder.io or a recent wget with WARC output:

wget -r -l 3 "http://rubigdata.github.io/course/" --warc-file="course"

This last command is included in an %sh cell in the assignment’s notebook, but you should feel absolutely free to try other things! Copy the WARC file you create into the container, to cc:/opt/hadoop/rubigdata (e.g., using docker cp).

After a first acquaintance with WARC files and this notebook, look into the resources provided in Additional Background below.

Step 2

As a next step, identify test data of interest using the public CDX Index service (select a specific crawl, e.g. CC-MAIN-2021-17 for a very recent one, from April 2021) to locate a sample of WARC files to download for analysis and code development on your local machine. For example, find BBC captures in the crawl and check out the corresponding WARC files on Amazon S3: simply prefix https://commoncrawl.s3.amazonaws.com/ to the crawl-data/CC-MAIN-2021-17/... locations you get from the CDX server.

(You might as well try to use Spark’s JSON reader and map the prefixing over the CDX results, handling access to the data with Spark itself - a sketch of that idea follows below!)
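
Here is a minimal sketch of that idea, not taken from the assignment notebook: it reads CDX results saved as JSON lines and turns them into full download URLs. The index endpoint in the comment, the filename field, and the local file name bbc.cdx.json are assumptions based on the public CDX API; verify them against the index documentation.

// Minimal sketch (Scala, in a Zeppelin paragraph where `spark` is predefined).
// Assumes the CDX results were saved locally as JSON lines, e.g. with:
//   curl 'https://index.commoncrawl.org/CC-MAIN-2021-17-index?url=bbc.com/*&output=json' > bbc.cdx.json
// The field name "filename" and the local file name are assumptions; check the CDX output.
import spark.implicits._

val cdx = spark.read.json("bbc.cdx.json")

val warcUrls = cdx
  .select("filename")      // e.g. crawl-data/CC-MAIN-2021-17/segments/.../warc/....warc.gz
  .as[String]
  .map(path => "https://commoncrawl.s3.amazonaws.com/" + path)
  .distinct()

warcUrls.show(5, false)

You can then download one or two of these URLs for local experimentation.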

If this second step causes trouble on your machine (the WARC files can be large), you can try your code on a smaller WARC file, or simply defer this step until we have moved to working on the cluster.

Part II

When you have a good understanding of working with WARC files in Spark in the Zeppelin notebook, continue to Part II to learn how to create and run standalone Spark applications.

Workflow

The ideal project workflow in my opinion is to:

  1. Work out a simple idea on a single WARC file on your own machine, in a notebook
  2. Turn the notebook into a standalone application that runs on your own machine
  3. Run the standalone application on the cluster

And only then:

  1. Scale up the job by working on multiple WARC files or on a whole segment, or, maybe, if things really work out:
  2. Run your job on a full crawl.

Every time, reconsider whether you should adapt your plan - it could well be that your initial idea was way too ambitious. Not a problem at all! You’ll only have more to write about in your final blog post.

Additional Background

The International Internet Preservation Consortium (IIPC) provides utility code for OpenWayback, an open version of the Internet Archive’s Wayback machine. They also coordinate development of a better Java package for processing WARC files, called jwarc; feel free to explore whether that works better than the libraries we applied (a small sketch follows below)!
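
To give an impression of jwarc, here is a minimal sketch based on jwarc’s own documentation rather than on the course material: it iterates over a local WARC file and prints the status code and target URI of every response record. The file name course.warc.gz is just the output of the wget command above, and the exact API may have evolved, so double-check the jwarc documentation.

// Minimal sketch using jwarc (assumes org.netpreserve:jwarc is on the classpath).
// Iterates over a local WARC file and prints the status code and target URI of
// every HTTP response record; other record types (warcinfo, request, ...) are skipped.
import org.netpreserve.jwarc.{WarcReader, WarcResponse}
import java.nio.channels.FileChannel
import java.nio.file.Paths

val reader = new WarcReader(FileChannel.open(Paths.get("course.warc.gz")))
try {
  val it = reader.iterator()
  while (it.hasNext) {
    it.next() match {
      case response: WarcResponse =>
        println(s"${response.http().status()} ${response.target()}")
      case _ => // skip non-response records
    }
  }
} finally {
  reader.close()
}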

Other related pointers to help you get going include Jimmy Lin’s now outdated Warcbase project and L3S’s (seemingly inactive) ArchiveSpark. These may not be that easy to apply, but contain a lot of example code that might help.

Finally, before developing your own code for specific tasks, check the SparkPackages community index to see if your problem has already been solved (partially) before.

A few minor things:

The course has built up to giving you the tools to tackle this larger task. Don’t hesitate to refer back to the examples from previous assignments; your own blog posts may give you hints on how to solve the problems you encounter.

Your project is complete when you submit your blog in Peergrade and push your code to the assignment repository.

Find us in the Matrix room or the Forum when you need additional advice.

Back to Assignments overview / Part II.