CommonCrawl

For the final project, we are going to use an AWS EMR cluster in the cloud to work with the CommonCrawl data.

CommonCrawl is distributed on AWS cloud storage through an S3 bucket in the us-east-1 region. By running the cluster in the same region, we can work directly on S3 instead of copying the complete crawl to HDFS. Should your project analyze a subset in greater detail, requiring multiple passes over the data, do not hesitate to request a copy of the subset in HDFS.

Using the Crawl Index

The CommonCrawl foundation supports a project to simplify the selection of subsets of the crawl using SQL. An introductory blog post describes the general setup. The cc-index repository includes a wide variety of example scripts. Especially useful for us are the example Spark Java programs to copy selections from the index and/or WARCs to another S3 bucket for further processing.
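To give a first impression of what such a selection looks like, here is a minimal SparkSQL sketch against the columnar index; the S3 location and the column names (crawl, subset, url_host_tld, warc_*) follow the cc-index-table documentation, so double-check them against the example repository before building on this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cc-index-select").getOrCreate()

// The columnar index is a Parquet table, partitioned by crawl and subset.
val index = spark.read.parquet("s3://commoncrawl/cc-index/table/cc-main/warc/")
index.createOrReplaceTempView("ccindex")

// Select the WARC locations of, e.g., all pages from .nl hosts in one crawl;
// filtering on the partition columns (crawl, subset) keeps the scan affordable.
val nlPages = spark.sql("""
  SELECT url, warc_filename, warc_record_offset, warc_record_length
  FROM ccindex
  WHERE crawl = 'CC-MAIN-2020-40'
    AND subset = 'warc'
    AND url_host_tld = 'nl'
""")
nlPages.show(10, truncate = false)

The resulting (filename, offset, length) triples are exactly what you need to locate the corresponding records in the WARC files.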

Using the Crawl Data

Working with the CommonCrawl datasets on EMR is discussed in a pretty good tutorial post (disclaimer: from 2017). Another useful example uses s3distcp to copy data from the CommonCrawl S3 bucket onto the Hadoop filesystem in the EMR cluster (disclaimer: from 2015).

Unless you have special requirements on the data to use, I suggest taking the most recent crawl ([September 2020][cc-2020-40-blog]), available on S3: over 80 TB of WARC files, in addition to the derived WAT and WET versions.
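If you only need a handful of WARC files to develop against, one way to pick them is from the crawl's own file listing; a sketch, assuming the standard CommonCrawl bucket layout (the sample size of 5 is an arbitrary choice):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cc-sample").getOrCreate()

// Each crawl publishes a gzipped listing of its WARC files,
// with paths relative to the bucket root.
val paths = spark.sparkContext
  .textFile("s3://commoncrawl/crawl-data/CC-MAIN-2020-40/warc.paths.gz")

// Turn a small sample into fully qualified S3 URIs; Hadoop/Spark input APIs
// accept such a comma-separated list of paths as a single input string.
val sample = paths.take(5).map(p => s"s3://commoncrawl/$p").mkString(",")
println(sample)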

The emphasis in our assignment is on learning to process the WARC files; the scale of the job is of secondary importance.

Getting started

Because Zeppelin did not behave well with our multi-tenancy setup, this section has been modified to let you do more of the exploration using notebooks in your own docker container.

On the cluster, you will have to use spark-submit with standalone applications, at least for now.

The following two notebooks can help you develop an application for the cluster. We start with Warc for Spark II:

Download the notebook and import it into the Zeppelin running in your container.

This first notebook requires one additional setup step, because it uses different dependencies (the WarcReader we used earlier in the course does not work well with Hadoop v3). To finish preparing your container before running the notebook, do docker exec -it cc /bin/bash and issue:

wget http://rubigdata.github.io/course/background/hadoop-concat-gz-1.2.3-preview.jar
mv hadoop-concat-gz-1.2.3-preview.jar /zeppelin/lib

The imported notebook file should work correctly on your local machine; its dependencies assume you put the jar file at /zeppelin/lib. The notebook's instructions for creating an HDFS filesystem are of course only to be carried out in your docker container; on the real cluster, HDFS already exists.
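As a rough sketch of what loading WARC data with this dependency looks like (the class and package names are those of the HadoopConcatGz library that the jar provides, and the input path is a placeholder; the notebook itself remains the authoritative reference):

import org.apache.hadoop.io.NullWritable
import de.l3s.concatgz.io.warc.{WarcGzInputFormat, WarcWritable}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("warc-load").getOrCreate()

// Placeholder path: a WARC file in your container's HDFS, or an
// s3://commoncrawl/... path once you run this on the cluster.
val warcInput = "hdfs:///user/zeppelin/sample.warc.gz"

// HadoopConcatGz reads the concatenated-gzip WARC layout record by record.
val warcs = spark.sparkContext.newAPIHadoopFile(
  warcInput,
  classOf[WarcGzInputFormat],  // input format provided by the downloaded jar
  classOf[NullWritable],       // key (unused)
  classOf[WarcWritable]        // value: wraps a single WARC record
)

println(s"Number of WARC records: ${warcs.count()}")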

About a week ago, imagining that Zeppelin would work fine on the cluster, I prepared a second notebook that would have been very handy… but even now, it may still be useful for inspiration. The notebook demonstrates how to use CommonCrawl's CC-INDEX to identify interesting subsets of their data:

This notebook does not run well from home, unfortunately, because you’d have to read a lot of data from S3, which would not be free from home (note: it is free on the cluster, because data and cluster reside in the same AWS region). The notebook can still help kick-start code to select a subset of WARCs to work with in your subsequent analysis; it will not be that difficult to include parts of it in your standalone application once your project is ready to scale to larger sets of WARCs.
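To sketch what that kick-start could look like (this is not taken from the notebook): given the warc_filename, warc_record_offset and warc_record_length values from the index, each record is stored as its own gzip member, so it can be fetched with a ranged read and decompressed on its own. A hedged sketch using the Hadoop filesystem API:

import java.io.ByteArrayInputStream
import java.net.URI
import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Fetch a single WARC record, given its location as selected from the index.
def fetchRecord(filename: String, offset: Long, length: Int, conf: Configuration): String = {
  val fs = FileSystem.get(new URI("s3://commoncrawl/"), conf)
  val in = fs.open(new Path(s"s3://commoncrawl/$filename"))
  try {
    val buf = new Array[Byte](length)
    in.readFully(offset, buf)   // ranged read: only these bytes travel over the network
    scala.io.Source
      .fromInputStream(new GZIPInputStream(new ByteArrayInputStream(buf)))(scala.io.Codec.ISO8859)
      .mkString                 // Latin-1 avoids charset surprises in this sketch
  } finally in.close()
}

// Call it with filename/offset/length values taken from the index selection,
// using new Configuration() (which picks up the cluster defaults) as the last argument.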

A Zeppelin notebook is not really suited for running jobs that take a long time, e.g., on larger amounts of data. The other component you need is the updated assignment instructions for the final project, where you learn how to turn a notebook into a so-called fat jar that you submit to the cluster from the command line. This is what you should do when you are going to process more WARC files than would fit comfortably on a single node.
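Schematically, such a standalone application is nothing more than a Scala object with a main method that creates its own SparkSession; a minimal skeleton (package, object and app names below are placeholders):

package org.example.rubigdata   // placeholder package

import org.apache.spark.sql.SparkSession

object RUBigDataApp {
  def main(args: Array[String]): Unit = {
    // Master, memory and executors are set by spark-submit on the cluster,
    // so do not hard-code them here.
    val spark = SparkSession.builder().appName("RUBigDataApp").getOrCreate()

    // ... the analysis you developed in the notebook goes here ...

    spark.stop()
  }
}

Package it into a fat jar (e.g. with sbt assembly) and hand it to spark-submit with the fully qualified object name as --class; the assignment instructions walk through the details.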

Links to this lecture material, in addition to the instructions released for Part I:

See also

Related sources:

Back to the EMR student access page or on to the cluster page.