CommonCrawl

The final project is an open assignment in which we will work with data from a large web crawl.

The “No Worries” Disclaimer

The purpose of the assignment is only to gain hands-on experience with biggish data on a smallish cluster.

Self-Contained Apps

In Part I you developed an idea of what you might do with the WARC files from the CommonCrawl. Before we can move on and put that idea into working code on a multi-node cluster, we need to learn how to create a standalone application.

So far, we have only used Spark in an interactive setting, making best use of the user-friendly Zeppelin notebooks.

You should definitely develop your project that way: locally, on your own machine, using small test data taken as samples from the real CommonCrawl.

However, a cluster is usually shared by many people, and this model does not work that well with interactive notebooks - they tend to claim too many resources for a long time, and it is unclear when a job will complete. Also, query optimization may be hindered when the program logic is spread over multiple cells that are submitted as isolated operations.

As a result, “real” big data processing rarely happens directly from within your notebook. Instead, we need to create a standalone program that is then deployed on the cluster. Your course project container has been set up to simplify this process, and you can go through the full procedure on your own machine.

Scala Build Tool

We will use the Scala Build Tool (sbt) to reduce the effort needed to create standalone applications that are bundled with all their dependencies and can be submitted to multiple worker nodes without problems.

Briefly skim this blog post on sbt to get an idea of what it is like; it is somewhat similar to using Maven for Java development.

RUBigDataApp

We have put a ready-to-use template directory structure in /opt/hadoop/rubigdata in your container (I will assume you named the container cc, but you may of course have used a different name when you issued docker create).

Glance over the Spark project’s documentation on creating self-contained Spark apps and subsequently submitting applications to Spark. Eventually we will return to using the Yarn resource manager as in assignment two (see spark-submit for a yarn cluster), but for now we keep it simple and use only Spark itself, without Hadoop and Yarn.
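For later reference: on the cluster, the main difference is the spark-submit invocation. Submitting the sample app discussed below to Yarn would look roughly like this - the exact options depend on the cluster configuration, so treat this only as a preview:

spark-submit --master yarn --deploy-mode cluster \
  --class org.rubigdata.RUBigDataApp \
  target/scala-2.12/rubigdataapp_2.12-1.0.jar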

The official documentation remains verbose and incomplete, so let us walk through the process with an actual (but overly simple) example of a Spark application written in Scala.

You will find the sample application in a sub-directory of /opt/hadoop/rubigdata, the working directory you land in when entering the container:

docker exec -it cc /bin/bash

The two inputs to building the trivial sample app RUBigDataApp are the build.sbt file (the sbt equivalent of Maven’s pom.xml) and the actual code, provided in src/main/scala/org/rubigdata/RUBigDataApp.scala. The directory structure is important; sbt makes many implicit assumptions about what is located where - for the details, look into the sbt documentation.
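For orientation, a minimal build.sbt for a project like this looks roughly as follows; the actual file in the container may differ, in particular in the exact Scala and Spark versions:

  name := "RUBigDataApp"
  version := "1.0"
  scalaVersion := "2.12.12"

  // version number is illustrative; check the file provided in the container
  val sparkV = "3.0.1"
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkV % "provided",
    "org.apache.spark" %% "spark-sql"  % sparkV % "provided"
  )

The "provided" qualifier tells sbt that Spark itself will be available at runtime (supplied by spark-submit), so it does not have to be packaged into your own jar.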

Look at the provided RUBigDataApp.scala code so you know what output to expect!
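If you prefer a mental model first: a self-contained Spark application that produces output like the one shown further below typically looks roughly like this. The input file here is a made-up example, and the provided code may differ in such details:

  package org.rubigdata

  import org.apache.spark.sql.SparkSession

  object RUBigDataApp {
    def main(args: Array[String]): Unit = {
      // hypothetical input file, just to make the sketch self-contained
      val inputFile = "/opt/hadoop/rubigdata/README.md"
      val spark = SparkSession.builder.appName("RUBigDataApp").getOrCreate()
      val lines = spark.read.textFile(inputFile).cache()
      val numAs = lines.filter(_.contains("a")).count()
      val numEs = lines.filter(_.contains("e")).count()
      println("########## OUTPUT ##########")
      println(s"Lines with a: $numAs, Lines with e: $numEs")
      println("########### END ############")
      spark.stop()
    }
  }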

Build the standalone application (the very first time you execute this in your container, it will take longer than subsequent runs):

sbt package

The result is a jar file in the target/scala-2.12 directory, which we execute using Spark’s spark-submit command:

spark-submit target/scala-2.12/rubigdataapp_2.12-1.0.jar

In between all the logging information that scrolls by, you will see the app’s output, delineated by lines of ##### to make it stand out:

########## OUTPUT ##########
Lines with a: 3, Lines with e: 4
########### END ############

Packaging and Including External Libraries

It makes no sense to write code from scratch for subtasks like parsing WARC files. However, it can be tricky to set up a distributed system such that all the dependencies on external libraries are actually found when executing the program.

To include external libraries, two sbt plugins help simplify this workflow: sbt-assembly and sbt-pack (follow the links to their GitHub repositories).

With sbt-assembly, you easily create a single large Java archive (or jar, referred to as a fat jar) that includes all dependencies, without having to worry about setting the right CLASSPATH and/or adding all the correct parameters for including libraries when issuing spark-submit commands.

Once the plugin is enabled, you create a fat jar using the command sbt assembly instead of sbt package. The assembly plugin is enabled by the file plugins.sbt in the directory rubigdata/project, which contains the line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
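With that in place, building and submitting work just as before, except that you run sbt assembly and submit the resulting fat jar. By default, sbt-assembly derives the jar name from the project name and version, so the commands will look roughly like this (verify the actual file name under target/scala-2.12):

sbt assembly
spark-submit target/scala-2.12/RUBigDataApp-assembly-1.0.jar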

The course container takes care of further dependencies on your behalf, and unless you want to use other libraries, you may not need to know more. However, let’s describe this setup in more detail, in case you ever need to add an external library yourself.

First, we have the dependencies listed in the rubigdata/build.sbt file, where the last two entries specify that we use the JSoup library, resolved the standard way from a public repository, and the HadoopConcatGz library, provided as a local Java archive:

  libraryDependencies ++= Seq(
	"org.apache.spark" %% "spark-core" % sparkV % "provided",
	"org.apache.spark" %% "spark-sql"  % sparkV % "provided",
	"org.apache.hadoop" %  "hadoop-client" % hadoopV % "provided",
	"org.jsoup"         % "jsoup"          % "1.11.3",
	"com.github.helgeho" % "hadoop-concat-gz" % "1.2.3-preview" from "file:///opt/hadoop/rubigdata/lib/hadoop-concat-gz-1.2.3-preview.jar"
  )
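As an aside, to illustrate what JSoup buys you: extracting the visible text from the HTML payload of a WARC record essentially becomes a one-liner. The helper below is a hypothetical illustration, not part of the template:

  import org.jsoup.Jsoup

  // hypothetical helper: parse an HTML payload and keep only its visible text
  def htmlToText(html: String): String =
    Jsoup.parse(html).body().text()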

The second place where dependencies are specified is in the Zeppelin interpreter settings. Of course, it is essential to keep these two places consistent, or your standalone program would use different libraries than the code running in the notebook.

Finally, the sbt-pack plugin can be useful: it gathers all dependent libraries in a single location under target/pack, which makes it easier to copy these jars into the Zeppelin docker container, for example.
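Like sbt-assembly, sbt-pack is enabled through an addSbtPlugin line in rubigdata/project/plugins.sbt; the coordinates below are those used by the sbt-pack project, but double-check the exact version against its README:

addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.14")

After enabling it, running sbt pack gathers the jars under target/pack.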

Next steps

Before trying to analyze CommonCrawl data, it wouldn’t hurt to turn code from one of your previous assignments into a standalone app. Assignment 5 is especially suited to deployment this way, as the output may actually be easier to interpret on the command line than in the Zeppelin notebook.

You may want to include a brief discussion, perhaps using a concrete example, in the introductory part of your blog post, taking your readers along on the road from interactive, small-scale preparatory experiments to actual use of a large cluster. You don’t have to do this, but just in case you want to do so later on: with a larger project like this, always keep a log of your progress!

Part III

With the knowledge from Part I on manipulating WARC files in Spark and your new skills in building self-contained apps, you are ready for Part III, where we run your final programs on a real cluster.

More information to follow.

Back to assignments overview / part I