Commoncrawl
The final project is an open assignment in which we will work with data from a large web crawl.
The “No Worries” Disclaimer
The purpose of the assignment is only to gain hands-on experience with biggish data on a smallish cluster.
Self Contained Apps
In Part I you developed an idea on what you might do with those WARC files from the CommonCrawl. Before we can move on and put that idea into working code on a multi-node cluster, we need to learn how to create a standalone application.
So far, we have only used Spark in an interactive setting, making best use of the user-friendly Zeppelin notebooks.
You should definitely develop your project like that: locally on your own machine using small test data, taken as samples from the real Commoncrawl.
However, a cluster is usually shared by many people, and this model does not combine well with interactive notebooks: they tend to claim many resources for a long time, and it is unclear when a job will complete. Query optimization may also be hindered when the program logic is spread over multiple cells that are submitted as isolated operations.
As a result, “real” big data processing rarely happens directly from within your notebook. Instead, we need to create a standalone program that is then deployed on the cluster. Your course project container has been set up to simplify this process, and you can go through the full procedure on your own machine.
Scala Build Tool
We will use the Scala Build Tool to reduce the effort necessary to create standalone applications that are bundled with all their dependencies and can be submitted to multiple worker nodes without problems.
Briefly skim this blog post on sbt to get an idea of what it is like; it is somewhat similar to using Maven for Java development.
RUBigDataApp
We have put a template directory structure ready to use in /opt/hadoop/rubigdata in your container (I will assume you named the container cc, but you might of course have used a different name when you issued docker create).
Glance over the Spark project’s documentation on creating self-contained Spark apps and on subsequently submitting applications to Spark. Eventually we will return to using the YARN resource manager as in assignment two (see spark-submit for a YARN cluster), but for now we keep it simple, using only Spark itself without Hadoop and YARN.
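For reference, submission to a YARN cluster mainly changes the spark-submit invocation. A hedged sketch only — the exact resource flags depend on the cluster configuration, and the jar name below is the one built later in this walkthrough:

```shell
# Sketch: submit the packaged app to a YARN cluster instead of local Spark
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  target/scala-2.12/rubigdataapp_2.12-1.0.jar
```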
The official documentation remains verbose and incomplete, so let us walk through the process with an actual (but overly simple) example of a Spark application written in Scala.
You find the sample application in a sub-directory of /opt/hadoop/rubigdata, the working directory when entering the container:
docker exec -it cc /bin/bash
The two inputs to building the trivial sample app RUBigDataApp are the build.sbt file (the sbt equivalent of Maven’s pom.xml) and the actual code, provided in src/main/scala/org/rubigdata/RUBigDataApp.scala.
The directory structure is important; there are many implicit assumptions about what is located where. For the details, look into the sbt documentation.
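Concretely, the template follows the standard sbt layout; the files mentioned in this walkthrough sit at these locations:

```
rubigdata/
├── build.sbt
├── project/
│   └── plugins.sbt
├── lib/
│   └── hadoop-concat-gz-1.2.3-preview.jar
├── src/
│   └── main/
│       └── scala/
│           └── org/
│               └── rubigdata/
│                   └── RUBigDataApp.scala
└── target/            (generated by sbt)
    └── scala-2.12/
```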
Look at the provided RUBigDataApp.scala code so you know what to expect to see as output!
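If you cannot peek at the file right now: stripped of the SparkSession plumbing, the computation behind the sample output boils down to two substring counts over the input lines. A sketch of that core logic — not the template’s exact code, and the sample text here is made up:

```scala
object CountDemo {
  // Count how many lines contain the given character
  def countLinesWith(lines: Seq[String], ch: Char): Int =
    lines.count(_.contains(ch))

  def main(args: Array[String]): Unit = {
    // In the real app, the lines come from a text file read via Spark
    val lines = Seq("a simple app", "built with sbt", "run via spark-submit")
    println(s"Lines with a: ${countLinesWith(lines, 'a')}, " +
            s"Lines with e: ${countLinesWith(lines, 'e')}")
  }
}
```

The actual template wraps this in a SparkSession so the counting runs as distributed Spark jobs rather than on a local Seq.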
Build the stand-alone application (the very first time you execute this in your container, it will take longer than subsequent commands):
sbt package
The result is a jar file generated in the target/scala-2.12 directory, which we execute using Spark’s spark-submit command:
spark-submit target/scala-2.12/rubigdataapp_2.12-1.0.jar
In between all the logging information that scrolls by, you will see the default output, delineated by lines with ##### to make it stand out:
########## OUTPUT ##########
Lines with a: 3, Lines with e: 4
########### END ############
Packaging and Including External Libraries
It makes no sense to write code from scratch for subtasks like parsing WARC files. However, it can be tricky to set up a distributed system such that all dependencies on external libraries are actually found when executing the program.
To include external libraries, two sbt plugins help simplify this workflow: sbt-assembly and sbt-pack (follow the links for their GitHub repositories).
With sbt-assembly, you easily create a single large Java archive (or jar, referred to as a fat jar) that includes all dependencies, without having to worry about setting the right CLASSPATH and/or adding all the correct parameters for including libraries when issuing spark-submit commands.
Once the plugin is enabled, you create a fat jar using the command sbt assembly instead of sbt package. The assembly plugin is enabled through the existence of the file plugins.sbt in the directory rubigdata/project, which contains the line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
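If you ever hit “deduplicate” errors when assembling a fat jar (common when several dependencies ship clashing META-INF entries), sbt-assembly lets you resolve conflicts with a merge strategy in build.sbt. A sketch in the 0.x-style syntax matching the plugin version above:

```scala
// build.sbt (sbt-assembly 0.x syntax): discard duplicate META-INF
// entries, keep the first copy of anything else that clashes
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```

The template may not need this at all; it only becomes relevant once you add libraries of your own.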
The course container takes care of most dependencies on your behalf, and unless you want to use other libraries, you may not need to know more. However, let us describe this setup in more detail, in case you ever need to do this yourself with an additional external library.
First, we have the dependencies listed in the rubigdata/build.sbt file, where the fourth and fifth lines specify that we use the default JSoup library, distributed through the standard method of referencing libraries, and the HadoopConcatGz library, provided as a java archive:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkV % "provided",
"org.apache.spark" %% "spark-sql" % sparkV % "provided",
"org.apache.hadoop" % "hadoop-client" % hadoopV % "provided",
"org.jsoup" % "jsoup" % "1.11.3",
"com.github.helgeho" % "hadoop-concat-gz" % "1.2.3-preview" from "file:///opt/hadoop/rubigdata/lib/hadoop-concat-gz-1.2.3-preview.jar"
)
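A detail worth knowing when you add a dependency of your own: %% appends the Scala binary version to the artifact name (so spark-core resolves to the artifact spark-core_2.12), while plain % (used above for the Java libraries JSoup and HadoopConcatGz) leaves the name untouched. Adding an extra library would look like this — the coordinates here are hypothetical, look up the real ones on Maven Central:

```scala
// Hypothetical coordinates, for illustration only
libraryDependencies += "com.example" %% "some-scala-lib" % "1.0.0"
```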
The second place where dependencies are specified is in the Zeppelin interpreter settings. It is, of course, mandatory to keep these two places consistent; otherwise your standalone program would use different libraries than the code running in the notebook.
Finally, the sbt-pack plugin can be useful for gathering all dependent libraries in a single location under target/pack, to ease copying these jars into the Zeppelin docker container, for example.
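Enabling sbt-pack works just like enabling sbt-assembly: add a line to rubigdata/project/plugins.sbt and then run sbt pack. The version number below is an assumption — check the plugin’s GitHub page for the current release:

```scala
// project/plugins.sbt -- the version shown is illustrative
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.13")
```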
Next steps
Before trying to analyze Commoncrawl data, it wouldn’t hurt to turn code from your previous assignments into a standalone app. Assignment 5 is especially suited for deployment this way, as the output to the console may actually be easier to interpret from the command line than in the Zeppelin notebook.
You may want to include a brief discussion, perhaps using a concrete example, in the introductory part of your blog post, taking your readers along on the road from interactive small-scale preparatory experiments to actual use of a large cluster. You don’t have to, but in case you decide to do so later on: with a larger project like this, always keep a log of your progress!
Part III
With the knowledge from Part I on manipulating WARC files in Spark and your new skills in building self-contained apps, you are ready for Part III, where we run your final programs on a real cluster.
More information to follow.