The final project is an open assignment in which we work with data from a large web crawl.
The “No Worries” Disclaimer
The purpose of the assignment is only to gain hands-on experience with biggish data on a smallish cluster.
Self Contained Apps
In Part I you developed an idea of what you might do with those WARC files from the CommonCrawl. Before we can move on and turn that idea into working code on a multi-node cluster, we need to learn how to create a standalone application.
So far, we have only used Spark in an interactive setting, making best use of the user-friendly Zeppelin notebooks.
You should definitely develop your project like that: locally on your own machine using small test data, taken as samples from the real CommonCrawl.
However, a cluster is usually shared by many people, and this model does not work well with the concept of interactive notebooks - they tend to claim too many resources for a long time, and it is unclear when a job will complete. Also, query optimization may be hindered when the program logic is distributed over multiple cells that are submitted as isolated operations.
As a result, “real” big data processing rarely happens directly from within your notebook. Instead, we need to create a standalone program that is then deployed on the cluster. Your course project container has been set up to simplify this process, and you can go through the full procedure on your own machine.
Scala Build Tool
We will use the Scala Build Tool to reduce the effort necessary to create standalone applications that are bundled with all their dependencies and can be submitted to multiple worker nodes without problem.
Briefly skim this blog post on sbt to get an idea of what it is like; it is somewhat similar to using Maven for Java development.
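To give an impression of what such a build definition looks like, here is a minimal, hypothetical build.sbt for a Spark application; the names and version numbers below are illustrative only, and the template shipped in the container is authoritative:

```scala
// Hypothetical minimal build.sbt for a Spark application.
// Versions are illustrative; check the course template for the real ones.
name := "RUBigDataApp"
version := "1.0"
scalaVersion := "2.12.15"

val sparkV = "3.1.2"

// "provided" means the runtime (Spark itself) supplies these jars,
// so they are not bundled into the packaged application.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkV % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkV % "provided"
)
```

Compare this to a Maven pom.xml: the same information (project coordinates, compiler version, dependencies) in a few lines of Scala instead of XML.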
We have put a template directory structure ready for use in /opt/hadoop/rubigdata in your container (I will assume you named the container cc, but you might of course have used a different name when you created it).
Glance over the Spark project's documentation on creating self-contained applications and on subsequently submitting them to a cluster. We will return to using the YARN resource manager as in assignment two (running spark-submit against a YARN cluster), but for now we keep it simple, using only Spark itself without Hadoop and YARN.
The official documentation remains verbose and incomplete, so let us walk through the process with an actual (but overly simple) example of a Spark application written in Scala.
You will find the sample application in a sub-directory of /opt/hadoop/rubigdata, the working directory when entering the container:
docker exec -it cc /bin/bash
The two inputs to building the trivial sample app RUBigDataApp are the build.sbt file (the SBT equivalent of Maven's pom.xml) and the actual Scala code.
The directory structure is important; there are many implicit assumptions about what is located where - for the details, look into the sbt documentation on its default directory layout.
Look at the provided RUBigDataApp.scala code so you know what to expect to see as output!
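As a mental model of what to expect, a Spark application of this kind typically follows the quick-start pattern sketched below; the input file, variable names, and counted characters are my assumptions, so the actual RUBigDataApp.scala in the template is what counts:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of a minimal self-contained Spark app; the real
// RUBigDataApp.scala in /opt/hadoop/rubigdata is the authoritative version.
object RUBigDataApp {
  def main(args: Array[String]): Unit = {
    val inFile = "file:///opt/hadoop/rubigdata/README.md" // assumed input
    val spark  = SparkSession.builder.appName("RUBigDataApp").getOrCreate()

    val lines = spark.read.textFile(inFile).cache()
    val numAs = lines.filter(_.contains("a")).count()
    val numEs = lines.filter(_.contains("e")).count()

    println("########## OUTPUT ##########")
    println(s"Lines with a: $numAs, Lines with e: $numEs")
    println("########### END ############")
    spark.stop()
  }
}
```

Note the structure: one object with a main method, a SparkSession created inside it, and an explicit spark.stop() at the end - exactly what a notebook cell normally hides from you.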
Build the stand-alone application (the very first time you execute this in your container, it will take longer than subsequent commands):

sbt package
The result is a jar file generated in the target directory, which we execute using Spark's spark-submit. In between all the logging information that scrolls by, you will see the default output delineated by lines with ##### to make those lines easy to spot:

########## OUTPUT ##########
Lines with a: 3, Lines with e: 4
########### END ############
Packaging and Including External Libraries
It makes no sense to write code for subtasks like parsing WARC files from scratch. However, it can be tricky to set up a distributed system such that all the dependencies on external libraries are actually found when executing the program.
With the sbt-assembly plugin, you easily create a single large java archive or jar (referred to as a fat jar) that includes all dependencies, without having to worry about setting the right CLASSPATH and/or adding all the correct parameters for including libraries when issuing spark-submit.
Once the plugin is enabled, you create a fat jar using the command sbt assembly instead of sbt package. The assembly plugin is enabled through the existence of the file plugins.sbt in directory rubigdata/project, which contains the line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
The course container takes care of more dependencies on your behalf, and unless you want to use other libraries, you may not need to know more. However, let's describe this setup in more detail, in case you ever need to do this yourself with an additional external library.
First, we have the dependencies listed in the rubigdata/build.sbt file, where the fourth and fifth entries specify that we use the JSoup library, distributed through the standard method of referencing libraries, and the hadoop-concat-gz library, provided locally as a java archive:
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"       % sparkV  % "provided",
  "org.apache.spark"   %% "spark-sql"        % sparkV  % "provided",
  "org.apache.hadoop"  %  "hadoop-client"    % hadoopV % "provided",
  "org.jsoup"          %  "jsoup"            % "1.11.3",
  "com.github.helgeho" %  "hadoop-concat-gz" % "1.2.3-preview"
    from "file:///opt/hadoop/rubigdata/lib/hadoop-concat-gz-1.2.3-preview.jar"
)
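To illustrate what the JSoup dependency buys you: once the jar is on the classpath, extracting the title, visible text, or links from an HTML payload takes just a few calls. A small sketch, where the HTML string is made up for illustration (in your project it would come from a WARC record):

```scala
import org.jsoup.Jsoup

object JsoupSketch {
  def main(args: Array[String]): Unit = {
    // A made-up HTML snippet standing in for a WARC record's payload
    val html =
      """<html><head><title>Example</title></head>
        |<body><p>Hello world.</p>
        |<a href="https://example.org/">a link</a></body></html>""".stripMargin

    val doc = Jsoup.parse(html)
    println(doc.title())                        // the <title> text
    println(doc.body().text())                  // visible text, tags stripped
    println(doc.select("a[href]").attr("href")) // href of the first link
  }
}
```

This is exactly the kind of subtask you should not write from scratch: a robust HTML parser tolerates the malformed markup you will inevitably encounter in a web crawl.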
The second place where dependencies are specified is in the interpreter settings of Zeppelin. Of course, it is mandatory to keep these two places consistent, or your standalone program would use different libraries than the code running in the notebook.
The sbt-pack plugin can be useful here: it gathers all dependent libraries in a single location under target/pack, which eases copying these jars into the Zeppelin docker container, for example.
Before trying to analyze CommonCrawl data, it wouldn't hurt to turn code from your previous assignments into a standalone app. Assignment 5 is especially suited for deployment this way, as the output to the console may actually be easier to interpret from the command line than in the Zeppelin notebook.
You may want to include a brief discussion, perhaps using a concrete example, in the introductory part of your blog post, taking your readers along on the road from interactive small-scale preparatory experiments to actual use of a large cluster. You don't have to do this, but just in case you want to do that later on: with a larger project like this, always keep a log of your progress!
With the knowledge from Part I on manipulating WARC files in Spark and the new skill of creating self-contained apps, you are ready for Part III, where we run your final programs on a real cluster.
More information to follow.