Running Self-Contained Apps on Redbad

So far, we have only used Spark in an interactive setting, making best use of the Spark Notebook.

You should definitely develop your project like that: locally on your own machine using small test data, taken as samples from the real Commoncrawl.

However, eventually we want to run a standalone program on the cluster, using the actual Web data. These instructions will get you up and running with developing standalone Spark solutions that can be executed on the cluster.

Preparation

First install the Simple Build Tool sbt on the rubigdata/project:v0.9.1 Docker image:

echo "deb https://dl.bintray.com/sbt/debian /" | tee -a /etc/apt/sources.list.d/sbt.list
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
apt-get update
wget -c https://bintray.com/artifact/download/sbt/debian/sbt-0.13.16.deb
dpkg -i sbt-0.13.16.deb

This last command takes a few minutes to execute; go treat yourself to a cup of coffee or tea!

You should use the rest of the waiting time to glance over this blog post on sbt.

RUBigDataApp

Now we can start to actually develop the Standalone Spark App.

Glance over the Spark project’s documentation on creating self-contained Spark apps to find out how to create a standalone Spark application and subsequently run the application using spark-submit on a YARN cluster. The Spark documentation pages on launching jobs from Scala and submitting applications are also useful background reads.

The official documentation remains rather verbose, so let us walk through the process with an actual (but overly simple) example of a Spark application written in Scala.

Download the compressed archive rubigdata.tgz with this example code from the course website into your Docker container, and unpack the archive.

wget http://rubigdata.github.io/course/background/rubigdata.tgz
tar xzvfp rubigdata.tgz
cd rubigdata

Copy the included (very small) text file into your home directory on the cluster; feel free to modify the sample text file before copying it to HDFS.

export REDBADMASTER=3.218.145.86
export UNAME=<your username>
scp -i id_redbad_$UNAME rubigdata-test.txt $UNAME@$REDBADMASTER:/home/$UNAME/rubigdata-test.txt

Copy the file to HDFS (make sure you execute this command on Redbad, not in your local Docker container):

hdfs dfs -put rubigdata-test.txt

Note: the scp command above can also be used to copy the standalone app to Redbad.

The two inputs to building the trivial sample app RUBigDataApp are the build.sbt build file and the actual code, provided in rubigdata.scala. The directory structure provided through the compressed archive is important; for the details, look into the sbt documentation.
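For reference, a minimal build.sbt for an app like this looks roughly as follows. This is only a sketch, not a copy of the file in the archive: the Scala 2.12 line is inferred from the jar name produced below (rubigdataapp_2.12-1.0.jar), and the exact Spark version is an assumption that should match the cluster.

// build.sbt -- a minimal sketch; the build file shipped in the archive may differ
name := "RUBigDataApp"

version := "1.0"

scalaVersion := "2.12.12"   // assumed: a 2.12.x release, consistent with the rubigdataapp_2.12-1.0.jar name below

// Spark is marked "provided": spark-submit supplies it on the cluster at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1" % "provided"   // version is an assumption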

In order to create the correct HDFS file path, replace the uname placeholder in the Scala file with your Redbad username.
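The program itself follows the pattern of the self-contained application in Spark's quick start guide; a sketch of what to expect is given below. The rubigdata.scala shipped in the archive may differ in its details, and <uname> marks the placeholder you just filled in with your username.

import org.apache.spark.sql.SparkSession

// Sketch of a minimal self-contained app in the style of the Spark quick start;
// the rubigdata.scala shipped in the archive may differ in its details.
object RUBigDataApp {
  def main(args: Array[String]): Unit = {
    // <uname> is the placeholder discussed above: your Redbad username
    val textFile = "hdfs:///user/<uname>/rubigdata-test.txt"

    val spark = SparkSession.builder.appName("RUBigDataApp").getOrCreate()
    val data  = spark.read.textFile(textFile).cache()

    val numAs = data.filter(line => line.contains("a")).count()
    val numEs = data.filter(line => line.contains("e")).count()

    println("########## OUTPUT ##########")
    println(s"Lines with a: $numAs, Lines with e: $numEs")
    println("########### END ############")

    spark.stop()
  }
}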

Build the stand-alone application:

sbt package

The result is a jar file that we can execute on the cluster using Spark’s spark-submit command:

Update: We modified the submit command to limit the requested resources. Your submission is placed in a queue and will be executed according to the FIFO model.

spark-submit  \
    --master yarn \
    --deploy-mode client \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    --queue gold \
    rubigdataapp_2.12-1.0.jar

First inspect the RUBigDataApp.scala code, so you know what to expect as output!

The Spark application’s log can be downloaded from Spark’s History Server at localhost:18080/history/<your application id> (e.g., application_1592881135628_0065).

While the application runs, you will see the already familiar Spark UI; once the application is finished, you get a summary page, from which you can inspect the logs. In our simple example program, these logs contain the output: character counts in the textfile we stored in our home directory on HDFS.

The default output should be:

########## OUTPUT ##########
Lines with a: 3, Lines with e: 4
########### END ############

Packaging and Including External Libraries

It makes no sense to write code for subtasks like parsing WARC files from scratch.

To include external libraries, two sbt plugins are very useful for simplifying the workflow: sbt-assembly and sbt-pack (follow the links for their GitHub repositories).

With sbt-assembly, you can easily create a single large jar (a so-called fat jar) that includes all dependencies, without having to worry about setting the right CLASSPATH and/or adding all the correct parameters for including libraries when issuing spark-submit. The sbt-pack plugin gathers all dependent libraries in a single location under target/pack, which eases copying these jars into the Spark Notebook Docker container, for example.

sbt-assembly is enabled by placing a file plugins.sbt in the directory rubigdata/project that contains the line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
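When the fat jar pulls in larger dependencies, sbt-assembly often aborts with deduplication errors on files under META-INF. A merge strategy in build.sbt, roughly like the sketch below, is the usual way out; adapt the cases to the conflicts you actually run into.

// build.sbt: resolve duplicate files while assembling the fat jar (sketch)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard   // drop clashing manifests and signature files
  case x                             => MergeStrategy.first     // otherwise, keep the first copy encountered
}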

Standalone Apps using HadoopConcatGz

The “SurfSara and JWAT” combination that works fine locally, in the WARC for Spark I notebook (project part I), fails on the cluster: a few design decisions do not work out well with distribution through the new Hadoop and Spark.

Follow these instructions to turn code from the new and improved WARC for Spark II notebook (project part II) into a standalone application that should run on the cluster without pain.

Proceed as follows:

  1. Download hadoop-concat-gz-1.2.3-preview.jar and put it in a (new) directory rubigdata/lib/.
  2. Use the following dependencies in your rubigdata/build.sbt file, where sparkV and hadoopV are values defined elsewhere in build.sbt (set them to the Spark and Hadoop versions used on the cluster):
      libraryDependencies ++= Seq(
        "org.apache.spark"   %% "spark-core"       % sparkV  % "provided",
        "org.apache.spark"   %% "spark-sql"        % sparkV  % "provided",
        "org.apache.hadoop"  %  "hadoop-client"    % hadoopV % "provided",
        "org.jsoup"          %  "jsoup"            % "1.11.3",
        "com.github.helgeho" %  "hadoop-concat-gz" % "1.2.3-preview" from "file:///rubigdata/lib/hadoop-concat-gz-1.2.3-preview.jar"
      )
    
  3. Create a standalone Scala program with snippets reusing code from the notebook (see the sketch below)
  4. Create the fat jar using sbt assembly (look at the output to find the location of the jar, e.g., my own example creates /rubigdata/target/scala-2.12/RUBigDataTest-assembly-1.0.jar)
  5. Copy the fat jar that you just created to the cluster, e.g. using scp -F redbad-ssh-config RUBigDataTest-assembly-1.0.jar redbad:
  6. Run the fat jar using spark-submit, with submission options like above.

Hint: make sure that the WARC file you analyze in your app exists on the cluster, e.g., in HDFS under your user home-directory.
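To give an idea of step 3, the sketch below strings the pieces together: it loads WARC records through the HadoopConcatGz input format and counts links with Jsoup. The class and method names follow the WARC for Spark II notebook, so double-check them against your own notebook code; the HDFS path and the link-counting logic are only an illustration, assuming you uploaded a WARC file to your home directory.

import de.l3s.concatgz.io.warc.{WarcGzInputFormat, WarcWritable}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.sql.SparkSession
import org.jsoup.Jsoup

// Sketch of a standalone app reusing the notebook's HadoopConcatGz approach;
// method names follow the WARC for Spark II notebook and may need adjusting.
object RUBigDataApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RUBigDataApp").getOrCreate()
    val sc    = spark.sparkContext

    // Hypothetical path: replace with a WARC file in your own HDFS home directory
    val warcfile = "/user/<uname>/sample.warc.gz"

    val warcs = sc.newAPIHadoopFile(
      warcfile,
      classOf[WarcGzInputFormat],   // splittable reader for (concatenated) gzipped WARC files
      classOf[NullWritable],        // key: unused
      classOf[WarcWritable]         // value: wraps a single WARC record
    )

    // Count the <a href> tags in HTML responses, as a stand-in for your own analysis
    val links = warcs
      .map    { case (_, writable) => writable }
      .filter { _.isValid }                      // skip records that did not parse
      .map    { _.getRecord }
      .filter { _.getHeader.getHeaderValue("WARC-Type") == "response" }
      .map    { r => Jsoup.parse(r.getHttpStringBody).select("a[href]").size }
      .reduce(_ + _)

    println("########## OUTPUT ##########")
    println(s"Links found: $links")
    println("########### END ############")

    spark.stop()
  }
}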

Next steps

Before trying to analyze Commoncrawl data, it wouldn’t hurt to turn code from your previous assignments into a standalone app.

You may want to include a brief discussion, perhaps using a concrete example, in the introductory part of your blog post, taking your readers along on the road from interactive, small-scale preparatory experiments to actual use of a large cluster. So keep a log of your progress!