The course can be found on Brightspace and in the course prospectus.

Practical Assignments

Objectives

The practical work for the course serves two main objectives:

  1. Get to know the emerging Big Data platforms in more detail by using them, with a focus on data analysis using Spark.
  2. Gain hands-on experience using modern development tools and services, including git, GitHub, Docker, and Jupyter notebooks.

Completing and Evaluating Assignments

In this course, we use GitHub for education to hand in brief reports (like blog posts) about the results of the practical work. After completing a blog post, you submit its URL to PeerGrade (through Brightspace) and take part in a peer review phase.

Refer to assignment 1a for more details about the blogging framework.

Getting help

Check out the forum to get help, or even to help out your peers. If you feel the forum is not the right place for your question, you can contact one of the student assistants at L.Kuiper AT student.ru.nl (Laurens), D.Anthes AT student.ru.nl (Daniel) or P.Boers AT student.ru.nl (Pepijn).

Assignments Schedule

Assignments started on February 20th, 2020.

Blogging and Docker

The first step is to get familiar with blogging through GitHub Pages using git. Accept the assignment (link in the Blogging assignment page) and complete it before the March 5th lab session; your latest commit at 8:30 on the morning of March 5th automatically becomes your submission.

Instructions:

If you are done with the blogging assignment, continue to get familiar with Docker and Scala:

Hadoop: HDFS and MapReduce

The second assignment gives you some hands-on experience with Hadoop, the de facto standard tooling for big data. Learn to use HDFS to store your data, and process it with MapReduce.

Deadline: March 26th, noon.
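
To make the MapReduce processing model concrete before working with Hadoop itself, here is a minimal word-count sketch in plain Python. It is not Hadoop code and not part of the assignment: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in a single process, whereas Hadoop distributes the same three phases over a cluster and reads its input from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values that share a key into one result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases over an in-memory collection of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input only; a real job would read files stored in HDFS.
    sample = ["big data big ideas", "data beats ideas"]
    print(word_count(sample))
```

The point of the split is that the map and reduce functions only ever see one record or one key at a time, which is what lets a framework run them in parallel on different machines.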

Spark: RDDs and DataFrames

The third assignment provides exercises to help you understand Spark’s basic APIs using RDDs and DataFrames/SQL.

Deadline: May 3rd, midnight.

Spark ML: Recsys using Matrix Factorization with ALS

The guest lecture on May 4th is given by two Data Science alumni who joined Big Data Republic. The lecture is followed by a (brief) hackathon, with results due in the lab session on May 14th at 9.55:

Deadlines: May 14th 9.55 for the leaderboard; May 15th noon (12.00) for the finalised blog post.
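
As background for the section title, the sketch below illustrates the idea behind matrix factorization with alternating least squares (ALS) in plain Python. It is a deliberately simplified rank-1 version under stated assumptions, not the Spark ML implementation used in the hackathon: each user and each item gets a single latent number, a rating is predicted as their product, and the names `als_rank1`, `solve_side` and `objective` as well as the toy ratings are hypothetical.

```python
from typing import List, Sequence, Tuple

Rating = Tuple[int, int, float]  # (user index, item index, observed rating)


def objective(ratings: Sequence[Rating], users: List[float],
              items: List[float], reg: float) -> float:
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    fit = sum((r - users[u] * items[i]) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(x * x for x in users) + sum(y * y for y in items))
    return fit + penalty


def solve_side(ratings: Sequence[Rating], fixed: List[float], size: int,
               reg: float, user_side: bool) -> List[float]:
    """Closed-form least-squares update of one side while the other is fixed."""
    numerator = [0.0] * size
    denominator = [reg] * size  # reg > 0 keeps the division well defined
    for u, i, r in ratings:
        target, other = (u, i) if user_side else (i, u)
        numerator[target] += r * fixed[other]
        denominator[target] += fixed[other] ** 2
    return [n / d for n, d in zip(numerator, denominator)]


def als_rank1(ratings: Sequence[Rating], n_users: int, n_items: int,
              reg: float = 0.01, iterations: int = 10):
    """Alternate between solving for user factors and item factors."""
    users = [1.0] * n_users
    items = [1.0] * n_items
    history = []  # objective value after each full alternation
    for _ in range(iterations):
        users = solve_side(ratings, items, n_users, reg, user_side=True)
        items = solve_side(ratings, users, n_items, reg, user_side=False)
        history.append(objective(ratings, users, items, reg))
    return users, items, history


if __name__ == "__main__":
    # Toy data: 2 users, 3 items; the pair (user 1, item 2) is unobserved.
    toy = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
    users, items, history = als_rank1(toy, n_users=2, n_items=3)
    print("objective per iteration:", history)
    print("predicted rating for the unobserved pair:", users[1] * items[2])
```

Because each half-step solves its least-squares problem exactly, the regularised objective cannot increase from one iteration to the next. Practical recommenders use higher-rank factors, which turns each update into a small linear system per user or item.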

Streaming

Deadline: June 8th, midnight.

Final project

The final project brings together everything we learned in the previous exercises: a big data analysis on a (sample of a) large Web crawl distributed by the CommonCrawl.

A successful project differs in a few ways from how we have worked so far:

  • Running Spark code standalone, outside the notebook interface;
  • Scaling up your workload in multiple steps, to tackle the issues that you encounter one by one.

Instructions:

The deadline for the final project has been set to Monday June 30th (round 1) or July 9th (round 2).