Course Information RU Big Data 2020 (NWI-IBC036)
Related Course Information
The practical work for the course serves two main objectives:
- Get to know the emerging Big Data platforms in more detail by using them, with a focus on data analysis using Spark.
- Gain hands-on experience using modern development tools and services, including
Jupyter notebooks, etc.
Completing and Evaluating Assignments
In this course, we use GitHub for education to hand in brief reports (written as blog posts) about the results of the practical work. After completing your blog post, submit its URL to PeerGrade (through Brightspace) and take part in the peer review phase.
Refer to assignment 1a for more details about the blogging framework.
Check out the forum to get help, or even help out your peers.
If you feel the forum is not the right place for your question, you can contact one of the student assistants at
L.Kuiper AT student.ru.nl (Laurens),
D.Anthes AT student.ru.nl (Daniel) or
P.Boers AT student.ru.nl (Pepijn).
Assignments started on February 20th, 2020.
Blogging and Docker
The first step is to get familiar with blogging through GitHub Pages using
Accept the assignment (link in the Blogging assignment page) and complete it before
the March 5th lab session; your latest commit at 8:30 in the morning of March 5th
automatically becomes your submission.
- Blogging assignment (1A)
If you are done with the blogging assignment, continue to get familiar with Docker and Scala:
Hadoop: HDFS and MapReduce
The second assignment gives you some hands-on experience with Hadoop, the de facto standard tooling for big data. Learn to use HDFS to store your data, and process it with MapReduce.
Deadline March 26th, noon.
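To make the MapReduce processing model concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It shows the three stages of a word count: mappers emit key/value pairs, the framework groups the values by key, and reducers aggregate each group.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence on an in-memory list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "spark processes big data"]
    print(word_count(docs))
```

In a real cluster the mappers and reducers run in parallel on different machines and the input is read from HDFS; this sketch only mirrors the data flow on a single machine.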
Spark: RDDs and DataFrames
The third assignment provides exercises to help you understand Spark’s basic APIs using RDDs and DataFrames/SQL.
Deadline May 3rd, midnight.
Spark ML: Recsys using Matrix Factorization with ALS
The guest lecture on May 4th is given by two Data Science alumni who joined Big Data Republic. The lecture is followed by a (brief) hackathon, with results due in the lab session on May 14th, 9.55:
Deadlines: May 14th 9.55 for the leaderboard; May 15th noon (12.00) for the finalised blog post.
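As a rough illustration of what matrix factorization with alternating least squares (ALS) does, the sketch below fits a rank-1 model in plain Python. It is not the Spark ML implementation and is not taken from the assignment; the function name `als_rank1`, the regularisation value, and the toy ratings are assumptions made for this example. Each user and each item gets a single number, and a rating is predicted as their product. ALS alternates between fixing the item factors to solve for the user factors and fixing the user factors to solve for the item factors.

```python
def als_rank1(ratings, n_users, n_items, lam=0.001, iterations=50):
    """Fit rating[u, i] ~ user[u] * item[i] on the observed ratings only.

    ratings maps (user_index, item_index) to a rating.
    lam is an L2 regularisation weight; it must be positive so that
    users or items without any rating do not cause a division by zero.
    """
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Step 1: keep item factors fixed, solve the least-squares problem per user.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            numerator = sum(r * item[i] for i, r in seen)
            denominator = lam + sum(item[i] ** 2 for i, _ in seen)
            user[u] = numerator / denominator
        # Step 2: keep user factors fixed, solve the least-squares problem per item.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            numerator = sum(r * user[u] for u, r in seen)
            denominator = lam + sum(user[u] ** 2 for u, _ in seen)
            item[i] = numerator / denominator
    return user, item


if __name__ == "__main__":
    # Toy data: the full matrix would be [[1, 2, 3], [2, 4, 6]];
    # the rating of user 1 for item 2 (true value 6) is held out.
    observed = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
    user, item = als_rank1(observed, n_users=2, n_items=3)
    print("predicted rating for user 1, item 2:", round(user[1] * item[2], 2))
```

A recommender on real data would use more than one latent factor per user and item, and would rely on a distributed implementation rather than these nested loops.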
Deadline: June 8th, midnight.
The final project brings together everything we learned in the previous exercises: a big data analysis on a (sample of a) large Web crawl distributed by the CommonCrawl.
A successful project requires a few changes from the way we have worked so far:
- Run Spark code standalone, outside the notebook interface;
- Scale up your workload in multiple steps, tackling the issues you encounter one by one.
The deadline for the final project has been set to Monday June 30th (round 1) or July 9th (round 2).