Course Information RU Big Data 2018 (NWI-IBC036)
Related Course Information
The practical work for the course serves two main objectives:
- Get to know the emerging Big Data platforms in more detail by using them, with a focus on data analysis using Spark.
- Gain hands-on experience using modern development tools and services, including Jupyter notebooks.
Completing and Evaluating Assignments
In this course, we use GitHub for Education to hand in brief reports (like blog posts) about the results of the practical work. Standard git functionality serves perfectly for carrying out peer review and for keeping track of modifications made in response to feedback.
Refer to assignment 1a for more details about the blogging framework.
Double-check with the Blackboard announcement to avoid common mistakes.
Check out the forum to get help, or even help out your peers.
If you feel the forum is not the right place for your question, you can contact one of the student assistants at
wietse.kuipers AT student.ru.nl or
J.martens AT student.ru.nl
February 21st: Setup and first test blog
The first assignment is not evaluated. However, I expect that every student will be able to create their assignment reports in Markdown and have these converted to HTML on their GitHub page, and will know how to use Docker on their hardware of choice (be it a laptop, their home computer or a PC in HG 00.075).
Usually, working with big data requires accessing a large cluster through a remote shell.
If you are not that familiar with using a computer from the command line in a terminal, or struggle with UNIX commands, I recommend this UNIX Tutorial for Beginners, which is nicely organised by topic.
Once you have your template blog post published and Docker installed, explore Scala using the links in assignment 1b. Write up your experience in Markdown, following assignment 1a to publish it as a first blog post.
Instructions (assignments A1a and A1b):
Final words: Practice makes perfect.
Take your time to improve your CLI (command-line interface) skills! UNIX has a steep learning curve, but the time invested pays back big time in the long run.
March 14th: Hadoop and Map-Reduce
Extended deadline, in case you missed this one: March 21st. This is the only and final extension for this assignment.
The objective of assignment 2 is to gain some hands-on experience with HDFS.
In the first week, we install HDFS in pseudo-distributed mode and explore the default examples. In the subsequent weeks, you write a blog post discussing your own Map-Reduce jobs on your own cluster.
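To make the shape of a Map-Reduce job concrete before you write one for Hadoop, here is a minimal sketch of a word count using only plain Scala collections. It is not the Hadoop API and not part of the assignment material: the object and function names (`WordCountSketch`, `mapPhase`, `shuffle`, `reducePhase`) are hypothetical, and everything runs in a single process rather than on a cluster.

```scala
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in one line of input.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: group all emitted pairs by key, keeping only the values.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: sum the counts collected for one word.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // The full job: map every line, shuffle by word, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (w, cs) => reducePhase(w, cs) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("the elephant in the room", "the room")
    // Print the counts sorted by word, one tab-separated pair per line.
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w\t$c") }
  }
}
```

On a real cluster, the framework performs the shuffle step for you and distributes the map and reduce calls over many machines. The sketch only shows which piece of logic belongs to which phase.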
Enjoy the Elephant in the Room!
April 25th: Spark
I recommend completing assignment 3A before the mid-term test.
The objective of assignment 3 is to gain hands-on experience with Spark.
Assignment 3A is designed to enhance your understanding of RDDs and how they are executed. Assignment 3B helps you carry out a basic data analysis task using Spark DataFrames and/or Spark SQL.
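A key point about how RDDs are executed is that transformations such as `map` and `filter` are lazy: they only describe the computation, and nothing runs until an action asks for a result. The sketch below imitates this behaviour with a plain Scala collection view instead of Spark, so it runs without a cluster or any Spark dependency. It is an analogy under that assumption, not Spark code, and the names `LazyPipelineSketch` and `run` are hypothetical.

```scala
object LazyPipelineSketch {
  // Returns (result, elements processed before the "action", elements processed after it).
  def run(data: Seq[Int]): (Int, Int, Int) = {
    var processed = 0
    // "Transformations": building the view records the steps but runs nothing yet.
    val pipeline = data.view
      .map { x => processed += 1; x * 2 }
      .filter(_ % 3 == 0)
    val beforeAction = processed
    // "Action": asking for the sum forces the whole pipeline to run.
    val result = pipeline.sum
    (result, beforeAction, processed)
  }

  def main(args: Array[String]): Unit = {
    val (result, before, after) = run(1 to 6)
    println(s"result=$result, processed before action=$before, after action=$after")
  }
}
```

The counter stays at zero until `sum` is called, which is the same pattern you will observe in Spark when a chain of transformations does no work until an action such as `count` or `collect` is invoked.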