Course Information RU Big Data 2021 (NWI-IBC036)
Related Course Information
The practical work for the course serves two main objectives:
- Get to know the emerging Big Data platforms in practice (especially Spark)
- Gain hands-on experience with modern development tools (git, Docker, Jupyter notebooks, etc.)
| # | Assignment | Lab sessions | Deadline | Review deadline |
|---|------------|--------------|----------|-----------------|
| A1 | Big Data Blog | 10/2 & 11/2 | 21/2 | 28/2 |
| A2 | Hello Hadoop | 3/3 & 4/3 | 22/3 * | 29/3 |
| A3 | Sparkling Spark | 24/3 & 25/3 | 12/4 † | 19/4 |
| A4 | Open Data Frame | 7/4 & 8/4 | 26/4 | 3/5 |
| A5 | Streaming Data | 21/4 & 22/4 | 10/5 | 17/5 |
| P | Project | 12/5 & 13/5 | 14/6 (option 1) / 5/7 (option 2) | 21/6 (option 1) / 12/7 (option 2) |
*, †: Recommended date; one week of grace time for late submission (peer review starts upon submission).
Proceed as follows:
- Clone the assignment repository from the GitHub Classroom
- Follow the assignment instructions, and keep a log of your progress
- Push the assignment repository to GitHub (A2-5)
- Write up your results in a blog post in the A1 repository
- Submit the blog post in PeerGrade (through Brightspace)
- Review up to 3 blogs from your peers (number varies per assignment)
Deadlines are midnight, CET. Peer reviews are assigned upon submission.
Refer to A1 for detailed information on the blogging framework.
Official hours for the Lab sessions are Thursdays 8.30 - 10.15.
During those hours, TAs Lisa Hoek and Niels Cornelissen will be available in the Lab Sessions Matrix room for Q&A and for helping out with any issues you may encounter. I try to attend every lab session myself, but cannot always make it.
Wednesdays 13.30 - 15.15 is the backup hour for Lab sessions; PhD candidate Chris Kamphuis and I will keep an eye on the room for students who cannot make it on Thursdays. Feel free to work on the assignments throughout the week; TAs and teachers will occasionally find the time to answer questions at other moments too - just post them in the Matrix room and we will take it from there.
Detailed Assignment Instructions
Links to the assignments are added after release.
A1: Blogging and Docker
The first step is to get familiar with blogging through GitHub Pages and git. Accept the assignment (link on the Blogging assignment page) and complete it before the March 5th lab session; your latest commit will automatically become the submitted result at half past eight in the morning of March 5.
Setting up Docker, and using it to meet Scala as a new language (to most), is covered on separate pages:
Hand in your assignment by posting the blog to PeerGrade, one of the Brightspace learning modules. Details follow on Brightspace!
Deadline Feb 21st, 2021, 23:59 CET.
A2: Hello Hadoop
The second assignment gives you some hands-on experience with the de facto big data standard tooling, Hadoop. Learn to use HDFS to store your data, and process it with Map Reduce.
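Hadoop itself exposes a Java API, but the shape of a MapReduce job is easy to sketch in advance. The following plain-Scala word count is only an illustration of the paradigm, not the actual Hadoop API: the mapper emits (word, 1) pairs, the reducer sums the values per key, and `groupBy` stands in for the shuffle phase that Hadoop performs between machines.

```scala
object MiniMapReduce {
  // Map phase: each input line becomes a list of (word, 1) pairs
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle + reduce phase: group the pairs by key, then sum each group's values
  def reducer(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  // A whole "job": run the mapper over all lines, then reduce the emitted pairs
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducer(lines.flatMap(mapper))
}
```

In real Hadoop, the mapper and reducer run on different machines over data stored in HDFS; the logic per record, however, looks just like this.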
Deadline March 22nd, 2021, 23:59 CET.
Grace period: one week, until March 29th, midnight.
A3: Sparkling Spark
The third assignment provides exercises to help you understand Spark’s RDDs, its first and foundational API.
- Spark’s RDDs (3)
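Spark's RDD API deliberately mirrors Scala's collection API (`flatMap`, `filter`, `map`, ...), so you can often prototype your logic on a plain List before running it on a cluster. A hypothetical sketch in plain Scala, with no SparkContext involved - in real Spark you would start from `sc.textFile` or `sc.parallelize`, and replace the `groupBy` with `reduceByKey`:

```scala
object RddStyle {
  // Find the most frequent word; every step has a direct RDD counterpart
  def topWord(lines: List[String]): (String, Int) =
    lines
      .flatMap(_.toLowerCase.split("\\W+")) // like rdd.flatMap
      .filter(_.nonEmpty)                   // like rdd.filter
      .map(w => (w, 1))                     // like rdd.map
      .groupBy(_._1)                        // stands in for rdd.reduceByKey(_ + _)
      .map { case (w, ps) => (w, ps.size) }
      .maxBy(_._2)                          // an action, like rdd.takeOrdered(1)
}
```

On a real cluster, `reduceByKey` is preferred over `groupBy`-style operations because it combines values locally on each node before shuffling them over the network.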
Deadline April 12th, 2021, 23:59 CEST.
A4: Open Data
The fourth assignment provides exercises to help you understand Spark Datasets, DataFrames and Spark SQL: Spark's second API, which is much more efficient, especially when handling structured data.
- Open Data (4)
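Datasets combine the typed, collection-like feel of RDDs with the efficiency of Spark SQL's optimizer. As a hypothetical sketch, with plain Scala collections standing in for a `Dataset[Reading]` so that it runs without Spark, the following computes a per-group average - the kind of aggregation Spark SQL would express as `SELECT city, AVG(temp) FROM readings GROUP BY city`:

```scala
object StructuredStyle {
  // In Spark this row type would back a Dataset[Reading]; here a Seq stands in
  case class Reading(city: String, temp: Double)

  // Average temperature per city: group the typed rows, then aggregate each group
  def avgTemp(readings: Seq[Reading]): Map[String, Double] =
    readings
      .groupBy(_.city)
      .map { case (city, rs) => (city, rs.map(_.temp).sum / rs.size) }
}
```

The point of the structured APIs is that Spark knows the schema (`city: String`, `temp: Double`), so it can plan and optimize the aggregation instead of treating your functions as opaque code, as it must with RDDs.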
Deadline April 26th, 2021, 23:59 CEST.
A5: Sparkling Streams (5)
Deadline: May 10th, 2021, 23:59 CEST. (Extended deadline: May 17th, 2021, 23:59 CEST.)
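The core idea behind stream processing, which this assignment explores, is maintaining running state while data arrives in small batches rather than as one complete dataset. A hypothetical plain-Scala sketch of micro-batch word counting - in real Spark Structured Streaming, the engine maintains this state for you across micro-batches read via `readStream`:

```scala
object MicroBatch {
  // Fold one micro-batch of lines into the running word counts
  def updateState(state: Map[String, Int], batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split("\\s+")).filter(_.nonEmpty)
      .foldLeft(state) { (acc, w) => acc.updated(w, acc.getOrElse(w, 0) + 1) }

  // Process a whole stream by folding the update over the arriving batches
  def run(batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(updateState)
}
```

Note that the result after any batch is a valid answer "so far" - exactly the property that makes a streaming proof of concept comparable to its batch equivalent.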
P: Final project
The final project brings together everything we learned in the previous exercises: a big data analysis on a (sample of a) large Web crawl distributed by CommonCrawl.
A successful project differs in a few ways from the way of working so far:
- Running Spark code standalone, outside the notebook interface;
- Scaling up your workload in multiple steps, tackling the issues you encounter one by one.
Because cluster configuration took longer to complete than anticipated, we offer two alternatives:
You are free to select the path that is best for you.
Option 1: If you prefer to finish by the current deadline (which I will extend by one more week; of course you are allowed to submit early), it is fine to complete an alternative to the original project. You would not have to focus on the scaling up, but instead combine parts I & II on working with WARCs with the streaming data approaches we experimented with in assignment 5. You should deliver a streaming application that carries out an analysis of (a sample of) WARC data from CommonCrawl, as an alternative to the batch setting. You do not have to process the full crawl - a proof of concept suffices; convince us that it would give the same result as the batch alternative, if only you had more runtime.
Option 2: You do part III as foreseen, with the deadline set three weeks after the release of part III. Of course, you can always hand in something early. To accommodate everyone's schedules, there will also be a second round for students who cannot free up the time now due to BSc theses and exams - details to follow.
The deadline for option 1 of the final project has been set to Monday June 14th (round 1); for option 2, Monday July 5th (round 1). Information on how to request an extension (when needed) will follow.
Ask questions in the Lab Sessions Matrix Room.
Finally, for more complex problems where the Matrix room is not a suitable medium to explain the problem, and the answer has not yet been given in the FAQ, create an issue on the Forum; use the Forum to get help and/or to help out your peers.
If you feel the Forum is not the right place for your question, you can always open a private chat after finding Lisa, Niels, Chris and/or me in the Matrix room.