Open Data

Assignment 4 helps you understand how Spark SQL and the Spark DataFrame API let you explore and analyze open datasets.

Accept the assignment via GitHub for Education, using this invitation link.

The assignment consists of two parts. We expect you to finish part A in the first week of the assignment and part B in the second week. Make sure you complete part A before you start part B (the final result of 4A is a prerequisite for 4B).

You will find the two notebooks for assignment 4 in your assignment repository. You do not need a new container for assignment 4; simply import the notebooks for parts A and B into the same Zeppelin instance you used in assignment 3.

Blog Post

The assignment is to write a blog post about your own exploration of open data.

The notebooks suggest what to try when exploring the given datasets. The most basic version of a blog post for assignment 4 continues the analysis of the data already used in the provided notebook. Can you give a more complete list of dated quarters, or improve the estimated dates? Can you include different data, or answer a different question about this data?

Alternatively, you could carry out a small analysis of a different open dataset of your own choice, and present some basic properties of that data to your readers.

You will notice that this is harder than following instructions, and you run the risk of getting stuck, say because the data does not parse well. However, the assignment will be much more rewarding!

For example, you are more than welcome to challenge yourself and use Google Dataset Search to identify another interesting dataset, or to integrate additional open data into the Nijmegen set illustrated in the notebook (open data released about the recent elections could be an interesting challenge!). Don’t make it a big data challenge though, as local cluster mode does not let you scale up for real. No worries: the final project has enough to offer when it comes to scaling up the data analyzed.

Reflect in your blog post on your own data analysis process. When you have completed your blog post, push it to the first assignment’s repository. Include a link to the published blog post in the README of the A4 assignment repository. Push the updated README, as well as your own code or notebook, to the assignment repository.
