Redbad Cluster Configuration Report
Cluster Setup
The cluster for the 2020 version of the course is named after King Redbad (a Frisian figure after whom Bishop Radboud was named; the university, in turn, is named after the bishop).
AWS
AWS setup is a complex DevOps challenge, and an expensive one when you get it wrong! Check out the separate AWS page for more info on our continuing struggle.
TA & teacher: find the dashboard under console.aws.amazon.com.
Spark on AWS
Using Spark on AWS involves the Elastic MapReduce (EMR) service. The documentation is pretty good (but extensive); most interestingly, AWS has released a whitepaper on teaching Big Data Skills with EMR. Nevertheless, you can spend many, many hours reading up on how to get the details right.
What follows is a summary of key configuration aspects of our cluster Redbad.
Release
We build upon the latest EMR release, emr-6.0.0, which provides:
- Hadoop 3.2.1
- Spark 2.4.4
- Zeppelin 0.9.0-SNAPSHOT
Other relevant dependencies include:
- Corretto JDK 8
- Scala 2.12
The Zeppelin image for the final project, rubigdata/project, has been configured to match these software versions and ease project preparations on local or university machines.
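As a rough illustration of how the release and applications listed above could be requested, the sketch below assembles an `aws emr create-cluster` command line. Only the release label and the application list come from this page; the cluster name, key name, instance type and instance count are placeholders and not our actual settings. The command is only built and printed, never executed.

```python
import shlex


def emr_create_cluster_args(name, key_name, instance_type="m5.xlarge", instance_count=3):
    """Build the argument list for `aws emr create-cluster` (nothing is executed).

    instance_type, instance_count and key_name are illustrative placeholders.
    """
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", "emr-6.0.0",  # release named on this page
        "--applications", "Name=Hadoop", "Name=Spark", "Name=Zeppelin",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--ec2-attributes", f"KeyName={key_name}",
        "--use-default-roles",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection before running it by hand.
    print(shlex.join(emr_create_cluster_args("redbad", "hypothetical-key")))
```

Building the arguments as a list rather than a single string keeps quoting explicit and makes it easy to review the settings before anything that costs money is launched.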
Setup of Assignment
Students accept the final assignment in GitHub for education in part I of the final project period. They start working on their project using the course Zeppelin notebook.
Meanwhile, we generate accounts and SSH keys on the Redbad Master for students who accepted the assignment. They complete the assignment using the EMR cluster, in part II of the project period.
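The account and key generation step might look roughly like the sketch below. It is not the script we actually use: the username, the key directory, the choice of ed25519 keys and the assumption that each user gets a group with the same name are all illustrative. The function only returns the commands, so they can be reviewed before being run as root on the master node.

```python
import shlex


def account_commands(username, key_dir="/tmp/student-keys"):
    """Return the commands that would create one student account with an SSH key.

    key_dir is a hypothetical location; nothing is executed here.
    """
    key_path = f"{key_dir}/{username}_id_ed25519"
    home_ssh = f"/home/{username}/.ssh"
    return [
        # Create the account with a home directory and a login shell.
        ["useradd", "--create-home", "--shell", "/bin/bash", username],
        # Generate a key pair without a passphrase; the private key goes to the student.
        ["ssh-keygen", "-t", "ed25519", "-N", "", "-C", username, "-f", key_path],
        # Create ~/.ssh with strict permissions (assumes a group named after the user).
        ["install", "-d", "-m", "700", "-o", username, "-g", username, home_ssh],
        # Install the public key as the account's authorized_keys file.
        ["install", "-m", "600", "-o", username, "-g", username,
         f"{key_path}.pub", f"{home_ssh}/authorized_keys"],
    ]


if __name__ == "__main__":
    for student in ["s1234567"]:  # hypothetical student account name
        for cmd in account_commands(student):
            print(shlex.join(cmd))
```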
CommonCrawl dataset
Info on using the CommonCrawl on Redbad: CommonCrawl course info.
Opportunities for 2021
2020 students, feel free to press “back” or skip to “See also” below.
Possible improvements for the 2021 version of the course:
- Use Livy (0.6.0 in EMR) for job submission;
- Use the novel Docker integration to handle dependencies; see this AWS blog post;
- Consider Kerberos authentication, set up so that rights management can be shared between cluster access and HDFS.
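For the Livy option in the list above, job submission goes through Livy's REST interface, where a batch job is created with a POST to `/batches`. The sketch below prepares such a request with the standard library only. The host name, jar path and class name are made up for illustration, and port 8998 is Livy's default, which may differ on a given cluster. The request is built but not sent.

```python
import json
import urllib.request


def livy_batch_request(livy_url, jar, class_name, args=()):
    """Prepare (but do not send) a POST /batches request for a Livy server."""
    payload = {
        "file": jar,              # application jar, e.g. a path on HDFS
        "className": class_name,  # main class of the Spark application
        "args": list(args),       # command-line arguments for the application
    }
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host, jar and class names, for illustration only.
    req = livy_batch_request(
        "http://redbad-master.example:8998",
        "hdfs:///user/student/project.jar",
        "org.rubigdata.Project",
        ["--sample", "10"],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Passing req to urllib.request.urlopen would actually submit the job.
```

Compared with handing out SSH access for `spark-submit`, a setup like this would let students submit jobs over HTTP, which is one reason the option is worth evaluating.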
Additional extensions that would be cool to explore include:
- Extend the course with a Kafka on EMR or AWS Managed Streaming for Kafka module;
- Extend the course with a tuning lecture;
- Include e-health use cases like medical records matching and creating and using a medical data lake.
Of course, these do not have to involve EMR, but could make use of our own cluster instead!
See also
Back to assignments overview.