AWS Cluster Configuration Report
AWS Elastic MapReduce (EMR)
We start from the AWS notes on teaching Big Data Skills with EMR (.pdf), and adapt a few settings:
- Instance Fleet configuration;
- No separate student and teacher IAM accounts;
- Reduced access to the application Web UIs.
User Admin Basics
Let’s navigate to the AWS console and start with some basics.
When you create an account, you get a root account; it is better to first create an IAM user for logging in to the console. What can be confusing is that the IAM user will not be able to access the billing info, even if you assign the admin group… unless you activate the global switch that gives IAM users and federated users with roles permission to access billing information.
See also the detailed access control tutorial.
You find your account identifier (tutorial) under the security credentials. (Here, you can also set an alias for the account id: click “Customize” under Dashboard.)
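Setting the alias can also be done from the API; a minimal boto3 sketch, with “rubigdata” as a hypothetical alias:

```python
import boto3

iam = boto3.client("iam")

# Register a human-readable alias for the numeric account id;
# "rubigdata" is a made-up example, pick your own.
iam.create_account_alias(AccountAlias="rubigdata")
```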
An account can be part of only one organization, which for the course should be Radboud University; unless we can do without membership by assigning a role, as described below.
Group setup
Because TAs are temporarily assigned to the course, and we will not use federation (we have not set up a course-specific identity mechanism… yet), it is best to create IAM Users for the TAs and manage permissions using a group (refer to When to Create an IAM User (Instead of a Role) for details).
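A sketch of this setup in boto3; the group name and TA user names below are made up for illustration:

```python
import boto3

iam = boto3.client("iam")

# One group carries all TA permissions (name is illustrative).
iam.create_group(GroupName="rubigdata-ta")

# One IAM User per TA; group membership grants the permissions.
for ta in ["ta-alice", "ta-bob"]:
    iam.create_user(UserName=ta)
    iam.add_user_to_group(GroupName="rubigdata-ta", UserName=ta)
```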
Create policies
First, we created a policy to let people manage their MFA setup, based on the reference policy example, adding a few access rights; specifically, iam:ChangePassword, so you can change your initial password before setting up MFA. This is somewhat insecure but convenient; we will remove it after all TAs have set up their accounts.
See the resulting policy file.
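For the general shape, a minimal sketch of registering such a policy via boto3; the statement below is modelled on the AWS reference example plus iam:ChangePassword, and the policy name is made up - the linked policy file remains the authoritative version:

```python
import json

import boto3

iam = boto3.client("iam")

# Let users manage their own password and (virtual) MFA device.
# Modelled on the AWS reference example; see the linked policy
# file for the exact statements we use.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowManageOwnCredentials",
        "Effect": "Allow",
        "Action": [
            "iam:ChangePassword",          # the convenience we added
            "iam:CreateVirtualMFADevice",
            "iam:EnableMFADevice",
            "iam:ResyncMFADevice",
        ],
        "Resource": [
            "arn:aws:iam::*:user/${aws:username}",
            "arn:aws:iam::*:mfa/${aws:username}",
        ],
    }],
}

iam.create_policy(PolicyName="ta-manage-own-mfa",
                  PolicyDocument=json.dumps(policy))
```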
Create S3 buckets
We plan to use multiple S3 buckets to store:
- cluster management resources;
- cluster logs;
- student data directories.
Note that you need special settings for logging EMR/Spark/Hadoop to S3: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html
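Creating the buckets is a one-liner each; a sketch with boto3, using made-up bucket names (S3 bucket names are globally unique, so prefix with something course-specific):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical names for the three buckets listed above;
# in us-east-1, no CreateBucketConfiguration is needed.
for bucket in ["rubigdata-mgmt", "rubigdata-logs", "rubigdata-students"]:
    s3.create_bucket(Bucket=bucket)
```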
CloudFormation
We configure the cluster using the CloudFormation service, starting from the examples included in the AWS big data class tutorial, but modifying the cluster configuration to our choosing (testing validity along the way using the CloudFormation Designer).
Next, go to the service, upload our modified baseline template together with our modified student template, and initiate the cluster. You may already see the assigned instances in the EC2 console before cluster initiation has completed.
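These steps can also be scripted; a sketch assuming the modified baseline template sits in a local file baseline.yaml, with a placeholder stack name:

```python
import boto3

cfn = boto3.client("cloudformation")

with open("baseline.yaml") as f:   # our modified baseline template
    template = f.read()

# Validate first, then launch the stack; CAPABILITY_IAM is required
# because the template creates IAM resources.
cfn.validate_template(TemplateBody=template)
cfn.create_stack(StackName="rubigdata-emr",
                 TemplateBody=template,
                 Capabilities=["CAPABILITY_IAM"])
```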
Finally:
- Follow the setup under One AWS Account Hosting One EMR Cluster for All Students (page 21);
- Use the provided YAML files for “every student their own cluster”, but then for a single “student” (basically, the management account).
- Sort out more restricted policies as we go.
Customizations
The templates have been modified to reduce total cost by making use of spot instance fleets (e.g., as in the online workshop). During the course, we should adapt our instance fleet to the capacity needed (especially close to the deadline…), adding task nodes and core nodes using the CLI.
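Through the SDK, adding a spot-backed task fleet could look like the following sketch; the cluster id, capacity, and instance type are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Grow a running cluster with extra spot task capacity near the
# deadline; "j-XXXXXXXXXXXXX" and the instance type are placeholders.
emr.add_instance_fleet(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceFleet={
        "Name": "extra-task-capacity",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 4,
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
        ],
    },
)
```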
TODO:
Should we configure custom service roles instead of the default ones?
Node types
According to the AWS [instance families doc][t-instance-fam], D2 or dense-storage instances are designed for workloads that require high sequential read and write access to very large data sets. I3 instance types are also storage-optimized, but come with SSDs and have better throughput. However, looking at the [AWS Spot Advisor][spotadv], these would often be interrupted (>20% probability). Low-cost options with low interruption probability (<5%) include R5(D), M5(D) and C5(D), where the D indicates relatively expensive storage using NVMe SSDs - use EBS storage otherwise. Ideally, this info is joined with the current EMR pricing and perhaps also [historic price information][sphist].
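Pulling the historic prices for such a comparison is straightforward; a sketch that queries last week's spot prices for the low-interruption candidates:

```python
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spot prices over the last week for the candidate instance types.
resp = ec2.describe_spot_price_history(
    InstanceTypes=["r5.xlarge", "m5.xlarge", "c5.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(days=7),
)
for price in resp["SpotPriceHistory"]:
    print(price["InstanceType"], price["AvailabilityZone"], price["SpotPrice"])
```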
TODO:
Configure an auto-scaling policy.
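One candidate is EMR managed scaling; a sketch of what that could look like, with a placeholder cluster id and capacity limits that still need tuning:

```python
import boto3

emr = boto3.client("emr")

# Keep the cluster between a minimum and maximum capacity; the
# cluster id and the unit limits are placeholders, not tuned values.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 16,
        }
    },
)
```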
Additional savings:
- We should consider reducing cost by using a series of additional nodes taken from reserved instances during the course's peak hours;
- buying vouchers on eBay is a known strategy to save on expenses (this should be done by CNCZ to claim the credits within the organization, I suppose).
Also, we should right-size the cluster more carefully to match the workload; a good topic for a research internship and/or BSc thesis project.
Alternative future setup
Perhaps science accounts can share their identity mechanism through SAML? In that case, we could switch to [IAM Roles][t-roles] in combination with an external ID: simply assign the role to the TA, who can then manage the cluster after assuming the role. As TAs might be active in multiple courses, they could fall victim to the confused deputy problem, so we would ask the TA to use the course code as external ID (NWI-IBC036).
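From the TA's side, assuming the role would look roughly like this sketch; the role ARN is a placeholder, and the external ID is the course code:

```python
import boto3

sts = boto3.client("sts")

# The external id (the course code) guards against the
# confused-deputy scenario; the role ARN is a placeholder.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/rubigdata-ta",
    RoleSessionName="ta-session",
    ExternalId="NWI-IBC036",
)["Credentials"]

# Use the temporary credentials for subsequent calls.
emr = boto3.client(
    "emr",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```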
The policy examples include both EMR and S3 policies that we should build upon.
S3
The CommonCrawl dataset on S3 resides in region us-east-1, with ARN arn:aws:s3:::commoncrawl. Should we 1) copy the data over to a local S3 bucket while setting up the cluster in us-east-1, or 2) set up a batch-copy or streaming solution to a cluster in Europe? Highly likely, without advanced pre-selection strategies, staying in us-east-1 will turn out to be the most cost-effective.
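As a quick sanity check that the data is reachable, a sketch that lists a few keys, assuming the public bucket still permits anonymous reads:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for the public commoncrawl bucket in us-east-1.
s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="commoncrawl",
                          Prefix="crawl-data/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```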
See also
- CommonCrawl info most relevant for the course.
Back to the RUBigData cluster.