AWS EMR, CloudFormation

Registering an EMR master node as an ALB target via CloudFormation or the CLI

Suppose you would like to register your EMR cluster's master node as a target for an ALB.

Unfortunately, this is not natively possible in CFN.
Because the AWS::EMR::Cluster resource only returns the DNS name of the master node, there is no way to pass either the IP or the instance ID of the master node to an AWS::ElasticLoadBalancingV2::TargetGroup resource using a !Ref or !GetAtt [1][2].

My guess is that many have requested this feature, but it has yet to be released. Once it is, you will be able to register the master node with the ALB by DNS name without issue.

Using bash and the CLI to register the EMR master node as an ALB target

Even outside of CFN, retrieving the ID of the master node can be a convoluted process.
To retrieve the ID manually, you need to make a DescribeCluster API call on the cluster, then take the MasterPublicDnsName and use it as a filter in an EC2 DescribeInstances API call [4][5].
Below is an example of how I retrieved the instance ID of a master node:
dns_name=$(aws emr describe-cluster --cluster-id $clusterid | jq -r '.Cluster.MasterPublicDnsName')
master=$(aws ec2 describe-instances --filters "Name=dns-name,Values=$dns_name" | jq -r '.Reservations[] | .Instances[] | .InstanceId')
echo $master

Using CloudFormation and Lambda to register the EMR master node as an ALB target

The only way to achieve this same kind of effect in CFN is to use a Custom Resource [6].
Custom Resources allow you to create a Lambda-backed CFN resource that makes API calls on your behalf that are otherwise unavailable, or not possible, in CFN. AWS provides a good tutorial on how they work and how to create them [7].
In our case, the lambda function would need to make the above types of calls on the EMR cluster to retrieve the DNS name and then MasterNode InstanceId. That information would then need to be passed as parameters in a RegisterTargets API call to the TargetGroup created in the CFN template [8].
If you do decide to go the CFN custom resource route, I recommend including a deletion process in the function to handle removing the targets from the ALB when the resource is deleted [9]. This cleanup helps avoid dependency errors when terminating the stack.
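To make the mechanics concrete, here is a minimal sketch of what such a Lambda-backed custom resource handler could look like. This is not the exact implementation from this post: the property names (TargetGroupArn, InstanceId) and the physical resource ID are assumptions for illustration, and a production handler would also report FAILED on exceptions.

```python
import json
import urllib.request

def build_response(event, status):
    """Payload CloudFormation expects the custom resource to PUT back."""
    return {
        "Status": status,
        "PhysicalResourceId": event.get("PhysicalResourceId", "emr-alb-target"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }

def lambda_handler(event, context):
    import boto3  # imported here so build_response stays testable without AWS deps
    elbv2 = boto3.client("elbv2")
    # TargetGroupArn and InstanceId are assumed to be passed in as resource properties
    props = event["ResourceProperties"]
    targets = [{"Id": props["InstanceId"]}]
    if event["RequestType"] in ("Create", "Update"):
        elbv2.register_targets(TargetGroupArn=props["TargetGroupArn"], Targets=targets)
    elif event["RequestType"] == "Delete":
        # the cleanup path: deregister so stack deletion does not hit dependency errors
        elbv2.deregister_targets(TargetGroupArn=props["TargetGroupArn"], Targets=targets)
    # signal CloudFormation via the pre-signed ResponseURL it sent with the event
    body = json.dumps(build_response(event, "SUCCESS")).encode()
    urllib.request.urlopen(urllib.request.Request(event["ResponseURL"], data=body, method="PUT"))
```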

If you are using Python, you can:

Get the EMR master node DNS via:

import boto3

def lambda_handler(event, context):
    client = boto3.client('emr')
    response = client.describe_cluster(
        ClusterId='j-112345678'
    )
    return response['Cluster']['MasterPublicDnsName']

Given the MasterPublicDnsName, get the correlated instance ID via:

import boto3

def lambda_handler(event, context):
    client = boto3.client('emr')
    response = client.describe_cluster(
        ClusterId='j-YJ9Z2ZMU0DJM'
    )
    # this is the DNS of the master node in EMR
    masterNodeDns = response['Cluster']['MasterPublicDnsName']

    client2 = boto3.client('ec2')
    response2 = client2.describe_instances(
        Filters=[
            {
                'Name': 'dns-name',
                'Values': [masterNodeDns]
            },
        ]
    )
    MasterNodeInstanceID = response2['Reservations'][0]['Instances'][0]['InstanceId']

    return MasterNodeInstanceID

Register targets via:

client3 = boto3.client('elbv2')
response3 = client3.register_targets(
    TargetGroupArn='arn:aws:elasticloadbalancing:eu-west-1:506754145427:targetgroup/zeppeling-stg-target/a4160357f4e8daff',
    Targets=[
        {
            'Id': MasterNodeInstanceID
        }
    ],
)

and deregister via:

response = client.deregister_targets(
    TargetGroupArn='string',
    Targets=[
        {
            'Id': 'string',
            'Port': 123,
            'AvailabilityZone': 'string'
        },
    ]
)

 

So the full Lambda will look like the following (notice the hardcoded stack name, StgEMR):

import boto3

def lambda_handler(event, context):

    # get the cluster ID given the CloudFormation stack name
    client4 = boto3.client('cloudformation')
    response4 = client4.describe_stack_resource(
        StackName='StgEMR',
        LogicalResourceId='EMRCluster'
    )
    CloudFormationStackClusterID = response4['StackResourceDetail']['PhysicalResourceId']

    # get the EMR master node DNS given the cluster ID
    client = boto3.client('emr')
    response = client.describe_cluster(
        ClusterId=CloudFormationStackClusterID
    )
    # this is the DNS of the master node in EMR
    masterNodeDns = response['Cluster']['MasterPublicDnsName']

    # get the instance ID of the EMR master node given masterNodeDns
    client2 = boto3.client('ec2')
    response2 = client2.describe_instances(
        Filters=[
            {
                'Name': 'dns-name',
                'Values': [masterNodeDns]
            },
        ]
    )
    MasterNodeInstanceID = response2['Reservations'][0]['Instances'][0]['InstanceId']

    # add the instance as a target of the ALB given the instance ID
    client3 = boto3.client('elbv2')
    response3 = client3.register_targets(
        TargetGroupArn='arn:aws:elasticloadbalancing:eu-west-1:506754145427:targetgroup/zeppeling-stg-target/a4160357f4e8daff',
        Targets=[
            {
                'Id': MasterNodeInstanceID
            }
        ],
    )
    return "Done"

 

You could run the Lambda from CloudFormation as follows (recommended by the CloudFormation support team). However, I didn't test it, simply because it requires launching the entire stack again and again until you get it right. Instead, I scheduled the Lambda to run daily, 15 minutes after the EMR cluster was launched. It is a hack, but an easier way to get started.
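If you take that scheduled shortcut, the daily trigger itself can also be created from boto3. A sketch, assuming a 09:00 UTC cluster launch and a 15-minute delay; the rule name and function ARN are placeholders, and the Lambda additionally needs a resource-based permission allowing events.amazonaws.com to invoke it:

```python
def office_hours_schedule(start_hour=9, minutes_after_start=15):
    """EventBridge cron expression (UTC) firing daily, shortly after the cluster is up."""
    return "cron({} {} * * ? *)".format(minutes_after_start, start_hour)

def schedule_registration(function_arn):
    import boto3  # imported here so the cron helper stays testable without AWS deps
    events = boto3.client("events")
    events.put_rule(
        Name="register-emr-target",  # placeholder rule name
        ScheduleExpression=office_hours_schedule(),
    )
    events.put_targets(
        Rule="register-emr-target",
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
```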

The Lambda inside CloudFormation should look like:

"mylambda": {
  "Type": "AWS::Lambda::Function",
  "Properties": {
    "Handler": "index.lambda_handler",
    "Role": { "Fn::GetAtt": ["LambdaExecutionRole", "Arn"] },
    "Code": {
      "S3Bucket": "my-lambda-functions-bucket",
      "S3Key": "mylambda.zip"
    },
    "Runtime": "python3.6",
    "Timeout": 100,
    "Environment": {
      "Variables": {
        "givenDns": { "Fn::GetAtt": ["EMRCluster", "MasterPublicDNS"] }
      }
    }
  }
},
"LambdaExecutionRole": {
  "Type": "AWS::IAM::Role",
  "Properties": {
    "AssumeRolePolicyDocument": {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Service": ["lambda.amazonaws.com"] },
        "Action": ["sts:AssumeRole"]
      }]
    },
    "Path": "/",
    "Policies": [{
      "PolicyName": "root",
      "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
          "Resource": "arn:aws:logs:*:*:*"
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:*"],
          "Resource": "*"
        }]
      }
    }]
  }
}

 

[1] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-emr-cluster.html#aws-resource-emr-cluster-returnvalues
[2] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-targetgroup.html#cfn-elasticloadbalancingv2-targetgroup-targettype
[3] https://aws.amazon.com/new/
[4] https://docs.aws.amazon.com/cli/latest/reference/emr/describe-cluster.html
[5] https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-instances.html
[6] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources.html
[7] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/walkthrough-custom-resources-lambda-lookup-amiids.html
[8] https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/API_RegisterTargets.html
[9] https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/API_DeregisterTargets.html

——————————————————————————————————————————

I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS EMR, CloudFormation

Bootstrapping EMR from 09:00 to 17:00 each workday with AWS CloudFormation and AWS Data Pipeline

This article covers some options for bootstrapping a daily AWS EMR cluster that runs continuously from 09:00 to 17:00, i.e., office hours.

I will be showing two options:

  1. using AWS Data Pipeline and the AWS CLI
  2. using CloudFormation and EMR

The Cluster will consist of:

  1. one master node, on demand.
  2. one data node, on demand, no autoscaling.
  3. a task group with autoscaling, using Spot instances.
  4. applications: Spark, Hive, Presto, Ganglia, and more.
  5. one step action for my custom steps. This includes all my custom configurations, such as Glue connectors, maximizeResourceAllocation, etc.
  6. one DNS name, xxx.myDomain.com, forwarded to the master node's public DNS. This is useful if actual employees use this cluster from 09:00 to 17:00: you can give the look and feel of an "always-on" cluster by letting them query xxx.myDomain.com instead of the AWS EMR master DNS.

 

Important note: use https://jsonformatter.curiousconcept.com/ to reformat the JSONs below easily.

Option 1: Using AWS Data Pipeline to bootstrap AWS EMR

  1. Use Data Pipeline to launch an EMR cluster with a task group, autoscaling, Glue connectors, and the maximizeResourceAllocation config for Spark. You will need a command that looks like:

aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --applications Name=Ganglia Name=Spark Name=Hive Name=Tez Name=Zeppelin Name=Oozie Name=Hue Name=Presto Name=Livy --ec2-attributes '{"KeyName":"walla_omid","AdditionalSlaveSecurityGroups":["sg-a22c"],"InstanceProfile":"sampleOmid53-EMRClusterinstanceProfile-U080RX3ACCZT","SubnetId":"subnet-222","EmrManagedSlaveSecurityGroup":"sg-222","EmrManagedMasterSecurityGroup":"sg-22","AdditionalMasterSecurityGroups":["sg-22"]}' --service-role sampleOmid53-EMRClusterServiceRole-KWO13FMZNHF2 --release-label emr-5.13.0 --log-uri 's3n://aws-logs-12344-eu-west-1/elasticmapreduce/' --steps '[{"Args":["s3://emr-bootstrap/MyBbootstrap-emr.sh"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"Custom JAR"}]' --name 'myEmrCluster' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"r4.xlarge","Name":"Core"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"r4.xlarge","Name":"Master"},{"InstanceCount":0,"BidPrice":"15","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":50,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"TASK","InstanceType":"r4.xlarge","Name":"TaskSpotsNinja"}]' --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"},"Configurations":[]},{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"},"Configurations":[]},{"Classification":"presto-connector-hive","Properties":{"hive.metastore.glue.datacatalog.enabled":"true"},"Configurations":[]},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"},"Configurations":[]}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region eu-west-1

2. If your cluster already exists, use the EMR CLI to list active clusters:

aws emr list-clusters --active --output text | grep CLUSTERS
3. Given the cluster ID from the previous step, use the EMR CLI to describe the cluster and confirm the tags match the resources you need:

aws emr describe-cluster --cluster-id j-1124HDDG47D1 --output text | grep TAGS

4. From here you can proceed on your own and create a script, once you have all the IDs you need.
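As a starting point for such a script, the tag check from step 3 can be sketched with boto3 instead of grep; the tag keys and values here are placeholders:

```python
def has_required_tags(tags, required):
    """True when every required key/value pair appears in the cluster's tag list."""
    present = {t["Key"]: t["Value"] for t in tags}
    return all(present.get(k) == v for k, v in required.items())

def cluster_matches(cluster_id, required):
    import boto3  # imported here so has_required_tags stays testable without AWS deps
    emr = boto3.client("emr")
    tags = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Tags"]
    return has_required_tags(tags, required)
```

For example, `cluster_matches('j-1124HDDG47D1', {'env': 'stg'})` would confirm a hypothetical `env=stg` tag before proceeding.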

Option 1 summary:

  1. a fairly quick and simple way to manage clusters
  2. you need to manage the cluster and instance IDs on your own as input/output for each step of your workflow (create EMR, task resources, assigning the LB, assigning DNS, etc.)

Option 2: Using CloudFormation to bootstrap EMR

Another option is to use CloudFormation: do the work of creating the configuration JSON that tells the stack what to do, and the stack will take care of starting and stopping the correct resources.

Once you have the JSON, you can schedule a Lambda to start the stack at the required time, such as 09:00 (CloudWatch trigger). Example of a Lambda function to start a CloudFormation stack:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    # create_stack also requires TemplateBody or TemplateURL; the URL below is a placeholder
    response = client.create_stack(
        StackName='DevEMR',
        TemplateURL='https://s3.amazonaws.com/my-cfn-templates/DevEMR.json'
    )
    return response

and an example of a Lambda that terminates a stack:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    response = client.delete_stack(StackName='DevEMR')
    return response

An example of a stack template; it gives you a good sense of things. It is a bit too specific, but a good place to get started:

https://github.com/awslabs/aws-cloudformation-templates/blob/master/aws/services/EMR/EMRCLusterGangliaWithSparkOrS3backedHbase.json

 

Since it is very easy to make mistakes in CloudFormation, I have attached several example clusters; each example adds something new to the cluster. This way, you can take the basic example below, start adding to it, and compare against what I did.

The first working example is an EMR cluster with many applications selected, but no task instance group or autoscaling.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "myEmrCluster",
  "Parameters": {
    "EMRClusterName": { "Description": "Name of the cluster", "Type": "String", "Default": "myEmrCluster" },
    "KeyName": { "Description": "Must be an existing Keyname", "Type": "String", "Default": "walla_omid" },
    "MasterInstacneType": { "Description": "Instance type to be used for the master instance.", "Type": "String", "Default": "r4.xlarge" },
    "CoreInstanceType": { "Description": "Instance type to be used for core instances.", "Type": "String", "Default": "r4.xlarge" },
    "NumberOfCoreInstances": { "Description": "Must be a valid number", "Type": "Number", "Default": 1 },
    "SubnetID": { "Description": "Must be Valid public subnet ID", "Default": "subnet-012344e", "Type": "String" },
    "LogUri": { "Description": "Must be a valid S3 URL", "Default": "s3://aws-logs-12313231eu-west-1/elasticmapreduce/", "Type": "String" },
    "S3DataUri": { "Description": "Must be a valid S3 bucket URL", "Default": "s3://aws-logs-12131212-eu-west-1/elasticmapreduce/", "Type": "String" },
    "ReleaseLabel": { "Description": "Must be a valid EMR release version", "Default": "emr-5.13.0", "Type": "String" },
    "Applications": { "Description": "Cluster setup:", "Type": "String", "AllowedValues": [ "Spark", "TBD" ] }
  },
  "Mappings": {},
  "Conditions": {
    "Spark": { "Fn::Equals": [ { "Ref": "Applications" }, "Spark" ] },
    "Hbase": { "Fn::Equals": [ { "Ref": "Applications" }, "Hbase" ] }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [ "EMRClusterServiceRole", "EMRClusterinstanceProfileRole", "EMRClusterinstanceProfile" ],
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Applications": [
          { "Name": "Ganglia" }, { "Name": "Spark" }, { "Name": "Hive" },
          { "Name": "Tez" }, { "Name": "Zeppelin" }, { "Name": "Oozie" },
          { "Name": "Hue" }, { "Name": "Presto" }, { "Name": "Livy" }
        ],
        "Configurations": [
          { "Classification": "hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } },
          { "Classification": "spark", "ConfigurationProperties": { "maximizeResourceAllocation": "true" } },
          { "Classification": "presto-connector-hive", "ConfigurationProperties": { "hive.metastore.glue.datacatalog.enabled": "true" } },
          { "Classification": "spark-hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } }
        ],
        "Instances": {
          "Ec2KeyName": { "Ref": "KeyName" },
          "Ec2SubnetId": { "Ref": "SubnetID" },
          "AdditionalMasterSecurityGroups": [ "sg-ad1234" ],
          "AdditionalSlaveSecurityGroups": [ "sg-aa41234" ],
          "MasterInstanceGroup": { "InstanceCount": 1, "InstanceType": { "Ref": "MasterInstacneType" }, "Market": "ON_DEMAND", "Name": "Master" },
          "CoreInstanceGroup": { "InstanceCount": { "Ref": "NumberOfCoreInstances" }, "InstanceType": { "Ref": "CoreInstanceType" }, "Market": "ON_DEMAND", "Name": "Core" },
          "TerminationProtected": false
        },
        "VisibleToAllUsers": true,
        "JobFlowRole": { "Ref": "EMRClusterinstanceProfile" },
        "ReleaseLabel": { "Ref": "ReleaseLabel" },
        "LogUri": { "Ref": "LogUri" },
        "Name": { "Ref": "EMRClusterName" },
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "ServiceRole": { "Ref": "EMRClusterServiceRole" }
      }
    },
    "EMRClusterServiceRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "elasticmapreduce.amazonaws.com" ] }, "Action": [ "sts:AssumeRole" ] } ]
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole" ],
        "Path": "/"
      }
    },
    "EMRClusterinstanceProfileRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "ec2.amazonaws.com" ] }, "Action": [ "sts:AssumeRole" ] } ]
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role" ],
        "Path": "/"
      }
    },
    "EMRClusterinstanceProfile": {
      "Type": "AWS::IAM::InstanceProfile",
      "Properties": { "Path": "/", "Roles": [ { "Ref": "EMRClusterinstanceProfileRole" } ] }
    }
  },
  "Outputs": {}
}

 

Quick note: CloudFormer and EMR, not recommended...

Another way to get the JSON for CloudFormation is to use CloudFormer: it takes an existing setup and reverse engineers a JSON template from it. The documentation is below, but before you jump right in... EMR is not yet supported in CloudFormer :(. You can use it for:

1) VPCs
2) VPC Network (VPC Subnets, Internet Gateways, Customer Gateways, DHCP Options)
3) VPC Security (Network ACLs, Route Tables)
4) Network (ELB, Elastic IPs, Network Interfaces)
5) Compute (Auto Scaling Groups, EC2 Instances)
6) Storage (EBS Volumes, RDS Instances, DynamoDB Tables, S3 Buckets)
7) Services (SQS Queues, SNS Topics, SimpleDB Domains)
8) Config (Auto Scaling Launch Configurations, RDS Subnet Groups, RDS Parameter Groups)
9) Security (EC2 Security Groups, RDS Security Groups, SQS Queue Policies, SNS Topic Policies, S3 Bucket Policies)
10) Optional Resources (Auto Scaling Policies, CloudWatch Alarms)

As noted, EMR is not yet supported in CloudFormer. I have created a feature request with the internal team to see if they can implement it. However, this service has been in beta since 2015, so it might be a while before EMR support comes out.

CloudFormer documentation:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-using-cloudformer.html


 

Advanced EMR cluster bootstrapping using CloudFormation: an example JSON

As there are no good comprehensive examples for AWS EMR bootstrapping with all the different options, and since it takes a lot of time to debug each change, I am contributing this JSON we use internally to AWS Support, for them to publish in their online resources.

This AWS EMR cluster will contain:

  1. 1 master node (on demand)
  2. 1 data node (on demand)
  3. 1 task node (Spot)
  4. autoscaling, scale in/out
  5. apps: Spark, Hive, Presto, and more
  6. config: maximizeResourceAllocation, Glue for Spark/Hive/Presto

 

Again, use https://jsonformatter.curiousconcept.com/ to reformat the JSON below easily.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Conditions": {
    "Hbase": { "Fn::Equals": [ { "Ref": "Applications" }, "Hbase" ] },
    "Spark": { "Fn::Equals": [ { "Ref": "Applications" }, "Spark" ] }
  },
  "Description": "myEmrCluster",
  "Mappings": {},
  "Outputs": {},
  "Parameters": {
    "Applications": { "AllowedValues": [ "Spark", "TBD" ], "Description": "Cluster setup:", "Type": "String" },
    "CoreInstanceType": { "Default": "r4.xlarge", "Description": "Instance type to be used for core instances.", "Type": "String" },
    "EMRClusterName": { "Default": "myEmrCluster", "Description": "Name of the cluster", "Type": "String" },
    "KeyName": { "Default": "walla_omid", "Description": "Must be an existing Keyname", "Type": "String" },
    "LogUri": { "Default": "s3://aws-logs-111111111-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 URL", "Type": "String" },
    "MasterInstacneType": { "Default": "r4.xlarge", "Description": "Instance type to be used for the master instance.", "Type": "String" },
    "NumberOfCoreInstances": { "Default": 1, "Description": "Must be a valid number", "Type": "Number" },
    "ReleaseLabel": { "Default": "emr-5.13.0", "Description": "Must be a valid EMR release version", "Type": "String" },
    "S3DataUri": { "Default": "s3://aws-logs-1111111-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 bucket URL", "Type": "String" },
    "SubnetID": { "Default": "subnet-123456e", "Description": "Must be Valid public subnet ID", "Type": "String" }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [ "EMRClusterServiceRole", "EMRClusterinstanceProfileRole", "EMRClusterinstanceProfile" ],
      "Properties": {
        "Applications": [
          { "Name": "Ganglia" }, { "Name": "Spark" }, { "Name": "Hive" },
          { "Name": "Tez" }, { "Name": "Zeppelin" }, { "Name": "Oozie" },
          { "Name": "Hue" }, { "Name": "Presto" }, { "Name": "Livy" }
        ],
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "Configurations": [
          { "Classification": "hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } },
          { "Classification": "spark", "ConfigurationProperties": { "maximizeResourceAllocation": "true" } },
          { "Classification": "presto-connector-hive", "ConfigurationProperties": { "hive.metastore.glue.datacatalog.enabled": "true" } },
          { "Classification": "spark-hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } }
        ],
        "Instances": {
          "AdditionalMasterSecurityGroups": [ "sg-1234" ],
          "AdditionalSlaveSecurityGroups": [ "sg-1234" ],
          "CoreInstanceGroup": { "InstanceCount": { "Ref": "NumberOfCoreInstances" }, "InstanceType": { "Ref": "CoreInstanceType" }, "Market": "ON_DEMAND", "Name": "Core" },
          "Ec2KeyName": { "Ref": "KeyName" },
          "Ec2SubnetId": { "Ref": "SubnetID" },
          "MasterInstanceGroup": { "InstanceCount": 1, "InstanceType": { "Ref": "MasterInstacneType" }, "Market": "ON_DEMAND", "Name": "Master" },
          "TerminationProtected": false
        },
        "JobFlowRole": { "Ref": "EMRClusterinstanceProfile" },
        "LogUri": { "Ref": "LogUri" },
        "Name": { "Ref": "EMRClusterName" },
        "ReleaseLabel": { "Ref": "ReleaseLabel" },
        "ServiceRole": { "Ref": "EMRClusterServiceRole" },
        "VisibleToAllUsers": true
      },
      "Type": "AWS::EMR::Cluster"
    },
    "EMRClusterInstanceGroupConfig": {
      "Properties": {
        "Market": "SPOT",
        "AutoScalingPolicy": {
          "Constraints": { "MaxCapacity": 4, "MinCapacity": 0 },
          "Rules": [
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "yarn-scale-out2",
              "Name": "yarn-scale-out2",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "LESS_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 20 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": -1 } },
              "Description": "yarn-scale-in1",
              "Name": "yarn-scale-in1",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 80 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "con-scale-out",
              "Name": "con-scale-out",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 12, "MetricName": "ContainerPendingRatio", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 0.75 } }
            }
          ]
        },
        "BidPrice": "15",
        "EbsConfiguration": { "EbsBlockDeviceConfigs": [ { "VolumeSpecification": { "SizeInGB": "50", "VolumeType": "gp2" }, "VolumesPerInstance": "1" } ], "EbsOptimized": "true" },
        "InstanceCount": 1,
        "InstanceRole": "TASK",
        "InstanceType": "r4.xlarge",
        "JobFlowId": { "Ref": "EMRCluster" },
        "Name": "TaskSpotsNinja"
      },
      "Type": "AWS::EMR::InstanceGroupConfig"
    },
    "EMRClusterServiceRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "elasticmapreduce.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "EMRClusterinstanceProfile": {
      "Properties": { "Path": "/", "Roles": [ { "Ref": "EMRClusterinstanceProfileRole" } ] },
      "Type": "AWS::IAM::InstanceProfile"
    },
    "EMRClusterinstanceProfileRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "ec2.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    }
  }
}

 

Once the cluster is up, you need to run steps to automate your cluster's needs.

Documentation for creating a step that runs a bash script:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html

You can also do it via:

  1. the CLI
  2. the UI:
    1. go to Steps, add a new step
    2. JAR location: s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar
    3. arguments: s3://emr-bootstrap/my-bootstrap-emr.sh
    4. "Add", and wait 🙂
  3. CloudFormation
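For the CLI/code route, the same step can be submitted through the EMR AddJobFlowSteps API. A sketch mirroring the UI example above (script-runner JAR plus a script argument); the script S3 path is the same placeholder used earlier:

```python
def script_runner_step(script_s3_uri, name="CustomBootstrap"):
    """Build the step definition the console form above fills in for you."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [script_s3_uri],
        },
    }

def add_bootstrap_step(cluster_id, script_s3_uri):
    import boto3  # imported here so script_runner_step stays testable without AWS deps
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id,
                                  Steps=[script_runner_step(script_s3_uri)])
```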

 

Adding the step to the same cluster from above changes the JSON as follows:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Conditions": {
    "Hbase": { "Fn::Equals": [ { "Ref": "Applications" }, "Hbase" ] },
    "Spark": { "Fn::Equals": [ { "Ref": "Applications" }, "Spark" ] }
  },
  "Description": "myEmrCluster",
  "Mappings": {},
  "Outputs": {},
  "Parameters": {
    "Applications": { "AllowedValues": [ "Spark", "TBD" ], "Description": "Cluster setup:", "Type": "String" },
    "CoreInstanceType": { "Default": "r4.xlarge", "Description": "Instance type to be used for core instances.", "Type": "String" },
    "EMRClusterName": { "Default": "myEmrCluster", "Description": "Name of the cluster", "Type": "String" },
    "KeyName": { "Default": "walla_omid", "Description": "Must be an existing Keyname", "Type": "String" },
    "LogUri": { "Default": "s3://aws-logs-1234-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 URL", "Type": "String" },
    "MasterInstacneType": { "Default": "r4.xlarge", "Description": "Instance type to be used for the master instance.", "Type": "String" },
    "NumberOfCoreInstances": { "Default": 1, "Description": "Must be a valid number", "Type": "Number" },
    "ReleaseLabel": { "Default": "emr-5.13.0", "Description": "Must be a valid EMR release version", "Type": "String" },
    "S3DataUri": { "Default": "s3://aws-logs-23rt-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 bucket URL", "Type": "String" },
    "SubnetID": { "Default": "subnet-12345", "Description": "Must be Valid public subnet ID", "Type": "String" }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [ "EMRClusterServiceRole", "EMRClusterinstanceProfileRole", "EMRClusterinstanceProfile" ],
      "Properties": {
        "Applications": [
          { "Name": "Ganglia" }, { "Name": "Spark" }, { "Name": "Hive" },
          { "Name": "Tez" }, { "Name": "Zeppelin" }, { "Name": "Oozie" },
          { "Name": "Hue" }, { "Name": "Presto" }, { "Name": "Livy" }
        ],
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "Configurations": [
          { "Classification": "hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } },
          { "Classification": "spark", "ConfigurationProperties": { "maximizeResourceAllocation": "true" } },
          { "Classification": "presto-connector-hive", "ConfigurationProperties": { "hive.metastore.glue.datacatalog.enabled": "true" } },
          { "Classification": "spark-hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } }
        ],
        "Instances": {
          "AdditionalMasterSecurityGroups": [ "sg-1234" ],
          "AdditionalSlaveSecurityGroups": [ "sg-12345" ],
          "CoreInstanceGroup": { "InstanceCount": { "Ref": "NumberOfCoreInstances" }, "InstanceType": { "Ref": "CoreInstanceType" }, "Market": "ON_DEMAND", "Name": "Core" },
          "Ec2KeyName": { "Ref": "KeyName" },
          "Ec2SubnetId": { "Ref": "SubnetID" },
          "MasterInstanceGroup": { "InstanceCount": 1, "InstanceType": { "Ref": "MasterInstacneType" }, "Market": "ON_DEMAND", "Name": "Master" },
          "TerminationProtected": false
        },
        "JobFlowRole": { "Ref": "EMRClusterinstanceProfile" },
        "LogUri": { "Ref": "LogUri" },
        "Name": { "Ref": "EMRClusterName" },
        "ReleaseLabel": { "Ref": "ReleaseLabel" },
        "ServiceRole": { "Ref": "EMRClusterServiceRole" },
        "VisibleToAllUsers": true
      },
      "Type": "AWS::EMR::Cluster"
    },
    "EMRClusterInstanceGroupConfig": {
      "DependsOn": "EMRCluster",
      "Properties": {
        "Market": "SPOT",
        "AutoScalingPolicy": {
          "Constraints": { "MaxCapacity": 4, "MinCapacity": 0 },
          "Rules": [
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "yarn-scale-out2",
              "Name": "yarn-scale-out2",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "LESS_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 20 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": -1 } },
              "Description": "yarn-scale-in1",
              "Name": "yarn-scale-in1",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 80 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "con-scale-out",
              "Name": "con-scale-out",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 12, "MetricName": "ContainerPendingRatio", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 0.75 } }
            }
          ]
        },
        "BidPrice": "15",
        "EbsConfiguration": { "EbsBlockDeviceConfigs": [ { "VolumeSpecification": { "SizeInGB": "50", "VolumeType": "gp2" }, "VolumesPerInstance": "1" } ], "EbsOptimized": "true" },
        "InstanceCount": 1,
        "InstanceRole": "TASK",
        "InstanceType": "r4.xlarge",
        "JobFlowId": { "Ref": "EMRCluster" },
        "Name": "TaskSpotsNinja"
      },
      "Type": "AWS::EMR::InstanceGroupConfig"
    },
    "EMRClusterServiceRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "elasticmapreduce.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "EMRClusterinstanceProfile": {
      "Properties": { "Path": "/", "Roles": [ { "Ref": "EMRClusterinstanceProfileRole" } ] },
      "Type": "AWS::IAM::InstanceProfile"
    },
    "EMRClusterinstanceProfileRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "ec2.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "TestStep": {
      "Type": "AWS::EMR::Step",
      "Properties": {
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Args": [ "s3://emr-bootstrap/Mybootstrap-emr.sh" ],
          "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar"
        },
        "Name": "CustomBootstrap",
        "JobFlowId": { "Ref": "EMRCluster" }
      }
    }
  }
}

 

Another upgrade to the above cluster is adding a DNS name for the master. This is useful when you have a team of analysts connecting to this 0900-to-1700 cluster every day, and you don’t want to change the JDBC settings every day 🙂 so just create a DNS CNAME for the master node of EMR.

{
“AWSTemplateFormatVersion”: “2010-09-09”,
“Conditions”: {
“Hbase”: {
“Fn::Equals”: [
{
“Ref”: “Applications”
},
“Hbase”
]
},
“Spark”: {
“Fn::Equals”: [
{
“Ref”: “Applications”
},
“Spark”
]
}
},
“Description”: “myEmrCluster”,
“Mappings”: {},
“Outputs”: {},
“Parameters”: {
“Applications”: {
“AllowedValues”: [
“Spark”,
“TBD”
],
“Description”: “Cluster setup:”,
“Type”: “String”
},
“CoreInstanceType”: {
“Default”: “r4.xlarge”,
“Description”: “Instance type to be used for core instances.”,
“Type”: “String”
},
“EMRClusterName”: {
“Default”: “myEmrCluster”,
“Description”: “Name of the cluster”,
“Type”: “String”
},
“KeyName”: {
“Default”: “walla_omid”,
“Description”: “Must be an existing Keyname”,
“Type”: “String”
},
“LogUri”: {
“Default”: “s3://aws-logs-506754145427-eu-west-1/elasticmapreduce/”,
“Description”: “Must be a valid S3 URL”,
“Type”: “String”
},
“MasterInstacneType”: {
“Default”: “r4.xlarge”,
“Description”: “Instance type to be used for the master instance.”,
“Type”: “String”
},
“NumberOfCoreInstances”: {
“Default”: 1,
“Description”: “Must be a valid number”,
“Type”: “Number”
},
“ReleaseLabel”: {
“Default”: “emr-5.13.0”,
“Description”: “Must be a valid EMR release version”,
“Type”: “String”
},
“S3DataUri”: {
“Default”: “s3://aws-logs-506754145427-eu-west-1/elasticmapreduce/”,
“Description”: “Must be a valid S3 bucket URL “,
“Type”: “String”
},
“SubnetID”: {
“Default”: “subnet-0647325e”,
“Description”: “Must be Valid public subnet ID”,
“Type”: “String”
}
},
“Resources”: {
“EMRCluster”: {
“DependsOn”: [
“EMRClusterServiceRole”,
“EMRClusterinstanceProfileRole”,
“EMRClusterinstanceProfile”
],
“Properties”: {

“Applications”: [
{
“Name”: “Ganglia”
},
{
“Name”: “Spark”
},
{
“Name”: “Hive”
},
{
“Name”: “Tez”
},
{
“Name”: “Zeppelin”
},
{
“Name”: “Oozie”
},
{
“Name”: “Hue”
},
{
“Name”: “Presto”
},
{
“Name”: “Livy”
}
],
“AutoScalingRole”: “EMR_AutoScaling_DefaultRole”,
“Configurations”: [
{
“Classification”: “hive-site”,
“ConfigurationProperties”: {
“hive.metastore.client.factory.class”: “com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory”
}
},
{
“Classification”: “spark”,
“ConfigurationProperties”: {
“maximizeResourceAllocation”: “true”
}
},
{
“Classification”: “presto-connector-hive”,
“ConfigurationProperties”: {
“hive.metastore.glue.datacatalog.enabled”: “true”
}
},
{
“Classification”: “spark-hive-site”,
“ConfigurationProperties”: {
“hive.metastore.client.factory.class”: “com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory”
}
}
],
“Instances”: {
“AdditionalMasterSecurityGroups”: [
“sg-ad4e13cb”
],
“AdditionalSlaveSecurityGroups”: [
“sg-aa4e13cc”
],
“CoreInstanceGroup”: {
“InstanceCount”: {
“Ref”: “NumberOfCoreInstances”
},
“InstanceType”: {
“Ref”: “CoreInstanceType”
},
“Market”: “ON_DEMAND”,
“Name”: “Core”
},
“Ec2KeyName”: {
“Ref”: “KeyName”
},
“Ec2SubnetId”: {
“Ref”: “SubnetID”
},
“MasterInstanceGroup”: {
“InstanceCount”: 1,
“InstanceType”: {
“Ref”: “MasterInstacneType”
},
“Market”: “ON_DEMAND”,
“Name”: “Master”
},
“TerminationProtected”: false
},
“JobFlowRole”: {
“Ref”: “EMRClusterinstanceProfile”
},
“LogUri”: {
“Ref”: “LogUri”
},
“Name”: {
“Ref”: “EMRClusterName”
},
“ReleaseLabel”: {
“Ref”: “ReleaseLabel”
},
“ServiceRole”: {
“Ref”: “EMRClusterServiceRole”
},
“VisibleToAllUsers”: true
},
“Type”: “AWS::EMR::Cluster”
},
“EMRClusterInstanceGroupConfig”: {
“DependsOn”: “EMRCluster”,
“Properties”: {
“Market”:”SPOT”,
“AutoScalingPolicy”: {
“Constraints”: {
“MaxCapacity”: 4,
“MinCapacity”: 0
},
“Rules”: [
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: 2
}
},
“Description”: “yarn-scale-out2”,
“Name”: “yarn-scale-out2”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “LESS_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “YARNMemoryAvailablePercentage”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 20
}
}
},
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: -1
}
},
“Description”: “yarn-scale-in1”,
“Name”: “yarn-scale-in1”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “GREATER_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “YARNMemoryAvailablePercentage”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 80
}
}
},
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: 2
}
},
“Description”: “con-scale-out”,
“Name”: “con-scale-out”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “GREATER_THAN_OR_EQUAL”,
“EvaluationPeriods”: 12,
“MetricName”: “ContainerPendingRatio”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 0.75
}
}
}
]
},
“BidPrice”: “15”,
“EbsConfiguration”: {
“EbsBlockDeviceConfigs”: [
{
“VolumeSpecification”: {
“SizeInGB”: “50”,
“VolumeType”: “gp2”
},
“VolumesPerInstance”: “1”
}
],
“EbsOptimized”: “true”
},
“InstanceCount”: 1,
“InstanceRole”: “TASK”,
“InstanceType”: “r4.xlarge”,
“JobFlowId”: {
“Ref”: “EMRCluster”
},
“Name”: “TaskSpotsNinja”
},
“Type”: “AWS::EMR::InstanceGroupConfig”
},
“EMRClusterServiceRole”: {
“Properties”: {
“AssumeRolePolicyDocument”: {
“Statement”: [
{
“Action”: [
“sts:AssumeRole”
],
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“elasticmapreduce.amazonaws.com”
]
}
}
],
“Version”: “2012-10-17”
},
“ManagedPolicyArns”: [
“arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole”
],
“Path”: “/”
},
“Type”: “AWS::IAM::Role”
},
“EMRClusterinstanceProfile”: {
“Properties”: {
“Path”: “/”,
“Roles”: [
{
“Ref”: “EMRClusterinstanceProfileRole”
}
]
},
“Type”: “AWS::IAM::InstanceProfile”
},
“EMRClusterinstanceProfileRole”: {
“Properties”: {
“AssumeRolePolicyDocument”: {
“Statement”: [
{
“Action”: [
“sts:AssumeRole”
],
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“ec2.amazonaws.com”
]
}
}
],
“Version”: “2012-10-17”
},
“ManagedPolicyArns”: [
“arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role”
],
“Path”: “/”
},
“Type”: “AWS::IAM::Role”
},
“TestStep”: {
“Type”: “AWS::EMR::Step”,
“Properties”: {
“ActionOnFailure”: “CONTINUE”,
“HadoopJarStep”: {
“Args”: [
“s3://byoo-emr-bootstrap/bootstrap-emr.sh”
],
“Jar”: “s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar”
},
“Name”: “CustomBootstrap”,
“JobFlowId”: {
“Ref”: “EMRCluster”
}
}
},
“myDNSRecord” : {
“Type” : “AWS::Route53::RecordSet”,
“Properties” : {
“HostedZoneName” : “b-yoo.net.”,
“Comment” : “DNS name for my instance. for emr. cloud formation”,
“Name” : “xxx.myDomain.com”,
“Type” : “CNAME”,
“TTL” : “600”,
“ResourceRecords” : [ { “Fn::GetAtt” : [ “EMRCluster”, “MasterPublicDNS” ] } ]
}
}
}
}

You may have considered using an ALB on top of EMR, but currently CloudFormation does not return the instance ID of the master node; you only get the master node’s public DNS, so you can only create a CNAME for it using Route 53. You can involve some Lambda code to get things moving, but there is an open feature request to resolve this issue, so you may want to hold on. 🙂
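If you do go the Lambda route, the function would chain three calls: DescribeCluster for the master’s DNS name, DescribeInstances to resolve its instance ID, and RegisterTargets on the ALB target group. A minimal boto3 sketch follows; the function and argument names are made up, and the response handshake that a real CloudFormation custom resource must perform is omitted:

```python
def extract_instance_ids(describe_instances_response):
    """Pull every InstanceId out of an ec2 DescribeInstances response dict."""
    return [
        instance["InstanceId"]
        for reservation in describe_instances_response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]

def register_master_node(cluster_id, target_group_arn):
    """Resolve the EMR master node's instance ID and register it with the ALB
    target group. Both arguments are assumed to arrive as custom-resource
    properties from the CloudFormation template."""
    import boto3  # lazy import: the Lambda runtime provides boto3

    emr = boto3.client("emr")
    ec2 = boto3.client("ec2")
    elbv2 = boto3.client("elbv2")

    # Same chain as the CLI approach: cluster -> DNS name -> instance ID.
    dns_name = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["MasterPublicDnsName"]
    reservations = ec2.describe_instances(
        Filters=[{"Name": "dns-name", "Values": [dns_name]}]
    )
    master_id = extract_instance_ids(reservations)[0]

    elbv2.register_targets(TargetGroupArn=target_group_arn, Targets=[{"Id": master_id}])
    return master_id
```

A real custom resource would call this from its handler and then signal success or failure back to CloudFormation (for example with the AWS-provided cfnresponse module).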

quick note about deleting the stack

you may want to add the DependsOn attribute to help CloudFormation properly delete the resources on rollback or stack deletion. I am attaching the same JSON with DependsOn added. This JSON also fixes a small mistake we made in the previous JSONs in the autoscaling section: I forgot to add the “Unit” attribute to make sure the value is interpreted as a percentage.

{
“AWSTemplateFormatVersion”: “2010-09-09”,
“Conditions”: {
“Hbase”: {
“Fn::Equals”: [
{
“Ref”: “Applications”
},
“Hbase”
]
},
“Spark”: {
“Fn::Equals”: [
{
“Ref”: “Applications”
},
“Spark”
]
}
},
“Description”: “myEmrCluster”,
“Mappings”: {},
“Outputs”: {},
“Parameters”: {
“Applications”: {
“AllowedValues”: [
“Spark”,
“TBD”
],
“Description”: “Cluster setup:”,
“Type”: “String”
},
“CoreInstanceType”: {
“Default”: “r4.xlarge”,
“Description”: “Instance type to be used for core instances.”,
“Type”: “String”
},
“EMRClusterName”: {
“Default”: “myEmrCluster”,
“Description”: “Name of the cluster”,
“Type”: “String”
},
“KeyName”: {
“Default”: “aws_big_data_demystified”,
“Description”: “Must be an existing Keyname”,
“Type”: “String”
},
“LogUri”: {
“Default”: “s3://aws-logs-123-eu-west-1/elasticmapreduce/”,
“Description”: “Must be a valid S3 URL”,
“Type”: “String”
},
“MasterInstacneType”: {
“Default”: “r4.xlarge”,
“Description”: “Instance type to be used for the master instance.”,
“Type”: “String”
},
“NumberOfCoreInstances”: {
“Default”: 1,
“Description”: “Must be a valid number”,
“Type”: “Number”
},
“ReleaseLabel”: {
“Default”: “emr-5.13.0”,
“Description”: “Must be a valid EMR release version”,
“Type”: “String”
},
“S3DataUri”: {
“Default”: “s3://aws-logs-1234-eu-west-1/elasticmapreduce/”,
“Description”: “Must be a valid S3 bucket URL “,
“Type”: “String”
},
“SubnetID”: {
“Default”: “subnet-1234e”,
“Description”: “Must be Valid public subnet ID”,
“Type”: “String”
}
},
“Resources”: {
“EMRCluster”: {
“DependsOn”: [
“EMRClusterServiceRole”,
“EMRClusterinstanceProfileRole”,
“EMRClusterinstanceProfile”
],
“Properties”: {
“Applications”: [
{
“Name”: “Ganglia”
},
{
“Name”: “Spark”
},
{
“Name”: “Hive”
},
{
“Name”: “Tez”
},
{
“Name”: “Zeppelin”
},
{
“Name”: “Oozie”
},
{
“Name”: “Hue”
},
{
“Name”: “Presto”
},
{
“Name”: “Livy”
}
],
“AutoScalingRole”: “EMR_AutoScaling_DefaultRole”,
“Configurations”: [
{
“Classification”: “hive-site”,
“ConfigurationProperties”: {
“hive.metastore.client.factory.class”: “com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory”
}
},
{
“Classification”: “spark”,
“ConfigurationProperties”: {
“maximizeResourceAllocation”: “true”
}
},
{
“Classification”: “presto-connector-hive”,
“ConfigurationProperties”: {
“hive.metastore.glue.datacatalog.enabled”: “true”
}
},
{
“Classification”: “spark-hive-site”,
“ConfigurationProperties”: {
“hive.metastore.client.factory.class”: “com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory”
}
}
],
“Instances”: {
“AdditionalMasterSecurityGroups”: [
“sg-1234”
],
“AdditionalSlaveSecurityGroups”: [
“sg-1234”
],
“CoreInstanceGroup”: {
“InstanceCount”: {
“Ref”: “NumberOfCoreInstances”
},
“InstanceType”: {
“Ref”: “CoreInstanceType”
},
“Market”: “ON_DEMAND”,
“Name”: “Core”
},
“Ec2KeyName”: {
“Ref”: “KeyName”
},
“Ec2SubnetId”: {
“Ref”: “SubnetID”
},
“MasterInstanceGroup”: {
“InstanceCount”: 1,
“InstanceType”: {
“Ref”: “MasterInstacneType”
},
“Market”: “ON_DEMAND”,
“Name”: “Master”
},
“TerminationProtected”: false
},
“JobFlowRole”: {
“Ref”: “EMRClusterinstanceProfile”
},
“LogUri”: {
“Ref”: “LogUri”
},
“Name”: {
“Ref”: “EMRClusterName”
},
“ReleaseLabel”: {
“Ref”: “ReleaseLabel”
},
“ServiceRole”: {
“Ref”: “EMRClusterServiceRole”
},
“VisibleToAllUsers”: true
},
“Type”: “AWS::EMR::Cluster”
},
“EMRClusterInstanceGroupConfig”: {
“DependsOn”: “EMRCluster”,
“Properties”: {
“Market”: “SPOT”,
“AutoScalingPolicy”: {
“Constraints”: {
“MaxCapacity”: 40,
“MinCapacity”: 0
},
“Rules”: [
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: 4
}
},
“Description”: “yarn-scale-out2”,
“Name”: “yarn-scale-out2”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “LESS_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “YARNMemoryAvailablePercentage”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 20,
“Unit”: “PERCENT”
}
}
},
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: -1
}
},
“Description”: “yarn-scale-in1”,
“Name”: “yarn-scale-in1”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “GREATER_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “YARNMemoryAvailablePercentage”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 80,
“Unit”: “PERCENT”
}
}
},
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: 4
}
},
“Description”: “con-scale-out”,
“Name”: “con-scale-out”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “GREATER_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “ContainerPendingRatio”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 0.75,
“Unit”: “COUNT”
}
}
}
]
},
“BidPrice”: “15”,
“EbsConfiguration”: {
“EbsBlockDeviceConfigs”: [
{
“VolumeSpecification”: {
“SizeInGB”: “50”,
“VolumeType”: “gp2”
},
“VolumesPerInstance”: “1”
}
],
“EbsOptimized”: “true”
},
“InstanceCount”: 1,
“InstanceRole”: “TASK”,
“InstanceType”: “r4.xlarge”,
“JobFlowId”: {
“Ref”: “EMRCluster”
},
“Name”: “TaskSpotsNinja”
},
“Type”: “AWS::EMR::InstanceGroupConfig”
},
“EMRClusterServiceRole”: {
“Properties”: {
“AssumeRolePolicyDocument”: {
“Statement”: [
{
“Action”: [
“sts:AssumeRole”
],
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“elasticmapreduce.amazonaws.com”
]
}
}
],
“Version”: “2012-10-17”
},
“ManagedPolicyArns”: [
“arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole”
],
“Path”: “/”,
“Policies”: [
{
“PolicyName”: “s3fullaccess”,
“PolicyDocument”: {
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Action”: “s3:*”,
“Resource”: “*”
}
]
}
}
]
},
“Type”: “AWS::IAM::Role”
},
“EMRClusterinstanceProfile”: {
“Properties”: {
“Path”: “/”,
“Roles”: [
{
“Ref”: “EMRClusterinstanceProfileRole”
}
]
},
“Type”: “AWS::IAM::InstanceProfile”
},
“EMRClusterinstanceProfileRole”: {
“Properties”: {
“AssumeRolePolicyDocument”: {
“Statement”: [
{
“Action”: [
“sts:AssumeRole”
],
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“ec2.amazonaws.com”
]
}
}
],
“Version”: “2012-10-17”
},
“ManagedPolicyArns”: [
“arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role”
],
“Path”: “/”
},
“Type”: “AWS::IAM::Role”
},
“TestStep”: {
“Type”: “AWS::EMR::Step”,
“DependsOn”: “EMRClusterInstanceGroupConfig”,
“Properties”: {
“ActionOnFailure”: “CONTINUE”,
“HadoopJarStep”: {
“Args”: [
“s3://byoo-emr-bootstrap/bootstrap-emr.sh”
],
“Jar”: “s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar”
},
“Name”: “CustomBootstrap”,
“JobFlowId”: {
“Ref”: “EMRCluster”
}
}
},
“myDNSRecord”: {
“Type”: “AWS::Route53::RecordSet”,
“DependsOn”: [“EMRCluster”],
“Properties”: {
“HostedZoneName”: “b-yoo.net.”,
“Comment”: “DNS name for my instance. for emr. cloud formation”,
“Name”: “xxx.b-yoo.net”,
“Type”: “CNAME”,
“TTL”: “600”,
“ResourceRecords”: [
{
“Fn::GetAtt”: [
“EMRCluster”,
“MasterPublicDNS”
]
}
]
}
}
}
}

Back to the scheduling… via Lambda

Once you have settled on your CloudFormation stack, you may want to trigger it at 09:00 and kill it at 17:00.

Here is a Lambda boto3 code snippet for the launch. Notice this is a more advanced example than the one at the beginning of the blog, as I added the option to select an application in the JSON, and I added the “CAPABILITY_IAM” explicit acknowledgement required by CF:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    response = client.create_stack(
        StackName='StgEMR',
        Parameters=[
            {
                'ParameterKey': 'Applications',
                'ParameterValue': 'Spark'
            },
        ],
        Capabilities=[
            'CAPABILITY_IAM',
        ],
        TemplateURL='https://myBucket/emrClusterCloudFormation.json')
    return response

with a CloudWatch Events trigger using a schedule expression:

Schedule expression: cron(0 9 ? * SUN-THU *)
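If you prefer to wire up that trigger programmatically rather than in the console, a boto3 sketch might look like the one below. The rule and target IDs are made-up names, and the Lambda additionally needs a resource-based permission allowing events.amazonaws.com to invoke it:

```python
def scheduled_rule_kwargs(name, schedule_expression):
    """Build the put_rule arguments for a CloudWatch Events schedule."""
    return {"Name": name, "ScheduleExpression": schedule_expression, "State": "ENABLED"}

def create_morning_trigger(lambda_arn):
    """Create the 09:00 Sunday-Thursday rule and point it at the launch Lambda."""
    import boto3  # lazy import: only needed when actually talking to AWS

    events = boto3.client("events")
    rule = events.put_rule(**scheduled_rule_kwargs("launch-emr-0900",
                                                   "cron(0 9 ? * SUN-THU *)"))
    events.put_targets(
        Rule="launch-emr-0900",
        Targets=[{"Id": "launch-emr-lambda", "Arn": lambda_arn}],
    )
    return rule["RuleArn"]
```

The 17:00 destroy trigger is the same sketch with the cron expression and target Lambda swapped.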

and another Lambda code snippet for the destroy:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    response = client.delete_stack(StackName='StgEMR')
    return response

with a CloudWatch Events trigger using a schedule expression:

Schedule expression: cron(0 17 ? * SUN-THU *)

 

Important note about this Lambda:

you will need to create a role for the Lambda that includes permissions for:

  1. CloudFormation policy that creates and deletes stacks
  2. Route53 for managing the DNS
  3. IAM policies to add those EMR roles
  4. EMR policies to launch clusters
  5. s3 read only to read the JSON file 🙂

I highly recommend using the least-privilege practice to minimise the permissions given to the Lambdas, and running the Lambda in a VPC.
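To make the numbered permission list above concrete, here is a sketch of what the role’s policy document could look like, expressed as a Python dict. The actions are real IAM action names, but the exact set depends on your template, and every Resource is left as "*" where you should scope it down to your own ARNs:

```python
# Sketch of a policy document for the scheduler Lambda's role.
# Each statement mirrors one of the numbered requirements above.
LAMBDA_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "CloudFormation", "Effect": "Allow",
         "Action": ["cloudformation:CreateStack", "cloudformation:DeleteStack",
                    "cloudformation:DescribeStacks"],
         "Resource": "*"},
        {"Sid": "Route53", "Effect": "Allow",
         "Action": ["route53:ChangeResourceRecordSets", "route53:ListHostedZones"],
         "Resource": "*"},
        {"Sid": "IAMForEmrRoles", "Effect": "Allow",
         "Action": ["iam:CreateRole", "iam:DeleteRole", "iam:PassRole",
                    "iam:AttachRolePolicy", "iam:DetachRolePolicy",
                    "iam:CreateInstanceProfile", "iam:DeleteInstanceProfile",
                    "iam:AddRoleToInstanceProfile", "iam:RemoveRoleFromInstanceProfile"],
         "Resource": "*"},
        {"Sid": "Emr", "Effect": "Allow",
         "Action": ["elasticmapreduce:RunJobFlow", "elasticmapreduce:TerminateJobFlows",
                    "elasticmapreduce:DescribeCluster", "elasticmapreduce:AddJobFlowSteps"],
         "Resource": "*"},
        {"Sid": "S3ReadTemplate", "Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": "*"},
    ],
}
```

This dict can be attached via the console, CloudFormation, or iam put-role-policy once serialized to JSON.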

 

Option 2 Summary

  • we learned several ways to launch an EMR cluster from 0900 to 1700
  • Just for perspective, it took me 5 full working days to create the EMR JSON, and 1 hour to work with Lambda. 🙂
  • CloudFormation can achieve many things with EMR. Though the JSON creation process was not trivial, and the documentation was a bit lacking in terms of EMR and CloudFormation, I was able to provide a comprehensive working CloudFormation example for you to play with and customize to your needs.
  • Once the CF JSON is ready, triggering it from Lambda will take you about an hour, including the learning curve.
  • The beauty of using CF is that the stack takes care of the resources; you do not need to handle instance IDs or public DNS names, you simply work with dynamic parameters.
  • The downside of working with EMR and CF is that, as of today, there is no easy and trivial way to add load balancers to the JSON, as EMR won’t return the instance ID of the master node, only its public DNS.

 

Thanks, and have fun!

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS EMR

EMR and watch dog: service-nanny?

Want to have a watchdog that restarts a service if it crashes for any reason?

There are many ways to solve this; here are some of them. Just for the record, the reason I needed it is that I need to start the Spark Thrift Server for JDBC, which crashes every time there is an out-of-memory error.

Solution 1: Linux CRON

Use the good old cron: an entry that executes a small script every 5 minutes (easily customizable). The script checks whether the process is there; if not, it starts it. If you need the Thrift server to be started only on the master node, you can use a step to do that. You can use the script-runner to run a custom script stored in S3 [1].
The advantage of this approach is that it is simple to code and maintain. Additionally, it is not dependent on a particular EMR version or service.

Example of creating an EMR cluster with a script-runner step:

aws emr create-cluster --name "Test cluster" --release-label emr-5.16.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m4.large --instance-count 3 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
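The small script that the cron entry executes could look like the Python sketch below. The process pattern and start-command path are assumptions for a Spark Thrift Server setup, so adjust them to your installation:

```python
import subprocess

# Assumed values - adjust to your own service.
SERVICE_PATTERN = "HiveThriftServer2"  # what the Thrift Server process looks like to pgrep
START_CMD = ["sudo", "/usr/lib/spark/sbin/start-thriftserver.sh"]  # assumed path

def is_running(pgrep_output: str) -> bool:
    """pgrep prints one PID per line; any digit-only line means the process exists."""
    return any(line.strip().isdigit() for line in pgrep_output.splitlines())

def check_and_restart():
    """Start the service if pgrep finds no matching process."""
    result = subprocess.run(["pgrep", "-f", SERVICE_PATTERN],
                            capture_output=True, text=True)
    if not is_running(result.stdout):
        subprocess.run(START_CMD, check=False)

# Cron entry (every 5 minutes), for example:
# */5 * * * * /usr/bin/python3 /home/hadoop/thrift_watchdog.py
```

When deploying the file via cron, add a call to check_and_restart() at the bottom so the script actually runs the check.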

Solution 2 service-nanny (Note: This is untested)

This solution utilizes service-nanny, which is a service watchdog present on all EMR clusters.
Create a service-nanny configuration (/etc/service-nanny/yourservice.conf). This conf file holds some basic info about the process. So: create the conf file, put it in S3, and download it via a step (if you only want it executed on the master node). Once the files are in place, restart service-nanny. You can stop and start service-nanny using the commands below:

sudo /etc/init.d/service-nanny stop
sudo /etc/init.d/service-nanny start

You can find samples for service-nanny under /usr/lib/service-nanny/example. The possible disadvantage here is that if EMR decides to remove service-nanny in some future release, you would need to fall back to Solution 1.

Note: Solution 2 is untested. So please test it thoroughly before using this in production.

resources:
[1] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html

 

——————————————————————————————————————————

AWS EMR

How to increase disk space on master node root partition in EMR

I have tested the following solution on my side using the r4.4xlarge instance type.

Please find the steps below:

Step-1) Increase the root volume of the EMR master node

To navigate to the EMR cluster’s master node’s root EBS volume the following steps can be taken:

– Open the EMR cluster in the EMR console
– Expand the hardware dropdown
– Click on the Instance Group ID that is labelled as the MASTER
– Click on the EC2 instance ID shown in this table, which will open the master node in the EC2 console
– In the “Description” tab in the information panel at the bottom of the console, scroll down and click on the linked device for the “Root device” entry
– The EBS volume will now open in the EC2 console; this is the root EBS volume for the master node
– Now you should be able to choose the “Modify Volume” action from the “Actions” dropdown, and change the volume size!

In this case, I adjusted the size of the EBS volume from 10GB to 50GB, simple as that! Alternatively, you can run the exact CLI command rather than clicking through the console:

aws ec2 modify-volume --region us-east-1 --volume-id vol-xxxxxxxxxxxxxxxxxxxx --size 50 --volume-type gp2

Step-2) Log in to the master node with SSH and run the following commands to check the newly attached size under /dev/xvda.

lsblk
df -h

Step-3) However, it is important to note that at this point you will still not see the additional space on the file system (root partition /dev/xvda1). Run the following commands to grow the root partition and resize the file system onto the new space.

sudo /usr/bin/cloud-init -d single -n growpart
sudo /usr/bin/cloud-init -d single -n resizefs

Step-4) Now run the commands below to verify that the "/" volume has increased to 50GB.

df -h
lsblk

Step-5) After the volume is increased, you can run a quick test: create a sample file and watch the usage of the "/" volume grow.

sudo fallocate -l 10G /test

(This creates a 10GB test file inside "/")

df -h

(Verify root volume mount point increased its usage)

Step-6) After verifying, delete the sample /test file.

sudo rm -rf /test

Note: Please back up any important file or configuration files before performing the operation.


——————————————————————————————————————————



architecture, AWS EMR, Data Engineering

How to restart AWS EMR Hive Metastore

You can check the status of, stop, and start the Hive metastore using the following commands in EMR:

sudo initctl status hive-hcatalog-server

sudo initctl stop hive-hcatalog-server

sudo initctl start hive-hcatalog-server

The logs for the Hive metastore are available on the master node at the path:

/var/log/hive-hcatalog/

 
