There will come a time when you start looking for ways to reduce the cost of your EMR cluster. Note that some of the tips below simply make your jobs run faster, which usually means lower costs simply because of better CPU utilization. If this is your case, consider the following:
- Conceptually, switching to a transient cluster instead of running 24x7 is the best option. It requires some automation work up front, but it may be worth your time.
- Consider using R instances if your EMR jobs are memory intensive. Remember that the overhead of YARN and the JVM is about 50% of the memory you have on the machine. You can confirm this via Ganglia on EMR by checking the amount of available RAM while the jobs are running.
- If you use a larger instance type, the attached NIC will have a higher network bandwidth quota. The difference may be worth it in terms of time vs. cost when working with large external tables: you may get 300MB/s instead of 100MB/s on a smaller instance. If you have witnessed something much faster, please let me know.
- Consider using the "Maximize resource allocation" configuration (maximizeResourceAllocation in the spark classification); this may shorten your running time.
- Naturally, tuning cluster-specific or job-specific configuration may be worth your time as well. Personally I am too lazy to try this, and the risk of these per-query custom configs propagating to other jobs and slowing them down is high due to the human element. If you must tune, tune per session: set the values before your query, and set them back to the defaults after you are done. You may also want to consider a cluster per job type, for example a cluster for join jobs and a cluster for aggregations, or a cluster per environment: DEV, QA, STG, PROD.
- Consider using task nodes. They are provisioned quickly and can be killed and resized freely with zero impact on the stability of your cluster. A piece of advice: make sure the task nodes are the same size as your data nodes, otherwise you risk underutilising your task nodes, because the executor configuration is the same as on the data nodes.
- Consider using Spot instances, especially on task nodes. Save up to 80%!
- Consider using Reserved Instances for your data nodes and master nodes. Save up to 35% on costs.
- Consider using several task groups, with different sizes/configs.
- Auto scaling will help you be more accurate in terms of costs. You can auto scale at a resolution of 5 minutes (the time it takes to provision an EMR node). Auto scaling is highly useful on a 24x7 EMR cluster. Using auto scaling requires some testing, as it does not always behave exactly as you expect. In a cluster with dynamic resource allocation, the resources may be ready, but the boost in performance may take its time. Auto scaling and task nodes saved me about 50%. Naturally, when you save costs using task nodes and auto scaling, you get greedy on a simple performance test, well, until the scale-in kicks in 🙂
- In God we trust, all the rest must bring data: use Ganglia to track the exact amount of resources you need (perhaps you are over-provisioning).
- The minimal recommended cluster size is 3 machines: 1 master and 2 data nodes. Consider the suggestions below:
- EMR with only one machine (a new feature), where the master node and the data node are on the same machine.
- EMR with 1 master and 1 data node, and if you must scale, add task nodes with auto scaling. Note that the minimal number of machines in the task group can be zero. Note also that this should not be used in production, as the stability of your cluster is much lower, even if you are not keeping your data in local tables. If your data node dies, the entire cluster becomes unusable, and this is unrecoverable.
- Encryption at rest and encryption in transit may be good for security reasons, but may have a massive impact on production in terms of resources, running time, etc. Confirm that security is a must before you apply it on a transient cluster, and consult your CISO on this. Note that encryption on S3 is hardware based, but I would still perform a simple benchmark to see the cost-benefit ratio.
- If you can afford it, and it is technically valid, please test your jobs on Hive, Spark and Presto. Furthermore, test different compression types and storage types.
- I know for a fact, from benchmarks I performed, that there are some cases where Hive is faster than Spark.
- I am less familiar with Presto, but I am positive there are use cases where it will be faster.
- From a few benchmarks I performed, you will be surprised to know that using different compression types may have a massive impact on write time to S3 and on read time (if the data is compressed better). I personally work with Parquet with GZIP, but this only works perfectly for my use case.
- Note that compression has an impact on CPU utilisation, so it is not clear-cut which will be cheaper (Parquet/ORC, GZIP, BZIP) nor which will be faster (Spark/Hive/Presto).
- Did you switch to columnar? If not, try this link as a reference: convert to columnar from row-based data.
- Did you use partitioning? Did you use the correct partitioning for your query?
- If using ORC, consider using bucketing on top of partitioning.
- Was your data split into chunks? If so, try changing the chunk size. This is more complicated but doable. Again, it could go either way, so you need to test this with your data.
- Applying hints on the table may reduce the time spent on data scans in some cases.
- If you join multiple tables, the order of the joins may impact the amount of data scanned, shortening running time.
- Consider pre-aggregating data where possible as part of your transformation/cleansing process, even if it is on each row (using window functions, each row will hold the aggregated values).
- Consider pre-calculating a table for heavy GROUP BY queries on raw data, i.e. have the data already calculated on S3, and have your production users/end users query that table.
- Have a data engineer review each query to make sure the data scan is minimised. For example:
- Minimise the columns in the result set. A result set of long strings may be very costly.
- Where possible, switch strings to ints; this will greatly reduce the storage footprint.
- If possible, switch from bigint to tinyint. This will save some disk space as well. Note the list of supported data types: https://prestodb.io/docs/current/language/types.html
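To make the last two tips in the list above a bit more concrete, here is a minimal sketch in plain Python of how much raw space the same column takes as 8-byte integers, 1-byte integers, or strings. The column, its values and the labels are hypothetical, and the numbers are uncompressed sizes only: a columnar format with compression and dictionary encoding will narrow the gap, so treat this as an illustration and benchmark with your own data.

```python
from array import array


def footprint_bytes(values, typecode):
    """Bytes needed to hold `values` in a packed, fixed-width array."""
    packed = array(typecode, values)
    return packed.itemsize * len(packed)


# Hypothetical column: one million status codes, all in the range 0..4.
status_codes = [i % 5 for i in range(1_000_000)]

as_bigint = footprint_bytes(status_codes, "q")   # 8-byte signed integers
as_tinyint = footprint_bytes(status_codes, "b")  # 1-byte signed integers

# The same column stored as (hypothetical) string labels instead of codes.
labels = ["pending", "active", "paused", "closed", "failed"]
as_strings = sum(len(labels[code].encode("utf-8")) for code in status_codes)

print(f"bigint : {as_bigint:>10,} bytes")
print(f"tinyint: {as_tinyint:>10,} bytes")
print(f"strings: {as_strings:>10,} bytes")
```

The point is only the ratio: if every value fits in one byte, a wider type or a string label carries the extra bytes on every single row.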
As you can see, there are many ways to save costs on AWS EMR. The easiest thing will be to use task groups and scale in/out based on your needs. The rest may take some time, but will be worth it.
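To get a feel for how the Spot and Reserved savings mentioned above combine with scaling task nodes in and out, here is a rough back-of-the-envelope sketch. The function name, the price of 1.00 per instance-hour and the node counts are hypothetical, and the 35% and 80% discounts are the best-case "up to" figures from the tips above, not guaranteed rates. It also assumes one instance type everywhere and ignores the EMR service fee that is billed on top of EC2.

```python
def hourly_cluster_cost(on_demand_price, core_nodes, task_nodes,
                        reserved_discount=0.35, spot_discount=0.80):
    """Rough hourly EC2 cost of 1 master + data (core) nodes + task nodes.

    Assumptions (illustrative only): master and data nodes are Reserved,
    task nodes are Spot, and every node uses the same instance type.
    """
    reserved_nodes = 1 + core_nodes  # master + data nodes
    reserved_cost = reserved_nodes * on_demand_price * (1 - reserved_discount)
    spot_cost = task_nodes * on_demand_price * (1 - spot_discount)
    return reserved_cost + spot_cost


# Hypothetical price per instance-hour; replace with your real pricing.
price = 1.00

all_on_demand = (1 + 2 + 4) * price  # 1 master, 2 data, 4 task nodes
blended = hourly_cluster_cost(price, core_nodes=2, task_nodes=4)
scaled_in = hourly_cluster_cost(price, core_nodes=2, task_nodes=0)

print(f"all on-demand:                {all_on_demand:.2f} per hour")
print(f"reserved + spot task nodes:   {blended:.2f} per hour")
print(f"after scale-in to zero tasks: {scaled_in:.2f} per hour")
```

Plug in your own prices and the discounts you actually get, then compare against what Ganglia tells you about real utilisation.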
Need to learn more about AWS big data (demystified)?
- Contact me via LinkedIn: Omid Vahdaty
- website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup, Facebook group and YouTube channel:
- Join our meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- Subscribe to our YouTube channel: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts or questions, or you need someone to consult with, feel free to contact me.