When should we use EMR and when should we use Redshift? EMR vs. Redshift

Use Redshift when

When you need a traditional data warehouse
When you need the data relatively hot for analytics such as BI
When there is no data engineering team
When your queries require joins
When you need a cluster 24X7
When your data types are simple, i.e. not Arrays or Structs
When the data has no nested JSON
When you have petabyte scale database
When you want to analyze massive amounts of data (Spectrum)
When you need update/delete
When you require an ACID DBMS
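
To illustrate the points about simple data types and nested JSON, here is a minimal sketch, assuming you choose to flatten nested records into plain columns before loading them into a warehouse table. The `flatten` helper and the sample `event` record are hypothetical and are not part of any AWS library.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict of scalar columns.

    Lists are serialized to JSON strings, because this sketch assumes the
    target table has no array type.
    """
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing child keys with the parent name.
            flat.update(flatten(value, column, sep))
        elif isinstance(value, list):
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


# Hypothetical nested event, as it might arrive from an application log.
event = {
    "user": {"id": 7, "geo": {"country": "IL"}},
    "tags": ["a", "b"],
    "amount": 12.5,
}

row = flatten(event)
print(row)
# {'user_id': 7, 'user_geo_country': 'IL', 'tags': '["a", "b"]', 'amount': 12.5}
```

If most of your data needs this kind of preprocessing, that is a hint that the complex-type support in the EMR engines listed below may be a better fit.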

Use EMR (SparkSQL, Presto, Hive) when

When you need a transient cluster, for night or hourly automation
When compute elasticity is important (auto scaling on tasks)
When cost is important: spot instances.
When your data scales up to a few hundred TBs
When you want to decouple compute and storage (external tables + task nodes + auto scaling). This is a cloud architecture best practice.
When you require more flexibility
1. Complex partitions + dynamic partitioning + insert overwrite.
2. Complex data types
  1. Structs
  2. Arrays <-> nested JSON
3. Built-in orchestration such as Oozie, although Airflow is more common.
4. Built-in notebooks: mix your code with SQL via Zeppelin
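
The transient-cluster, spot-instance and task-node points above can be sketched as a request for boto3's EMR `run_job_flow` call. This is only a sketch under stated assumptions: the cluster name, release label, instance types, node counts and the S3 script path are all hypothetical placeholders, and the role names are the EMR defaults, which may differ in your account. The code only builds the request dictionary; it does not call AWS.

```python
def build_transient_cluster_request(name, script_s3_uri, task_nodes=4):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The cluster runs one Spark step and then terminates (transient cluster).
    Task nodes use the spot market; master and core nodes stay on demand.
    """
    def group(group_name, role, market, count):
        return {
            "Name": group_name,
            "InstanceRole": role,          # MASTER, CORE or TASK
            "Market": market,              # ON_DEMAND or SPOT
            "InstanceType": "m5.xlarge",   # assumption: pick what fits your job
            "InstanceCount": count,
        }

    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",      # assumption: use your preferred release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                group("master", "MASTER", "ON_DEMAND", 1),
                group("core", "CORE", "ON_DEMAND", 2),
                group("task-spot", "TASK", "SPOT", task_nodes),
            ],
            # False means the cluster shuts down once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "nightly-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical job name and script location.
request = build_transient_cluster_request("nightly-etl", "s3://my-bucket/jobs/nightly.py")

# To actually launch the cluster (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Because the data lives in S3 as external tables and the core nodes stay on demand, losing spot task nodes should only slow the job down rather than lose data, which is the point of decoupling compute from storage.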

Watch this meetup video for an in-depth look at big data architecture considerations in AWS.

See the following Redshift-specific FAQ entries:

Q: When would I use Amazon Redshift vs. Amazon EMR?
Q: Can Redshift Spectrum replace Amazon EMR?
Q: Can I use Redshift Spectrum to query data that I process using Amazon EMR?

— Reference : Redshift faq
https://aws.amazon.com/redshift/faqs/

See the following EMR-specific FAQ entries:

Q: What can I do with Amazon EMR?
Q: Who can use Amazon EMR?
Q: What can I do with Amazon EMR that I could not do before?
Q: What is the data processing engine behind Amazon EMR?
Q: What is Apache Spark?
Q: What is Presto?

— Reference : EMR faq
https://aws.amazon.com/emr/faqs/

** Point 2. Here are other resources that can help you understand Redshift and EMR use cases better.

— Reference :
AWS Redshift case studies (look for the case study section):
https://aws.amazon.com/redshift/getting-started/
https://pages.awscloud.com/redshift-proof-of-concept-request.html

— Reference :
AWS EMR case studies (look for the case study section):
https://aws.amazon.com/emr/
https://pages.awscloud.com/GLOBAL_OT_emr-poc_20170530.html

** Point 3. Here are some AWS blog posts that show how EMR and Redshift can be used together in specific use cases.

— How I built a data warehouse using Amazon Redshift and AWS services in record time
https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

— Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP
https://aws.amazon.com/blogs/big-data/build-a-healthcare-data-warehouse-using-amazon-emr-amazon-redshift-aws-lambda-and-omop/

— Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/

Hope this information helps in understanding EMR and Redshift use cases better.

Need to learn more about AWS big data?

Contact me via LinkedIn: Omid Vahdaty
Website: https://amazon-aws-big-data-demystified.ninja/
Join our meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
Join our Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Subscribe to our YouTube channel: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

——————————————————————————————————————————

I put a lot of thought into these blog posts so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/
