
When should we use EMR and when should we use Redshift? EMR vs. Redshift

Use Redshift when

  1. You need a traditional data warehouse
  2. You need the data relatively hot for analytics such as BI
  3. There is no data engineering team
  4. Your queries require joins
  5. You need a cluster 24x7
  6. Your data types are simple, i.e. not arrays or structs
  7. Your data has no nested JSON
  8. You have a petabyte-scale database
  9. You want to analyze massive amounts of data (Spectrum)
  10. You need update/delete
  11. You require an ACID DBMS
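
Items 6 and 7 above concern nested data. As a rough illustration (not taken from the original post), the sketch below shows the kind of flattening a nested JSON record typically needs before it fits a table with simple column types. The function names `flatten` and `explode`, the underscore separator, and the sample event are all hypothetical.

```python
def flatten(record, parent_key="", sep="_"):
    """Turn nested dicts (structs) into a single level of columns."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the struct and prefix its fields with the parent name.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def explode(record, array_key):
    """Emit one flat row per element of the array stored under array_key."""
    base = {k: v for k, v in record.items() if k != array_key}
    rows = []
    for item in record.get(array_key, []):
        row = dict(base)
        if isinstance(item, dict):
            row.update({f"{array_key}_{k}": v for k, v in item.items()})
        else:
            row[array_key] = item
        rows.append(flatten(row))
    return rows


# Hypothetical nested event: one struct ("user") and one array ("items").
event = {
    "user": {"id": 7, "geo": {"country": "IL"}},
    "items": [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 2}],
}
rows = explode(event, "items")
```

Each element of `rows` now has only scalar columns (`user_id`, `user_geo_country`, `items_sku`, `items_qty`), at the cost of repeating the parent fields once per array element.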

Use EMR (SparkSQL, Presto, Hive) when

  1. You need a transient cluster for nightly or hourly automation
  2. Compute elasticity is important (auto scaling on task nodes)
  3. Cost is important: spot instances
  4. Your data scales up to a few hundred TBs
  5. You want to decouple compute and storage (external table + task node + auto scaling). This is a cloud architecture best practice.
  6. You require more flexibility
    1. Complex partitions + dynamic partitioning + insert overwrite
    2. Complex data types
      1. Structs
      2. Arrays <–> nested JSON
    3. Built-in orchestration such as Oozie, although Airflow is more common
    4. Built-in notebooks: mix your code with SQL via Zeppelin
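
Items 1 to 3 above (transient cluster, task-node elasticity, spot pricing) can be sketched as a single cluster request. The following is a minimal sketch, assuming boto3's EMR `run_job_flow` API; the cluster name, release label, instance types, node counts, and S3 script path are placeholders rather than recommendations from this post. It only builds the request dictionary, so it runs without AWS credentials.

```python
def build_transient_cluster_request(script_s3_path, task_node_count=2):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "nightly-etl",            # hypothetical cluster name
        "ReleaseLabel": "emr-5.30.0",     # assumption: pick your own release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                # Task nodes add compute only, so they are the usual
                # candidates for spot pricing.
                {"Name": "task-spot", "InstanceRole": "TASK",
                 "Market": "SPOT", "InstanceType": "m5.xlarge",
                 "InstanceCount": task_node_count},
            ],
            # Transient cluster: shut down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EMR role names
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical script location; replace with your own bucket and key.
request = build_transient_cluster_request("s3://example-bucket/jobs/etl.py")
```

To launch the cluster, pass the dictionary to `boto3.client("emr").run_job_flow(**request)`. Because `KeepJobFlowAliveWhenNoSteps` is false, the cluster terminates after the step completes, which is what makes it transient.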

Watch this meetup video for an in-depth look at big data architecture considerations in AWS.

See the Redshift-specific FAQ below:

Q: When would I use Amazon Redshift vs. Amazon EMR?
Q: Can Redshift Spectrum replace Amazon EMR?
Q: Can I use Redshift Spectrum to query data that I process using Amazon EMR?

— Reference : Redshift faq
https://aws.amazon.com/redshift/faqs/

See the EMR-specific FAQ below:

Q: What can I do with Amazon EMR?
Q: Who can use Amazon EMR?
Q: What can I do with Amazon EMR that I could not do before?
Q: What is the data processing engine behind Amazon EMR?
Q: What is Apache Spark?
Q: What is Presto?

— Reference : EMR faq
https://aws.amazon.com/emr/faqs/

** Point 2. Here are other resources that can help in understanding Redshift and EMR use cases better.

— Reference :
AWS Redshift case studies (look for the case study section):
https://aws.amazon.com/redshift/getting-started/
https://pages.awscloud.com/redshift-proof-of-concept-request.html

— Reference :
AWS EMR case studies (look for the case study section):
https://aws.amazon.com/emr/
https://pages.awscloud.com/GLOBAL_OT_emr-poc_20170530.html

** Point 3. The following AWS blog posts show how EMR and Redshift can be used together in specific use cases.

— How I built a data warehouse using Amazon Redshift and AWS services in record time
https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

— Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP
https://aws.amazon.com/blogs/big-data/build-a-healthcare-data-warehouse-using-amazon-emr-amazon-redshift-aws-lambda-and-omop/

— Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/

Hope this information helps in understanding EMR and Redshift use cases better.

