Use Redshift when
- When you need a traditional data warehouse
- When you need the data relatively hot for analytics such as BI
- When there is no data engineering team
- When your queries require joins
- When you need a cluster 24x7
- When your data types are simple, i.e. not Arrays or Structs
- When your data has no nested JSON
- When you have a petabyte-scale database
- When you want to analyze massive amounts of data (Redshift Spectrum)
- When you need update/delete
- When you require an ACID DBMS
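To make the "simple data types" and "no nested JSON" points above concrete, here is a minimal Python sketch. The function name and the sample records are hypothetical, invented only for illustration; the check simply flags records whose values are struct-like (dict) or array-like (list), which is the kind of data the list above steers away from Redshift.

```python
from typing import Any, Mapping


def has_nested_types(record: Mapping[str, Any]) -> bool:
    """Return True if any top-level value is struct-like (dict) or array-like (list/tuple)."""
    return any(isinstance(value, (dict, list, tuple)) for value in record.values())


# Hypothetical sample records, made up for this example.
flat_event = {"user_id": 42, "country": "IL", "amount": 19.9}
nested_event = {
    "user_id": 42,
    "address": {"city": "Tel Aviv", "zip": "61000"},  # struct
    "items": [{"sku": "a1", "qty": 2}],               # array of structs
}

print(has_nested_types(flat_event))    # False
print(has_nested_types(nested_event))  # True
```

A record like flat_event maps directly to simple columns, while nested_event is the kind of data that the EMR list in this post treats as a better fit for SparkSQL, Presto, or Hive.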
Use EMR (SparkSQL, Presto, Hive) when
- When you need a transient cluster, for nightly or hourly automation
- When compute elasticity is important (auto scaling on tasks)
- When cost is important: spot instances.
- When your data scales up to a few hundred TBs
- When you want to decouple compute and storage (external table + task node + auto scaling). This is a cloud architecture best practice.
- When you require more flexibility
- Complex partitions + dynamic partitioning + INSERT OVERWRITE. Click on the link for an example.
- Complex data types
  - Structs
  - Arrays <-> nested JSON
- Orchestration built in such as Oozie, although Airflow is more common.
- Notebook built in – mix your code with SQL via Zeppelin
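The transient cluster, spot instance, and task node points above can be sketched as a request for the EMR RunJobFlow API. This is only a sketch under assumptions: the cluster name, release label, instance types, node counts, and S3 script path are all placeholders I made up, and the default EMR roles are assumed to exist in the account. The function only builds the request dictionary, so it runs without AWS access.

```python
from typing import Any, Dict


def transient_cluster_request(name: str, script_s3_uri: str, spot_task_nodes: int) -> Dict[str, Any]:
    """Build keyword arguments for a cluster that runs one Spark step and then terminates."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",  # assumption: pick the release you actually use
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
                # Task nodes hold no HDFS data, so they are the natural place for spot capacity.
                {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": spot_task_nodes},
            ],
            # False makes the cluster transient: it shuts down once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "nightly-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default EMR roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and script path.
request = transient_cluster_request("nightly-etl", "s3://my-bucket/jobs/etl.py", 4)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With boto3 installed and AWS credentials configured, the dictionary could be submitted with boto3.client("emr").run_job_flow(**request), for example from a nightly scheduler such as Airflow.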
Watch this meetup video for an in-depth look at Big Data architecture considerations in AWS.
Please check the following Redshift-specific FAQ entries:
Q: When would I use Amazon Redshift vs. Amazon EMR?
Q: Can Redshift Spectrum replace Amazon EMR?
Q: Can I use Redshift Spectrum to query data that I process using Amazon EMR?
— Reference : Redshift faq
https://aws.amazon.com/redshift/faqs/
Please check the following EMR-specific FAQ entries:
Q: What can I do with Amazon EMR?
Q: Who can use Amazon EMR?
Q: What can I do with Amazon EMR that I could not do before?
Q: What is the data processing engine behind Amazon EMR?
Q: What is Apache Spark?
Q: What is Presto?
— Reference : EMR faq
https://aws.amazon.com/emr/faqs/
** Point 2. Here are other resources that can help you understand Redshift and EMR use cases better.
— Reference :
AWS Redshift-related case studies > look for the case study section:
https://aws.amazon.com/redshift/getting-started/
https://pages.awscloud.com/redshift-proof-of-concept-request.html
— Reference :
AWS EMR-related case studies > look for the case study section:
https://aws.amazon.com/emr/
https://pages.awscloud.com/GLOBAL_OT_emr-poc_20170530.html
** Point 3. I checked some AWS blog posts that show how EMR and Redshift can be used together in specific use cases.
— How I built a data warehouse using Amazon Redshift and AWS services in record time
https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/
— Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP
https://aws.amazon.com/blogs/big-data/build-a-healthcare-data-warehouse-using-amazon-emr-amazon-redshift-aws-lambda-and-omop/
— Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
Hope this information helps in understanding EMR and Redshift use cases better.
Need to learn more about AWS big data?
- Contact me via LinkedIn: Omid Vahdaty
- Website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup, Facebook group and YouTube channel
  - Join our meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
  - Join our Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
  - Subscribe to our YouTube channel: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
——————————————————————————————————————————
I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:
https://www.linkedin.com/in/omid-vahdaty/