The video (don't forget to subscribe to our YouTube channel to help our community):
Some follow-up questions from the meetup attendees via email (answers may be inlined):
1. I did all the steps as you guided in the EMR lecture (#2), including installing Presto and ticking the Glue checkbox (for all of Presto, Spark, and Hive). Then I created the schema, the table, and the Parquet table (in Athena) and ran "insert overwrite" in Hive, exactly the way you guided in the lecture. Now when I open Hue, in Hive I see the new DB (tpch_data), but in Presto I don't see this DB (see the attachment). Moreover, Presto SQL queries are not running (they never finish) when I run Presto from Hue. Might the problem be that I created the DB from Athena?
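[A likely cause: on EMR, each application has its own metastore setting, so Glue must be enabled for Presto specifically, not only for Hive. If the console checkbox was not picked up, the setting can be applied explicitly at cluster creation via an EMR configuration classification, roughly like this (check the EMR docs for your release; this is a sketch, not the attendee's actual config):]

```json
[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore": "glue"
    }
  }
]
```

[With Presto pointed at the Glue Data Catalog, it should see the same databases Athena created, since Athena uses Glue as its catalog too.]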
2. Zeppelin is not loading. When I click on the Zeppelin link on the AWS EMR page, the page does not load. All the other apps load fine; only Zeppelin is stuck.
3. ARCHITECTURE. Can you share your retention policies for the data? I mean, according to your architecture you: 1. gzip the text files and load them into dedicated S3 raw-data buckets (gzipped); 2. clean, transform to Parquet (via Hive), and save the Parquet files in a separate S3 bucket hierarchy; 3. model, enrich, flatten, etc., and again I guess you have yet another S3 bucket hierarchy here. Bottom line, you have three logical layers of data: a. original text; b. original, cleaned Parquet, uncompressed; c. modeled Parquet, gzipped. [Same same but different: I keep one bucket per data source, with three "folders" inside, one folder per layer you mentioned above. This is due to different restrictions I have from the business side on each data source, for example different encryption / retention / access-management requirements.]
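[The layer-a → layer-b step described above (gzipped text in S3 → cleaned Parquet in a second bucket) is, in Hive, roughly a pair of external tables plus an INSERT OVERWRITE; the table, column, and bucket names below are illustrative, not the actual ones from the talk:]

```sql
-- Illustrative names; Hive reads gzipped text files transparently.
CREATE EXTERNAL TABLE raw_events (line STRING)
LOCATION 's3://my-raw-bucket/events/';           -- layer a: gzipped text

CREATE EXTERNAL TABLE clean_events (user_id STRING, ts STRING, payload STRING)
STORED AS PARQUET
LOCATION 's3://my-clean-bucket/events/';         -- layer b: cleaned Parquet

-- The actual cleaning/parsing logic is elided here.
INSERT OVERWRITE TABLE clean_events
SELECT ... FROM raw_events;
```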
The questions are:
3.1. Do you apply the same retention policy to all layers?
3.2. How long do you keep it?
3.3. Do you use / plan to use AWS Glacier for old data?
3.4. How do you handle GDPR? If someone asks to delete data that is six months old, you cannot delete individual rows from Parquet, so the only way to do it is to delete the entire file and regenerate it. For this you need some kind of index to find the files targeted for deletion (across all three layers). Do you have something like that?
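[One minimal way to build such an index is to have every job that writes a file also record which user ids that file contains; on an erasure request you look up (and retire) the affected files across all three layers, then regenerate them without that user's rows. A sketch, with purely illustrative names and paths:]

```python
# Minimal sketch of a GDPR deletion index.
# All names and S3 keys here are illustrative, not from the talk.
from collections import defaultdict


class DeletionIndex:
    """Maps a user id to every file (across all layers) holding its records."""

    def __init__(self):
        self._files_by_user = defaultdict(set)

    def record(self, user_id, file_key):
        # Called by the ingestion/transform jobs whenever a file is written.
        self._files_by_user[user_id].add(file_key)

    def files_to_rewrite(self, user_id):
        # On a deletion request: these files must be deleted and regenerated
        # without the user's rows (Parquet files are immutable).
        return sorted(self._files_by_user.pop(user_id, set()))


idx = DeletionIndex()
idx.record("u1", "s3://raw/2018/06/part-0.gz")
idx.record("u1", "s3://clean/2018/06/part-0.parquet")
idx.record("u2", "s3://clean/2018/06/part-1.parquet")

print(idx.files_to_rewrite("u1"))
# → ['s3://clean/2018/06/part-0.parquet', 's3://raw/2018/06/part-0.gz']
```

[In practice the index would itself live in a table (e.g. in Glue/S3) rather than in memory, but the shape is the same: write-path bookkeeping, read-path lookup.]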
4. Presto + Zeppelin vs. Athena. I understand all the advantages of a managed service, but in this specific case I don't see many. Install a dedicated EMR cluster with Presto, Zeppelin, and Ganglia, work with external tables, data in S3, no Hadoop, plus autoscaling. Why do I need Athena? What do I miss?
Here is a link (you are probably familiar with it) on how to integrate Presto with Zeppelin: https://medium.com/walmartlabs/exploring-presto-and-zeppelin-for-fast-data-analytics-and-visualization-9cb4dca91c3d
The crazy thing is that with Presto you can also query Kafka and Elasticsearch. So theoretically you can build more holistic solutions without Athena, with less cloud lock-in as well.
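[For reference, pointing Presto at Kafka is just a catalog properties file on the coordinator and workers; the broker and topic names below are illustrative placeholders (see the Presto Kafka connector docs):]

```properties
# etc/catalog/kafka.properties — illustrative values
connector.name=kafka
kafka.nodes=broker1:9092,broker2:9092
kafka.table-names=clicks,orders
```

[Elasticsearch works the same way, with connector.name=elasticsearch in its own catalog file.]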
Technical comments from the audience sent via email:
deserializer does not exist: org.openx.data.jsonserde.JsonSerDe at com.facebook.presto.jdbc.PrestoResultSet.resultsException(PrestoResultSet.java)
(This Presto error typically means the table was defined in Hive with the third-party OpenX JSON SerDe, which Presto does not bundle; commonly suggested workarounds are converting the table to Parquet, or adding the SerDe jar to Presto's Hive plugin directory on every node.)
Special thanks to Vlady for the feedback 🙂
Want to get more content about big data?
I put a lot of thought into these blog posts so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me: