AWS Big Data Demystified - Part 2 | Big Data Demystified

The video (don't forget to subscribe to our YouTube channel to help our community):

Lecture slides:

Some follow-up questions from the meetup attendees via email (answers are inlined):

1. I did all the steps as you guided in the EMR lecture (#2), including installing Presto and ticking the Glue check mark (for all: Presto, Spark and Hive).

Then I created a schema, a table and a Parquet table (in Athena) and ran “insert overwrite” in Hive, exactly the same way as you guided in the lecture.

Now when I open Hue, in Hive I see the new DB (tpch_data), but in Presto I don’t see this DB.

see the attachment

Moreover, Presto queries never finish when I run Presto from Hue.

Could the problem be that I created the DB from Athena?

If you create the table in Hive, you have to put the database name before the table name,

otherwise it will be created in Hive's local (default) database.

I have not used Presto except for testing (when I created the architecture). I believe this is a common mistake; we make it here in my team as well.

If you have a schema created in Athena called sampleDB,

then in Hive (or wherever you create a table)

you must use

create table sampleDB.tableName ()

This should do the trick. Notice the list of databases on the left in Hue (default, tpch_data); open them to see where the table was created.

If not, let me know, and we can resolve this via screen sharing.
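To make the database prefix hard to forget, here is a minimal sketch in Python that builds the CREATE TABLE statement and refuses to run without a database name. The helper name, the columns and the S3 location are all hypothetical and only there for illustration; adapt them to your own schema.

```python
def qualified_create_table(database, table, columns, location):
    """Build a Hive DDL statement that always carries an explicit database prefix.

    columns is a list of (name, hive_type) tuples; every value here is illustrative.
    """
    if not database:
        # Without a prefix the table would land in whatever database is current in Hive.
        raise ValueError("database name is required")
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical columns and bucket, using the sampleDB name from the answer above.
    print(qualified_create_table(
        "sampleDB",
        "tableName",
        [("id", "bigint"), ("name", "string")],
        "s3://my-hypothetical-bucket/sampleDB/tableName/",
    ))
```

The printed statement can be pasted into Hue or the Hive CLI as is.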

2. Zeppelin is not loading. When I click on the Zeppelin link in the AWS EMR page, the page does not load…

All other apps load fine – only Zeppelin is stuck.

Did you open the SG for port 8890?

Are you using a tunnel?

Did you use FoxyProxy to enable access in your browser?

Did you configure something special for Zeppelin?

A firewall issue on your office side?

Try restarting Zeppelin: sudo stop zeppelin, then sudo start zeppelin

If all of the above fails, screen sharing will help 🙂
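A quick way to tell a network problem (SG, tunnel, firewall) from a Zeppelin problem is to test whether the port accepts TCP connections at all. A minimal sketch in Python, using only the standard library; the host name in the usage example is a placeholder, not a real address.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


if __name__ == "__main__":
    # Placeholder host: use your EMR master DNS name, or localhost when tunnelling.
    print(port_is_open("localhost", 8890))
```

If this returns False from your machine but True when run on the master node itself, the issue is most likely on the network side rather than in Zeppelin.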

3. ARCHITECTURE

1. Can you share your retention policies for the data?

I mean according to your architecture you do:

1. gzip text files and load them to dedicated S3 raw-data buckets (gzipped)

2. clean, transform to Parquet (via Hive) and save the Parquet files in another S3 bucket hierarchy

3. modeling, data enrichment, flattening, etc., and again I guess you have another S3 bucket hierarchy here

Bottom line, you have 3 logical layers of data:

a. original text

b. original, cleaned Parquet, uncompressed

c. modeled Parquet, gzip

[Same same but different. I keep one bucket per data source, with 3 “folders” inside, one folder per layer you mentioned above. This is due to the different restrictions I have from the business side on each data source, for example different encryption / retention / access management.]

The questions are :

3.1. Do you apply the same retention policy to all layers?

Yes and no.

No: because I treat raw data differently. I delete it after I have finished the initial transformation to Parquet and gzip (no cleansing or changes).

There are also different encryption policies for each data source.

Yes: once the data is transformed and modeled, I keep it for 120 days (the cookie life cycle is 90 days).

3.2. How long do you keep it?

120 days, per use case.

3.3. Do you use / plan to use AWS Glacier for old data?

No need for cold data in my use case. 🙂 Check the costs…

3.4. How do you handle GDPR? Say someone asks to delete data that is 6 months old. You cannot delete from Parquet, so the only way to do this is to delete the entire file and regenerate it. For this you need some kind of index to find the files targeted for deletion (for all 3 layers).

Do you have something like that?

Keep the data partitioned per day, have an S3 lifecycle policy on the bucket for 180 days (per specific path), run msck daily, and use Hive external tables + dynamic partitioning for inserts. This way, when you delete data at the bucket level and run msck, the table is “updated” with the removed partitions, and the jobs do not fail.

Note: there is no insert overwrite or dynamic partitioning in Presto…
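The per-day layout plus lifecycle policy described above can be sketched as follows. This is only an illustration in Python: the "modeled/" prefix, the dt=YYYY-MM-DD partition naming and the table name are assumptions, not the actual layout used in the lecture. The dictionary has the shape that boto3's put_bucket_lifecycle_configuration expects, but the call itself is left as a comment so the sketch runs without AWS credentials.

```python
from datetime import date


def lifecycle_rule(prefix, days):
    """Build an S3 lifecycle configuration that expires objects under prefix after N days."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def partition_prefix(base_prefix, day):
    """Return the S3 prefix of one daily partition, assuming a dt=YYYY-MM-DD layout."""
    return f"{base_prefix}dt={day.isoformat()}/"


def msck_statement(database, table):
    """Return the Hive statement to run daily so the metastore is re-synced with S3."""
    return f"MSCK REPAIR TABLE {database}.{table}"


if __name__ == "__main__":
    config = lifecycle_rule("modeled/", 180)
    # With credentials in place this could be applied roughly like so (bucket name is hypothetical):
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="my-hypothetical-bucket", LifecycleConfiguration=config)
    print(config)
    print(partition_prefix("modeled/", date(2018, 5, 1)))
    print(msck_statement("sampleDB", "tableName"))
```

Because each day lives under its own prefix, a deletion request for a given date range maps to a small, known set of prefixes instead of a search through every file.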

As for GDPR…

We started working on obfuscating the data and deleting the raw data, keeping the obfuscated data anonymized, adding opt-in and opt-out procedures, and checking whether the person is an EU citizen. We have much work to do. [See lecture 4…]

4. Presto+Zeppelin vs. Athena

I understand all the advantages of a managed service, but in this specific case I don’t see many.

Install a dedicated EMR cluster with Presto, Zeppelin and Ganglia, working with external tables, data in S3. No Hadoop. + Autoscale

Why do I need Athena? What am I missing?

This is a good question. Yes, both are applicable. It is a matter of cost/performance ratio and more. If you have many queries per day, the cost of Athena will be high, as the cost is driven mainly by the TB read per day. However, if you have the same query running over and over within 24 hours, the results are cached, so the cost will be lower.

So at the end of the day it is a matter of use case, expected concurrency, budget, performance targets, and simplicity vs. flexibility (managed notebooks are not a feature of Athena, but for ad hoc queries / a minimal BI backend it should do the trick).

What will be cheaper? I need to know how many TBs are read to answer this properly, as Athena uses on-demand resources, while EMR may use spot instances and auto scaling.

What will be faster? It could go either way depending on the cluster, as EMR uses dedicated resources.
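The "what will be cheaper" question comes down to simple arithmetic once you know the TB scanned per day. A rough back-of-the-envelope sketch in Python: the 5 USD per TB scanned is an assumption based on Athena's published on-demand price (check the current price for your region), and the node count and hourly node price are made-up numbers, not a quote for any real instance type.

```python
def athena_daily_cost(tb_scanned_per_day, usd_per_tb=5.0):
    """Athena cost per day, assuming billing purely per TB scanned."""
    return tb_scanned_per_day * usd_per_tb


def emr_daily_cost(nodes, usd_per_node_hour, hours_per_day=24.0):
    """EMR cost per day for a cluster of identical nodes (EC2 + EMR fee folded into one rate)."""
    return nodes * usd_per_node_hour * hours_per_day


def break_even_tb_per_day(nodes, usd_per_node_hour, hours_per_day=24.0, usd_per_tb=5.0):
    """TB scanned per day above which the EMR cluster becomes the cheaper option."""
    return emr_daily_cost(nodes, usd_per_node_hour, hours_per_day) / usd_per_tb


if __name__ == "__main__":
    # Hypothetical cluster: 5 nodes at 0.25 USD per node-hour, running all day.
    print(emr_daily_cost(5, 0.25))
    print(break_even_tb_per_day(5, 0.25))
    print(athena_daily_cost(2.0))
```

Spot instances and auto scaling lower the effective node-hour rate and the hours per day, which pushes the break-even point down in favor of EMR.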

Here is the link (you are probably familiar with it) on how to integrate Presto with Zeppelin:

https://medium.com/walmartlabs/exploring-presto-and-zeppelin-for-fast-data-analytics-and-visualization-9cb4dca91c3d

The crazy stuff is that with Presto you can query Kafka and Elastic. So theoretically you can build more holistic solutions without Athena. Also less cloud lock-in.

Sounds great! Assuming you need to query Kafka and Elastic.

It will take a bit longer to implement, as there will probably be some learning curve, but other than that, it sounds great. I would not worry about vendor lock-in 🙂

Technical comments from the audience sent via email:

Hi Omid, in your #2 presentation, in the slides about Hive, you use ‘org.openx.data.jsonserde.JsonSerDe’.

This is a very old SerDe, and indeed you mentioned that in the slide. But the problem is that if you create a table with this SerDe in Hive, you can NOT work with that table in Presto. When you run any query in Presto on this table, you get the exception:

deserializer does not exist: org.openx.data.jsonserde.JsonSerDe at com.facebook.presto.jdbc.PrestoResultSet.resultsException(PrestoResultSet.java

The correct SerDe to use is ‘org.apache.hive.hcatalog.data.JsonSerDe’. It’s in the Presto classpath, so Presto works properly when you switch to it.
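If you already have table definitions using the old SerDe, the switch is a one-line change in the DDL. A minimal sketch in Python that rewrites the class name in an existing statement; the example table, columns and S3 location are hypothetical. The two SerDes do not accept the same table properties, so any SERDEPROPERTIES in your DDL may need a manual review after the swap.

```python
OLD_SERDE = "org.openx.data.jsonserde.JsonSerDe"
NEW_SERDE = "org.apache.hive.hcatalog.data.JsonSerDe"


def switch_json_serde(ddl):
    """Replace the OpenX JSON SerDe class name with the HCatalog one in a DDL string."""
    return ddl.replace(OLD_SERDE, NEW_SERDE)


if __name__ == "__main__":
    # Hypothetical table definition, for illustration only.
    old_ddl = (
        "CREATE EXTERNAL TABLE sampleDB.events (id bigint, payload string)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        "LOCATION 's3://my-hypothetical-bucket/events/'"
    )
    print(switch_json_serde(old_ddl))
```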

————————————————-

Special thanks to vlady for the feedback 🙂

Want to get more content about big data?
- Contact me via LinkedIn: Omid Vahdaty
- website: https://amazon-aws-big-data-demystified.ninja/
- subscribe to our AWS Big Data Demystified youtube channel
- subscribe to our Big Data Demystified youtube channel

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/
