architecture, AWS, AWS Big Data Demystified, Data Engineering, meetup

AWS Big Data Demystified – Part 1 2018

Lecture Video:

Spoken language: English.

AWS Big Data Demystified #1.1 | Big Data Architecture Lessons Learned | Lecture keywords:

1. AWS Big Data Demystified #1.1 Big Data Architecture Lessons Learned Omid Vahdaty, Big Data Ninja

2. Disclaimer

  • I am not trying to sell anything.
  • This is purely a knowledge transfer session.
  • You are more than welcome to challenge each slide, during the lecture and afterwards 🙂
  • This lecture was released in 2018; take into account that things change over time.

3. When the data outgrows your ability to process it

  • Volume (100 TB processed per day)
  • Velocity (4 GB/s)
  • Variety (JSON, CSV, …)
  • Veracity (how much of the data is accurate?)

4. In the past: web, API, ops DB, data warehouse (API → DW)




8. Challenges creating big data architecture?

  • What is the business use case?
  • How fast do you need the insights?
  • 15 min – 24 hours delay and above?
  • Less than 15 min?
  • Streaming? Sub-second delay? Sub-minute delay? Streaming with in-flight analytics?
  • How complex are the compute jobs? Aggregations? Joins?

9. Challenges creating big data architecture?

  • What is the velocity?
  • Under 100K events per second? Not a problem.
  • Over 1M events per second? Costly, but doable.
  • Over 1B events per second? Not trivial at all.
  • Volume? Well… it depends.
  • Variety (how are you going to handle different data sources?)
    ○ Structured (CSV)
    ○ Semi-structured (JSON, XML)
    ○ Unstructured (pictures, movies, etc.)

10. Challenges creating big data architecture?

  • Performance targets?
  • Cost targets?
  • Security restrictions?
  • Regulation restrictions?
  • Privacy?
  • Which technology to choose?
  • Datacenter or cloud?
  • Latency?
  • Throughput?
  • Concurrency?
  • Security access patterns?
  • PaaS? Max 7 technologies
  • IaaS? Max 4 technologies

11. Cloud Architecture rules of thumb…

  • Decouple: ○ store ○ process ○ store ○ process ○ insight…
  • Rule of thumb: max 3 technologies in a DC, max 7 technologies in the cloud
  • Don't use more, because of: ○ maintenance ○ training time ○ complexity/simplicity

12. Use Case 1: Analyzing browsing history

  • Data collection: browsing history from an ISP
  • Product: derives user intent and interest for marketing purposes.
  • Challenges: ○ Velocity: 1 TB per day ○ History of: 3M ○ Remote DC ○ Enterprise-grade security ○ Privacy

13. Use Case 2: Insights from location based data

  • Data collection: from a mobile operator
  • Products: ○ derive user intent and interest for marketing purposes ○ derive location-based intent for marketing purposes
  • Challenges: ○ Velocity: 4 GB/s… ○ Scalability: the rate was expected to double every year… ○ Remote DC ○ Enterprise-grade security ○ Privacy ○ Edge analytics

14. Use Case 3: Analyzing location-based events

  • Data collection: streaming
  • Product: building location-based audiences
  • Challenges: minimizing DevOps work on maintaining a DIY streaming system

15. So what is the product?

A big data platform that:
○ collects data from multiple sources
○ analyzes the data
○ generates insights:
 ▪ Smart segments (online marketing)
 ▪ Smart reports (for marketers)
 ▪ Audience analysis (for agencies)
● Customers? ○ Marketers ○ Publishers ○ Agencies

16. My Big Data platform is about:

  • Data collection ○ Online ▪ Messaging ▪ Streaming ○ Offline ▪ Batch ▪ Performance aspects
  • Data transformation (Hive) ○ JSON, CSV, TXT, Parquet, binary
  • Data modeling (R, ML, AI, deep learning, Spark)
  • Data visualization (choose your poison) ● PII regulation + GDPR regulation
  • And: performance… cost… security… simplicity… cloud best practices…

17. Big Data Generic Architecture: Data Collection (file-based ETL from remote DC) → Data Transformation (row to columnar + cleansing) → Data Modeling (joins/agg/ML/R) → Data Visualization. (Text, RAW)

18. Big Data Generic Architecture | Data Collection (Data Collection → Data Transformation → Data Modeling → Data Visualization)

19. Batch Data collection considerations

  • Every hour, about a 30 GB compressed CSV file
  • Why S3? ○ Multipart upload ○ S3 CLI ○ S3 SDK ○ (tip: gzip!)
  • Why a client? It needs to run at the remote DC
  • Why NOT your own client? ○ Involves code → ▪ Bugs? ▪ Maintenance ○ Don't analyze data at the edge, since you can't go back in time.
  • Why not streaming? ○ Less accurate ○ Expensive
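The gzip tip above matters because the hourly CSV compresses well before the multipart upload; a minimal stdlib sketch (the sample CSV content is made up):

```python
import gzip

def compress_for_upload(csv_bytes: bytes) -> bytes:
    """Gzip a CSV payload before shipping it to S3 (smaller, cheaper transfer)."""
    return gzip.compress(csv_bytes)

# Repetitive CSV text compresses very well.
raw = b"timestamp,user_id,url\n" + b"2018-01-01T00:00:00,42,example.com\n" * 1000
packed = compress_for_upload(raw)
print(len(packed) < len(raw))  # the compressed payload is much smaller
```

S3 stores the object as uploaded, so the actual transfer would still go through the S3 CLI or SDK multipart upload described above.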

20. S3 Considerations

  • Security ○ At rest: server-side S3-managed keys (SSE-S3) ○ In transit: SSL / VPN ○ Hardening: user, IP ACL, write-only permission.
  • Upload ○ AWS S3 CLI ○ Multipart upload ○ Aborting incomplete multipart uploads using a bucket lifecycle policy ○ Consider the S3 CLI sync command instead of cp
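The lifecycle tip above, aborting incomplete multipart uploads, can be expressed as a rule in the shape boto3's `put_bucket_lifecycle_configuration` expects; the bucket name and the 7-day window here are illustrative assumptions:

```python
# Lifecycle rule that aborts incomplete multipart uploads after 7 days.
lifecycle = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = apply to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

# Applying it requires credentials, so the call is only sketched:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-ingest-bucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["AbortIncompleteMultipartUpload"]["DaysAfterInitiation"])
```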

21. Sqoop – ETL

  • Open source, part of EMR
  • HDFS to RDBMS and back, via JDBC.
  • E.g., bidirectional ETL between RDS and HDFS
  • Unlikely use case: ETL from a customer's source DB.

22. Flume & Kafka

  • Open source projects for streaming & messaging
  • Popular ● Generic ● Good practice for many use cases (a meetup by itself) ● Highly durable, scalable, extensible, etc.
  • Downside: DIY, non-trivial to get started

23. Data Transfer Options

  • VPN
  • Direct Connect (4 GB/s?)
  • For all other use cases: ○ S3 multipart upload ○ Compression ○ Security
  • Data in motion
  • Data at rest
  • Bandwidth

24. Quick intro to Stream collection

  • Kinesis Client Library (code)
  • AWS Lambda (code)
  • EMR (managed Hadoop)
  • Third party (DIY) ○ Spark Streaming (minimum latency = 1 sec), near real time, with lots of libraries ○ Storm: the most real-time (sub-millisecond), Java code based ○ Flink (similar to Spark)

25. Kinesis

  • Streams – collect @ source and near-real-time processing
  • Near real time ○ High throughput ○ Low cost ○ Easy administration – set the desired level of capacity ○ Delivery to: S3, Redshift, DynamoDB, … ○ Ingress 1 MB/s, egress 2 MB/s, up to 1,000 transactions per second (per shard) ○ Not managed!
  • Analytics – in-flight analytics.
  • Firehose – park your data @ destination.
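Given the per-shard limits cited above (1 MB/s ingress, 1,000 records/s), stream sizing is a quick ceiling calculation; the traffic numbers below are hypothetical:

```python
import math

def shards_needed(ingress_mb_per_s: float, records_per_s: float) -> int:
    """How many shards a stream needs, per the per-shard limits on the slide."""
    by_bytes = math.ceil(ingress_mb_per_s / 1.0)     # 1 MB/s ingress per shard
    by_records = math.ceil(records_per_s / 1000.0)   # 1,000 records/s per shard
    return max(by_bytes, by_records)

# 5 MB/s of small events at 20,000 events/s: the record limit dominates.
print(shards_needed(5, 20_000))  # 20
```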

26. Firehose – for Data parking

  • Not for the fast lane – no in-flight analytics
  • Capture, transform and load ○ Kinesis ○ S3 ○ Redshift ○ Elasticsearch
  • Managed service

27. Comparison of Kinesis product

● Streams ○ Sub-1-sec processing latency ○ Choice of stream processor (generic) ○ For smaller events

● Firehose ○ Zero admin ○ 4 built-in targets (Redshift, S3, search, etc.) ○ Buffering: 60 sec minimum ○ For larger "events"

28. Big Data Generic Architecture | Data Collection (Data Collection → S3 → Data Transformation → Data Modeling → Data Visualization)

29. Big Data Generic Architecture | Transformation (Data Collection → S3 → Data Transformation → Data Modeling → Data Visualization)

30. EMR ecosystem

● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● ZooKeeper (HBase) ● Zeppelin

31. EMR Architecture

● Master node

● Core nodes – like data nodes (with storage: HDFS)

● Task nodes – (extend compute)

● Does not have a standby master node

● Best for transient clusters (they go up and down every night)

32. EMR lesson learned…

● Bigger instance types are good architecture

● Use spot instances – for the task nodes.

● Don't always use Tez (MR? Spark?)

● Make sure you choose network-optimized instances

● Resizing a cluster is not recommended

● Use bootstrap actions to automate cluster setup upon provisioning

● Use steps to automate jobs on a running cluster

● Use Glue to share the Hive metastore
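The "steps" lesson above can be sketched as the step definition boto3's `add_job_flow_steps` takes; `command-runner.jar` is EMR's built-in runner, while the script path, step name, and cluster ID are hypothetical:

```python
# An EMR step that runs a Spark job on an already-running cluster.
step = {
    "Name": "nightly-transform",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's built-in command runner
        "Args": ["spark-submit", "--deploy-mode", "cluster", "s3://my-jobs/transform.py"],
    },
}

# Submitting requires credentials and a cluster ID, so only sketched:
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
print(step["HadoopJarStep"]["Jar"])
```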

33. So use EMR for …

● Most dominant: ○ Hive ○ Spark ○ Presto

● And many more…

● Good for: ○ Data transformation ○ Data modeling ○ Batch ○ Machine learning

34. Hive

● SQL over Hadoop.

● Engines: Spark, Tez, MR

● Not good when you need to shuffle.

● Not peta-scale.

● SerDes: JSON, Parquet, regex, text, etc.

● Dynamic partitions

● Insert overwrite

● Data transformation

● Convert to columnar
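The last three bullets, dynamic partitions plus insert overwrite to convert raw rows to columnar, combine into one common HiveQL pattern; a sketch with hypothetical table and column names:

```python
# HiveQL sketch: convert a raw CSV-backed table to partitioned Parquet.
hiveql = """
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE IF NOT EXISTS events_parquet (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Dynamic partitioning: the partition column comes last in the SELECT.
INSERT OVERWRITE TABLE events_parquet PARTITION (dt)
SELECT user_id, url, dt FROM events_raw;
"""
print("STORED AS PARQUET" in hiveql)
```

Re-running the same statement overwrites only the partitions it touches, which is what makes the hourly-batch transformation idempotent.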

35. Presto

● SQL over Hadoop

● Not always good for joins on two large tables.

● Limited by memory

● Not fault tolerant like Hive.

● Optimized for ad hoc queries

● No insert overwrite

● No dynamic partitions.

● Has some connectors: Redshift and more

● https://amazon-aws-big-data-presto-demystified-everything-you-wanted-to-know-about-presto/

36. Pig

● Distributed shell scripting

● Generates SQL-like operations.

● Engines: MR, Tez

● S3, DynamoDB access

● Use case: for data scientists who don't know SQL, for systems people, for those who want to avoid Java/Scala

● A fair fight compared to Hive in terms of performance only

● Good for unstructured-file ETL: file to file, and use Sqoop.

37. Hue

Hadoop User Experience

● Logs and failures in real time.

● Multiple users

● Native access to S3.

● File browser for HDFS.

● Manipulate the metastore

● Job browser

● Query editor

● HBase browser

● Sqoop editor, Oozie editor, Pig editor

38. Orchestration

● EMR Oozie ○ Open source workflow engine

▪ Workflow: a graph of actions

▪ Coordinator: schedules jobs ○ Supports: Hive, Sqoop, Spark, etc. ● Others: Airflow, KNIME, Luigi, Azkaban, AWS Data Pipeline
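The "workflow = graph of actions" idea can be illustrated with a toy dependency-ordered runner in Python; this is only a sketch of the concept, not Oozie's actual XML workflow format:

```python
# Toy illustration of a workflow as a graph of actions: each action runs
# only after its dependencies have finished (what Oozie/Airflow do at scale).
def run_workflow(deps, run):
    done, order = set(), []
    def visit(action):
        if action in done:
            return
        for dep in deps.get(action, []):  # resolve dependencies first
            visit(dep)
        done.add(action)
        order.append(action)
        run(action)
    for action in deps:
        visit(action)
    return order

deps = {"collect": [], "transform": ["collect"],
        "model": ["transform"], "visualize": ["model"]}
order = run_workflow(deps, run=lambda a: None)
print(order)  # ['collect', 'transform', 'model', 'visualize']
```

A real scheduler adds what this toy omits: cycle detection, retries, calendars (the Coordinator's job), and parallel execution of independent branches.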

39. Big Data Generic Architecture | Transformation (Data Collection → S3 → Data Transformation → Data Modeling → Data Visualization)

40. Big Data Generic Architecture | Modeling (Data Collection → S3 → Data Transformation → Data Modeling → Data Visualization)

41. Spark

● In memory

● 10x to 100x faster

● Good optimizer for distribution

● Rich API

● Spark SQL

● Spark Streaming

● Spark ML (ML lib)

● Spark GraphX (graph processing)

● SparkR

42. Spark Streaming

● Near real time (1 sec latency)

● Like batches of 1-sec windows

● Streaming jobs with API

● Not relevant to us…

43. Spark ML

● Classification

● Regression

● Collaborative filtering

● Clustering

● Decomposition

● Code: java, scala, python, sparkR

44. Spark flavours

● Standalone

● With yarn

● With mesos

45. Spark Downside

● Compute intensive

● Performance gains over MapReduce are not guaranteed.

● Stream processing is actually batch with a very small window.

● Different behavior between Hive and Spark SQL

46. Spark SQL

● Same syntax as Hive

● Optional JDBC via Thrift

● Non-trivial learning curve

● Up to 10x faster than Hive.

● Works well with Zeppelin (out of the box)

● Does not replace Hive

● Spark is not always faster than Hive ● Insert overwrite –

47. Apache Zeppelin

● Notebook – visualizer

● Built-in Spark integration

● Interactive data analytics

● Easy collaboration.

● Uses SQL

● Works on top of Hive / Spark SQL

● Inside EMR.

● Uses in the background: ○ Shiro ○ Livy

48. R + spark R

● Open source package for statistical computing.

● Works with EMR

● The "Matlab" equivalent

● Works with Spark

● Not for developers 🙂 for statisticians

● R is single threaded – use SparkR to distribute.

● Not everything works perfectly.

49. Redshift

● OLAP, not OLTP → analytics, not transactions

● Fully SQL

● Fully ACID

● No indexing

● Fully managed

● Petabyte scale

● Can create a slow queue for long-lasting queries ● DO NOT USE for transformation.

● Good for: DW, complex joins.
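The "slow queue" bullet refers to Redshift workload management (WLM); a sketch of a `wlm_json_configuration` value with an opt-in queue for long-lasting queries, where queue names and concurrency values are illustrative:

```python
import json

# WLM config sketch: a small "slow" queue plus a default queue.
wlm = [
    {"query_group": ["slow"], "query_concurrency": 2},  # long-lasting queries
    {"query_concurrency": 10},                          # default queue
]
wlm_json = json.dumps(wlm)
print(json.loads(wlm_json)[0]["query_group"])  # ['slow']
```

A session would route its queries into the slow queue with `SET query_group TO 'slow';` before running them.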

50. Redshift spectrum

● An extension of Redshift that uses external tables on S3.

● Requires a Redshift cluster.

● Not possible: CTAS to S3, complex data structures, joins.

● Good for ○ Read-only queries ○ Aggregations on exabytes.

51. EMR vs Redshift

● How much data is loaded and unloaded?

● Which operations need to be performed?

● Recycling data? → EMR

● History to be analyzed again and again? → EMR

● Where does the data need to end up? BI?

● Use Spectrum in some use cases (aggregations)?

● Raw data? S3.

52. Hive VS. Redshift

● Amount of concurrency? Low → Hive, high → Redshift

● Access for customers? Redshift?

● Transformation, unstructured, batch, ETL → Hive.

● Peta scale? Redshift ● Complex joins → Redshift

53. Big Data Generic Architecture | Modeling (Data Collection → S3 → Data Transformation → Data Modeling → Data Visualization)

54. Big Data Generic Architecture | Visualize (Data Collection → S3 → Data Transformation → Data Modeling → Data Visualization)

55. Athena

● Presto SQL

● In memory

● Hive metastore for DDL functionality

○ Complex data types ○ Multiple formats ○ Partitions

● Good for: ○ read-only SQL ○ ad hoc queries ○ low cost ○ managed
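An ad hoc Athena query of the kind described above boils down to a single boto3 call; here only the request is built (the database, table, and result-bucket names are hypothetical):

```python
# Request shape for boto3's athena start_query_execution call.
query_request = {
    "QueryString": "SELECT dt, COUNT(*) FROM events_parquet GROUP BY dt",
    "QueryExecutionContext": {"Database": "analytics"},
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}

# Running it requires credentials, so only sketched:
# import boto3
# boto3.client("athena").start_query_execution(**query_request)
print(sorted(query_request))
```

Athena writes result files to the `OutputLocation` bucket and bills per data scanned, which is why the columnar/partitioned layout from the transformation stage keeps it cheap.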

56. Visualize

● QuickSight

● Managed visualizer: simple, cheap

57. Big Data Generic Architecture | Summary (Data Collection → S3 → Data Transformation → Data Modeling → Data Visualization)

58. Summary: Lesson learned

● Productivity of data science and data engineering ○ The common language of both teams IS SQL! ○ A Spark cluster has many bridges: SparkR, Spark ML, Spark SQL, Spark Core.

● Minimize the number of DBs used ○ Different syntax (Presto/Hive/Redshift) ○ Different data types ○ Minimize ETLs via external tables + Glue!

● Streaming is not always justified (what is the business use case? PaaS?)

● Spark SQL ○ Sometimes faster than Redshift ○ Sometimes slower than Hive ○ The learning curve is non-trivial

● Smart big data architecture is all about: ○ faster, cheaper, simpler, more secure.

59. Stay in touch…

● Omid Vahdaty

● +972-54-2384178


● Join our meetup, FB group and youtube channel


I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:
