architecture, AWS athena, AWS EMR, Cloud, Data Engineering, Spark

AWS Big Data in 200KM/h

Lecturer: Omid Vahdaty, 10.5.2022

AWS Big Data ecosystem and architecture best practices. We will provide a quick overview of all the different big data services in AWS.

Video

Slides

Lecturer: Omid Vahdaty, 4.8.2019

Hebrew meetup


How to transform data (TXT, CSV, TSV, JSON) into Parquet? Which technology should we use to model the data: EMR, Athena, Redshift, Spectrum, Glue, Spark, or SparkSQL? How to handle streaming? How to manage costs? Plus performance tips, security tips, and cloud best-practice tips.

Hebrew Video

Slides


——————————————————————————————————————————
I put a lot of thought into these blog posts so that I could share the information in a clear and useful way.
If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn (Omid Vahdaty):

https://www.linkedin.com/in/omid-vahdaty/

AWS athena

How to ignore quoted fields inside a CSV via AWS Athena?

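The idea is to tell Athena, in the CREATE EXTERNAL TABLE statement, to read the file with the OpenCSVSerde and a quoteChar property, so that the surrounding quotes are stripped and separators inside a quoted field are not treated as field boundaries.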

CREATE EXTERNAL TABLE myTable (
  id bigint,
  guid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
STORED AS TEXTFILE
LOCATION 's3://my-bucket/';
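To see locally what a quote character changes, here is a minimal sketch using Python's standard csv module. It is not Athena or the OpenCSVSerde itself. The sample row is made up, and the `delimiter` and `quotechar` arguments only mirror the roles of `separatorChar` and `quoteChar` in the table definition.

```python
import csv
import io

# Made-up sample row: the guid field contains a comma and is wrapped in quotes.
raw = '1,"abc,def"\n'

# A naive split on the separator cuts the quoted field into two pieces
# and leaves the quote characters in the data.
naive = raw.strip().split(",")

# A quote-aware parser keeps the quoted field whole and strips the quotes.
parsed = next(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))

print(naive)   # ['1', '"abc', 'def"']
print(parsed)  # ['1', 'abc,def']
```

The second result is the shape you would expect the two-column table above to need: one value for id and one for guid.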

Also committed to our Big Data Demystified GitHub repository.

architecture, AWS athena, AWS Big Data Demystified, AWS EMR, AWS Redshift, Data Engineering, EMR, Spark

AWS Big Data Demystified – Part 1 [English]

A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article: it covers only a small fraction of the technologies in the Big Data industry…

Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I use AWS, GCP, and data center infrastructure to answer the basic questions of anyone starting their way in the big data world.

How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, or Avro? Which technology should we use to model the data: EMR, Athena, Redshift, Spectrum, Glue, Spark, SparkSQL, GCS, BigQuery, Dataflow, Datalab, or TensorFlow? How to handle streaming? How to manage costs? Performance tips? Security tips? Cloud best-practice tips?

In this meetup we present lecturers working with several cloud vendors, various big data platforms such as Hadoop, data warehouses, and startups working on big data products. Basically, if it is related to big data, this is THE meetup.

Some of our online materials (mixed content from several cloud vendors):

Website:

https://big-data-demystified.ninja (under construction)

Meetups:

Big Data Demystified

Tel Aviv-Yafo, IL
494 Members

Next Meetup

Big Data Demystified | From Redshift to SnowFlake

Sunday, May 12, 2019, 6:00 PM
23 Attending

AWS Big Data Demystified

Tel Aviv-Yafo, IL
635 Members

YouTube channels:

https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber

https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

Audience:

Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D

architecture, AWS athena, meetup

Serverless Data Pipelines

We had the pleasure of hosting Michael Haberman, founder at Topsight:

Serverless is the new kid in town, but let's not forget data, which is also critical for your organisation. In this talk we will look at the benefits of going serverless with your data pipeline, and also at the challenges it raises. This talk will be heavily loaded with demos, so watch out!

AWS Big Data Demystified | Serverless data pipeline

Sunday, Mar 3, 2019, 6:00 PM

Investing.com
Ha-Shlosha St 2 Tel Aviv-Yafo, IL

56 Members Went

Agenda:
18:00 Networking and gathering
18:30 “A Polylog about Redis”, Itamar Haber
19:15 “Serverless data pipeline”, Michael Haberman

Lecturer: Itamar Haber, Technology Evangelist
Bio: a self-proclaimed “Redis Geek”, Itamar is the Technology Evangelist at Redis Labs, the home of op…

AWS athena

AWS Athena Error: Query exhausted resources at this scale factor

Author: Omid Vahdaty, 12.2.2019

Athena is a serverless technology, i.e., it runs on shared resources within AWS. When users around the world submit a large number of queries concurrently, resource exhaustion sometimes takes place.
The Athena service team has identified this as a known issue.

However, this error is transient in nature: if you submit the query again, it might succeed.
If you get the same error consistently, you might need to partition your data and optimize the query further, as described in Performance Tuning Best Practices for Athena. Another option is to follow the blog post Tips to reduce costs on AWS SQL Athena, which might reduce resource consumption.

AWS support team suggestions:

  1. Avoid submitting queries at the beginning or end of an hour. If a query fails, back off exponentially by some minutes and try to submit the query again. [Weird, but that's an official answer…]
  2. It is highly recommended to adopt Amazon Athena best practices to optimize your query and your data.
  3. Use columnar data formats, which can drastically reduce resource consumption.
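The "back off exponentially and retry" advice in the first suggestion can be sketched as below. This is a minimal illustration, not an official AWS recipe. `QueryExhaustedError` and the `run_query` callable are hypothetical placeholders for however you submit the query and detect this particular error, and the 60-second base delay and doubling factor are assumptions chosen only to match "some minutes".

```python
import random
import time


class QueryExhaustedError(Exception):
    """Hypothetical error raised by run_query when Athena reports resource exhaustion."""


def backoff_delays(base_seconds=60.0, factor=2.0, max_retries=4):
    """Return the wait before each retry: base, base*factor, base*factor**2, ..."""
    return [base_seconds * factor ** attempt for attempt in range(max_retries)]


def run_with_backoff(run_query, base_seconds=60.0, factor=2.0, max_retries=4,
                     jitter=0.0, sleep=time.sleep):
    """Call run_query(); on QueryExhaustedError wait and retry with growing delays."""
    delays = backoff_delays(base_seconds, factor, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return run_query()
        except QueryExhaustedError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Optional random jitter spreads out retries coming from many clients.
            sleep(delays[attempt] + random.uniform(0, jitter))
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting. In real use, `run_query` would wrap your own code for submitting the query and would raise the placeholder error only for this specific failure, so that other failures are not retried blindly.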
