AWS Athena how to work with JSON

Author: Ariel Yosef

How to query nested JSON in AWS Athena

JSON can contain nested values. Those values can be represented as “key=value” pairs, arrays of values, or arrays of “key=value” pairs.

How to query “key=value”

name age cars
John 30 {car1=Ford, car2=BMW, car3=Fiat}
maria 21 {car1=Ford, car2=BMW, car3=Fiat}
sonia 25 {car1=Ford, car2=BMW, car3=Fiat}

We can see that the “cars” column holds keys and values (a struct).

In order to query that, we simply reference the keys-and-values column (cars) and append “.key”.

Let’s say we want to extract car2 from the cars column:

SELECT name, age, cars.car2 AS car2 FROM our_table

Note: the “as” clause just renames the output column.

Note: car2 is the key we want to extract from the struct in the original table.

name age car2
John 30 BMW
maria 21 BMW
sonia 25 BMW
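
The same “.key” notation should also work in a WHERE clause, so we can filter rows by one of the struct keys. A minimal sketch, reusing the table and column names from the example above (the Ford filter is just an illustration):

SELECT name, age, cars.car2 AS car2
FROM our_table
WHERE cars.car1 = 'Ford'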

How to Query an array:

department users
engineering [Sharon, John, Bob, Sally]

We need to unnest the array column and join it back to the original table.

SELECT department, names FROM dataset
CROSS JOIN UNNEST(users) as t(names)

department names
engineering Sharon
engineering John
engineering Bob
engineering Sally

Here we extract the values from the array and attach each of them to a row from the original table.
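
Once unnested, the new column behaves like any other column, so we can also filter or aggregate on it. A minimal sketch, reusing the same dataset table (the alias names are just for illustration):

SELECT department, count(*) AS team_size
FROM dataset
CROSS JOIN UNNEST(users) AS t(names)
GROUP BY department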

How to query nested JSON with an array of key-value pairs:

city worker
NYC [{name=Sally, department=engineering, age=31}, {name=John, department=finance, age=27}, {name=Amy, department=devops, age=41}]

To query that kind of data, we first need to unnest the array values and then select the columns we want.

First, we will add the unnested column using CROSS JOIN and UNNEST:

select city, worker_unnested from dataset
cross join unnest(dataset.worker) as t(worker_unnested)

city worker_unnested
NYC {name=Sally, department=engineering, age=31}
NYC {name=John, department=finance, age=27}
NYC {name=Amy, department=devops, age=41}

Now we can add to our query the specific fields we want by using the “.key” notation:

select city, worker_unnested.name, worker_unnested.department, worker_unnested.age from dataset
cross join unnest(dataset.worker) as t(worker_unnested)

city worker_unnested.name worker_unnested.department worker_unnested.age
NYC Sally engineering 31
NYC John finance 27
NYC Amy devops 41
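
The unnested struct fields can be used for filtering as well as for selection. A minimal sketch, reusing the same dataset table (the age threshold is just an example):

SELECT city, worker_unnested.name, worker_unnested.age
FROM dataset
CROSS JOIN UNNEST(dataset.worker) AS t(worker_unnested)
WHERE worker_unnested.age > 30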

See another relevant post: AWS Athena Cheat Sheet

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn:

AWS Athena Cheat Sheet

Author: Ariel Yosef

AWS Athena reads the data directly from S3; all you need to do is define the schema and the location where the data is stored in S3, i.e. create tables.

AWS Athena also saves the results of the queries you make, so you will be asked to define a query results bucket before you start working with AWS Athena.

How to create a table over CSV in AWS Athena (read CSV data from S3)

CREATE EXTERNAL TABLE my_csv_table (   -- my_csv_table is a placeholder name
  `col1` string,
  `col2` int,
  `col3` date,        -- yyyy-mm-dd format
  `col4` timestamp,   -- yyyy-mm-dd hh:mm:ss format
  `col5` boolean)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
  's3://bucket/folder'

NOTE: ROW FORMAT defines how Athena is going to read the file.

You can also define a SERDE to tell Athena how to parse the data.

You can see all the SerDes supported by Amazon Athena here.
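
For example, if your CSV files contain quoted fields, one common choice is the OpenCSV SerDe. A minimal sketch, with a hypothetical table name and a shortened column list (to the best of my knowledge, these are the standard OpenCSVSerde property names):

CREATE EXTERNAL TABLE my_quoted_csv_table (
  `col1` string,
  `col2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\')
LOCATION 's3://bucket/folder/'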

How to create a table over JSON files – AWS Athena

CREATE EXTERNAL TABLE my_json_table (   -- my_json_table is a placeholder name
  `col1` string,
  `col2` int,
  `col3` date,        -- yyyy-mm-dd format
  `col4` timestamp,   -- yyyy-mm-dd hh:mm:ss format
  `col5` boolean)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION
  's3://bucket/folder'

To delete a table:

DROP TABLE table_name

How to create a table over JSON with nested values in AWS Athena

CREATE EXTERNAL TABLE my_nested_table (   -- my_nested_table is a placeholder name
  `col1` struct<car1:string,car2:string,car3:string>,
  `col2` int,
  `col3` date,        -- yyyy-mm-dd format
  `col4` timestamp,   -- yyyy-mm-dd hh:mm:ss format
  `col5` boolean)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION
  's3://bucket/folder'
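
Once the struct column is defined this way, it can be queried with the same “.key” notation described in the JSON post above, for example (my_nested_table is the placeholder name used here):

SELECT col1.car2, col2
FROM my_nested_table
WHERE col5 = true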

There are a few more things Athena can do. It can transform your files into another Hadoop storage format, such as Parquet, Avro, or ORC. We will use this for our queries to reduce costs and improve performance.
Before we get into that, we need to understand CTAS.

CTAS – How to create a table from another table 

CREATE TABLE new_table AS   -- new_table and existing_table are placeholder names
SELECT * FROM existing_table

Now that we understand CTAS, we can use it to apply transformation and compression, as in the following example:

CREATE TABLE my_parquet_table   -- table names here are placeholders
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/')
AS SELECT *
FROM my_source_table;

Notice:

  1. Available compressions are: 4mc, snappy, lzo, lz4, bzip2 and gzip
  2. Available storage formats: parquet, avro and orc
  3. PPT to compare the different file formats.
  4. CTAS examples in the AWS Athena documentation.

Avro vs. Parquet vs. ORC:

Schema evolution:
  Avro: most efficient, as the schema is stored as JSON with the file (score: 10/10)
  Parquet: schema evolution is expensive, as it needs to be read and merged across all the Parquet files (score: 3/10)
  ORC: schema evolution is limited to adding new columns and a few cases of column type-widening (score: 5/10)

Compression:
  Avro: less efficient (score: 5/10)
  Parquet: best with Snappy (score: 7/10)
  ORC: best with ZLIB (score: 5/10)

Splittability:
  Avro: partial (score: 7/10)
  Parquet: partial (score: 7/10)
  ORC: full (score: 10/10)

Row/column oriented:
  Avro: row
  Parquet: column
  ORC: column

Optimized processing engines:
  Avro: Kafka
  Parquet: Spark
  ORC: Hive, Presto

Read/write optimized:
  Avro: write
  Parquet: read
  ORC: read
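
Following the comparison above, the same CTAS pattern can be used to write ORC instead of Parquet; to the best of my knowledge Athena exposes an orc_compression property for this. A minimal sketch with placeholder table names:

CREATE TABLE my_orc_table
WITH (
      format = 'ORC',
      orc_compression = 'ZLIB',
      external_location = 's3://bucket/orc-folder/')
AS SELECT *
FROM my_source_table;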

Insert into

insert into target_table select * from source_table   -- table names are placeholders

Example of creating a partitioned table with Athena:

CREATE TABLE my_partitioned_table   -- table names here are placeholders
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/',
      partitioned_by = ARRAY['col1', 'col2'])
AS SELECT *
FROM my_source_table;
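
One caveat worth keeping in mind (as far as I recall from the Athena CTAS documentation): the columns listed in partitioned_by must be the last columns in the SELECT list, so with partitioning it is often safer to list the columns explicitly instead of using SELECT *. A minimal sketch with the placeholder names from above:

CREATE TABLE my_partitioned_table2
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder2/',
      partitioned_by = ARRAY['col1', 'col2'])
-- the partition columns (col1, col2) go last in the SELECT list
AS SELECT col3, col4, col5, col1, col2
FROM my_source_table;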

See our post AWS Athena how to work with JSON

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn:


AWS Demystified – Comprehensive training program suggestion for newbies in 200KM/h

This blog assumes prior knowledge; it is meant to help the reader design a training program for newbies on AWS. Naturally, my big data perspective is applied here.

Learn the following, in increasing order of importance (in my humble opinion).

A general lightweight introduction to AWS & AWS Big Data:

  1. Create an AWS user on an AWS account

Start with this; get the obvious out of the way.

  2. AWS S3 (GCS), AWS S3 CLI

Be sure to understand the basics: upload, download, copy, move, and rsync, from both the GUI and the AWS CLI. Only then move on to more advanced features such as lifecycle policies, storage tiers, encryption, etc.

  3. AWS EC2 (Elastic Compute), how to create a machine, how to connect via SSH

Be sure to use a T instance to play around, and choose Amazon Linux or Ubuntu. Notice the different OS user names required to SSH to each machine.

Be sure to understand what these are:

  1. SSH
  2. SSH tunnel

  4. AWS security groups, how to add an IP and port

Without this section you won't be able to access web/SSH machines.

  5. AWS VPC (Virtual Private Cloud), only if you feel comfortable with network architecture; otherwise skip this topic.
  6. AWS RDS (MySQL, Aurora)

Create a MySQL instance, log in, create a table, insert data from S3, and export data back to S3.

Understand the difference between AWS RDS Aurora and AWS RDS MySQL.

  7. AWS Redshift: learn how to create a cluster, connect to it, and run a query; understand the basic architecture.
  8. AWS Athena: how to create an external table, how to insert data, partitions, MSCK REPAIR TABLE, Parquet, Avro, querying nested data.
  9. AWS Glue

Be sure to understand the two roles of AWS Glue: shared metastore and auto-ETL features.

  10. AWS Kinesis: Streams, Firehose, Analytics. You need to understand messaging and streaming. I also covered off-topic subjects here such as Flume and Kafka.
  11. AWS EMR – (need to talk about with omid).

This is hardcore material: highly advanced material for enterprise-grade lectures.

Be sure to understand when to use Hive, when to use Spark SQL, and when to use Spark Core. Moreover, understand the difference between the different node types: core node, task node, and master node.

  12. AWS IAM: groups, users, role-based architecture, encryption at rest and in motion, resource-based policies, identity-based policies. By now you should have a basic understanding of what IAM is about: identity, authentication, authorization. Still, there are many fine-grained security issues you need to understand. At the very minimum, be sure to understand the difference between a role and a user, how to write a custom policy, and what a resource-based policy is versus an identity-based policy.
  13. AWS Cloud best practices, lightweight
  14. AWS ELB, ALB, Auto Scaling. [Advanced]
  15. AWS Route53, TBD
  16. AWS security
  17. AWS Lambda
  18. AWS SageMaker

TBD subjects:

API Gateway

AWS Cognito.


——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/


200KM/h overview on Big Data in AWS | Part 2

In this lecture we are going to cover AWS Big Data PaaS technologies used to model and visualize data, with a suggested architecture and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/


200KM/h overview on Big Data in AWS | Part 1

In this lecture we are going to cover AWS Big Data PaaS technologies used to ingest and transform data. Moreover, we are going to demonstrate a business use case, a suggested architecture, and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/