AWS Athena Cheat sheet

Author: Ariel Yosef

In AWS Athena, the engine reads the data directly from S3, and all you need to do is define the schema and the S3 location where the data is stored, i.e. create tables.

AWS Athena also saves the results of the queries you run, so you will be asked to define a results bucket before you start working with AWS Athena.

How to create a table over CSV in AWS Athena (read CSV data from S3)

CREATE EXTERNAL TABLE <table_name>(
  `col1` string, 
  `col2` int, 
  `col3` date,      -- yyyy-MM-dd format
  `col4` timestamp, -- yyyy-MM-dd HH:mm:ss format
  `col5` boolean)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LOCATION 's3://bucket/folder/';

NOTE: ROW FORMAT defines how Athena is going to read the file.

You can also define a SerDe to tell Athena how to parse the data.

You can see all the SerDes supported by Amazon Athena in the documentation.
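For example, the OpenCSVSerde can be configured through SERDEPROPERTIES when the CSV uses quoted fields (a minimal sketch; the table name, columns, and S3 path are placeholders):

CREATE EXTERNAL TABLE <table_name>(
  `col1` string,
  `col2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"')
LOCATION 's3://bucket/folder/';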

How to create a table over JSON files – AWS Athena

CREATE EXTERNAL TABLE <table_name>(
  `col1` string, 
  `col2` int, 
  `col3` date,      -- yyyy-MM-dd format
  `col4` timestamp, -- yyyy-MM-dd HH:mm:ss format
  `col5` boolean)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
LOCATION 's3://bucket/folder/';

To delete a table:

DROP TABLE <table_name>;

NOTE: for an external table, DROP TABLE removes only the table metadata; the underlying data in S3 is not deleted.

How to create a table over JSON with nested values in AWS Athena

CREATE EXTERNAL TABLE <table_name>(
  `col1` struct<car1:string,car2:string,car3:string>, 
  `col2` int, 
  `col3` date,      -- yyyy-MM-dd format
  `col4` timestamp, -- yyyy-MM-dd HH:mm:ss format
  `col5` boolean)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
LOCATION 's3://bucket/folder/';
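Once the table is defined, the nested struct fields can be queried with dot notation (a minimal sketch; the table and field names follow the example above):

SELECT col1.car1, col1.car2
FROM <table_name>
WHERE col5 = true;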

There are a few more things Athena can do. It can transform your files into another Hadoop storage format, such as Parquet, Avro, or ORC. We will use this in our queries to reduce costs and improve performance.
Before we get into that, we need to understand CTAS (CREATE TABLE AS SELECT).

CTAS – How to create a table from another table 

CREATE TABLE <table_name> AS
SELECT * FROM <database.table_name>;

Now that we understand CTAS, we can apply transformation and compression in our queries and run the following example:

CREATE TABLE <table_name>
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/')
AS SELECT * 
FROM <database.table_name>;

Notice:

  1. Available compressions are: 4mc, snappy, lzo, lz4, bzip2, and gzip.
  2. Available storage formats: Parquet, Avro, and ORC.
  3. See the PPT comparing the different file formats.
  4. See the CTAS examples in the AWS Athena documentation.

Comparison of AVRO, PARQUET, and ORC:

Schema evolution
  AVRO: most efficient, as the schema is stored as JSON with the file. Score: 10/10
  PARQUET: schema evolution is expensive, as the schema needs to be read and merged across all the Parquet files. Score: 3/10
  ORC: schema evolution is limited to adding new columns and a few cases of column type-widening. Score: 5/10

Compression
  AVRO: less efficient. Score: 5/10
  PARQUET: best with Snappy. Score: 7/10
  ORC: best with ZLIB. Score: 5/10

Splittability
  AVRO: partial. Score: 7/10
  PARQUET: partial. Score: 7/10
  ORC: full. Score: 10/10

Row/column oriented
  AVRO: row. PARQUET: column. ORC: column.

Optimized processing engines
  AVRO: Kafka. PARQUET: Spark. ORC: Hive, Presto.

Read/write optimized
  AVRO: write. PARQUET: read. ORC: read.

Insert into

INSERT INTO <table_name> SELECT * FROM <database.table_name>;

Example of a partitioned table created by Athena:

CREATE TABLE <table_name>
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/',
      partitioned_by = ARRAY['col1','col2'])
AS SELECT * 
FROM <database.table_name>;

NOTE: in a CTAS query, the partition columns must appear last in the SELECT list.
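After the partitioned table is created, filtering on a partition column lets Athena prune the S3 prefixes it scans, which is the main cost saver (a sketch; the names follow the example above):

SELECT count(*)
FROM <table_name>
WHERE col1 = 'some_value';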

See our post on how to work with JSON in AWS Athena.

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me via LinkedIn:

https://www.linkedin.com/in/omid-vahdaty/


AWS Athena , Hive & Presto Cheat sheet

This is a quick “Cut the bullshit and give me what I need” blog.

Your biggest problem in AWS Athena is how to create a table 🙂

Create a table with a pipe separator

CREATE EXTERNAL TABLE logs (
    id STRING,
    query STRING
)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '|'
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
    LOCATION 's3://myBucket/logs';

Create a table with the CSV SerDe

CREATE EXTERNAL TABLE IF NOT EXISTS logs(
 `date` string,
 `query` string 
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 LOCATION 's3://omidongage/logs';

Create a table with partitioning and Parquet

CREATE EXTERNAL TABLE users (
first string,
last string,
username string
)
PARTITIONED BY (id string)
STORED AS parquet
LOCATION 's3://bucket/folder/';
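When new data lands under new partition prefixes (e.g. s3://bucket/folder/id=1/), the partitions must be registered before they show up in queries. Assuming Hive-style prefixes as above, either repair the whole table or add a partition explicitly:

MSCK REPAIR TABLE users;

ALTER TABLE users ADD PARTITION (id = '1')
LOCATION 's3://bucket/folder/id=1/';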

Create a table on local HDFS using a pipe separator, skipping the header row

create table lineitem (
    l_orderkey    BIGINT ,
    l_partkey     BIGINT ,
    l_suppkey     BIGINT ,
    l_linenumber  BIGINT ,
    l_quantity    DOUBLE ,
    l_extendedprice  DOUBLE ,
    l_discount    DOUBLE ,
    l_tax         DOUBLE ,
    l_returnflag  CHAR(1) ,
    l_linestatus  CHAR(1) ,
    l_shipdate    DATE ,
    l_commitdate  DATE ,
    l_receiptdate DATE,
    l_shipinstruct CHAR(25) ,
    l_shipmode     CHAR(10) ,
    l_comment      VARCHAR(44) 
)
row format delimited 
fields terminated by '|' 
lines terminated by '\n' 
location '/user/hive/warehouse/tpch/lineitem/'
tblproperties ("skip.header.line.count"="1");

CTAS examples in the official documentation – a very good read!

https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html

Hive data types – official documentation

AWS Athena Documentation:

https://docs.aws.amazon.com/athena/latest/ug/create-table.html

https://docs.aws.amazon.com/athena/latest/ug/ctas.html


——————————————————————————————————————————


AWS Demystified – A comprehensive training program suggestion for newbies at 200KM/h

This blog assumes prior knowledge; it is meant to help the reader design a training program for newbies on AWS. Naturally, my big data perspective is applied here.

Learn the following, in rising order of importance (in my humble opinion).

General lightweight introduction to AWS & AWS Big Data :

  1. Create an AWS user on an AWS account

Start with this. Get the obvious out of the way.

  2. AWS S3 (GCS), AWS S3 CLI

Be sure to understand the basics: upload, download, copy, move, and rsync, from both the GUI and the AWS CLI. Only then move on to advanced features such as lifecycle policies, storage tiers, encryption, etc.

  3. AWS EC2 (Elastic Compute): how to create a machine, how to connect via SSH

Be sure to use a T instance to play around, and choose Amazon Linux or Ubuntu. Notice the different OS user names required to SSH into each machine.

Be sure to understand what the following are:

  1. SSH
  2. SSH tunnels
  3. AWS security groups, and how to add an IP and port

Without this section you won't be able to access web/SSH machines.

  4. AWS VPC (Virtual Private Cloud) – only if you feel comfortable around network architecture; otherwise skip this topic.
  5. AWS RDS (MySQL, Aurora)

Create a MySQL instance, log in, create a table, insert data from S3, export data to S3.

Understand the difference between AWS RDS Aurora and AWS RDS MySQL.

  6. AWS Redshift: learn how to create a cluster, connect to it, and run a query; understand the basic architecture.
  7. AWS Athena: how to create an external table, how to insert data, partitions, MSCK REPAIR TABLE, Parquet, Avro, querying nested data.
  8. AWS Glue

Be sure to understand the two roles of AWS Glue: shared metastore and auto-ETL features.

  9. AWS Kinesis: Streams, Firehose, Analytics. You need to understand messaging and streaming. I covered off-topic subjects here, such as Flume and Kafka.
  10. AWS EMR

This is HARD CORE material: highly advanced material for enterprise-grade lectures.

Be sure to understand when to use Hive, when to use Spark SQL, and when to use Spark Core. Moreover, understand the difference between the node types: core node, task node, and master node.

  11. AWS IAM: groups, users, role-based architecture, encryption at rest and in motion, resource-based policies, identity-based policies. By now you should have a basic understanding of what IAM is about: identity, authentication, authorization. Still, there are many fine-grained security issues you need to understand. At the very minimum, be sure to understand the difference between a role and a user, how to write a custom policy, and what a resource-based policy is vs. an identity-based policy.
  12. AWS cloud best practices (lightweight)
  13. AWS ELB, ALB, auto scaling [Advanced]
  14. AWS Route 53, TBD
  15. AWS security
  16. AWS Lambda
  17. AWS SageMaker

TBD subjects:

API Gateway

AWS Cognito


——————————————————————————————————————————


200KM/h overview on Big Data in AWS – Part 2

In this lecture we are going to cover the AWS big data PaaS technologies used to model and visualize data, using a suggested architecture and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————


200KM/h overview on Big Data in AWS – Part 1

In this lecture we are going to cover the AWS big data PaaS technologies used to ingest and transform data. Moreover, we are going to demonstrate a business use case, a suggested architecture, and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————
