AWS Athena Cheat Sheet

Author: Ariel Yosef

In AWS Athena, the engine reads data directly from S3. All you need to do is define the schema and the S3 location where the data is stored, i.e. create tables.

AWS Athena also saves the results of the queries you run, so you will be asked to define a results bucket before you start working with AWS Athena.
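Before creating tables, you will also need a database to hold them. A minimal sketch (my_database is a placeholder name):

CREATE DATABASE IF NOT EXISTS my_database;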

How to create a table over CSV in AWS Athena (read CSV data from S3)

CREATE EXTERNAL TABLE <table_name>(
  `col1` string, 
  `col2` int, 
  `col3` date,      -- expects yyyy-MM-dd format
  `col4` timestamp, -- expects yyyy-MM-dd HH:mm:ss format
  `col5` boolean)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LOCATION
  's3://bucket/folder'

NOTE: ROW FORMAT defines how Athena is going to read the file.

You can also define a SerDe to tell Athena how to parse the data.

You can find all supported SerDes in the Amazon Athena documentation.
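For example, if your CSV files contain quoted fields, you can use the OpenCSVSerde instead of ROW FORMAT DELIMITED. A minimal sketch (column names and the bucket path are placeholders):

CREATE EXTERNAL TABLE <table_name>(
  `col1` string,
  `col2` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"')
LOCATION
  's3://bucket/folder'

Note that the OpenCSVSerde treats every column as a string, so cast values in your queries as needed.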

How to create a table over JSON files in AWS Athena

CREATE EXTERNAL TABLE <table_name>(
  `col1` string, 
  `col2` int, 
  `col3` date,      -- expects yyyy-MM-dd format
  `col4` timestamp, -- expects yyyy-MM-dd HH:mm:ss format
  `col5` boolean)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
LOCATION
  's3://bucket/folder'

NOTE: this SerDe expects each record to be a single JSON object on its own line.

To delete a table (for an external table, this removes only the table metadata; the underlying data stays in S3):

DROP TABLE <table_name>

How to create a table over JSON with nested values in AWS Athena

CREATE EXTERNAL TABLE <table_name>(
  `col1` struct<car1:string,car2:string,car3:string>, 
  `col2` int, 
  `col3` date,      -- expects yyyy-MM-dd format
  `col4` timestamp, -- expects yyyy-MM-dd HH:mm:ss format
  `col5` boolean)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
LOCATION
  's3://bucket/folder'
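Nested fields are then queried with dot notation. A minimal sketch against the table above ('some_value' is a placeholder):

SELECT col1.car1, col1.car2
FROM <table_name>
WHERE col1.car3 = 'some_value'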

There are a few more things Athena can do. It can transform your files into another Hadoop storage format, such as Parquet, Avro, or ORC. We will use this in our queries to reduce costs and improve performance.
Before we get into that, we need to understand CTAS.

CTAS – How to create a table from another table 

CREATE TABLE <table_name> AS
SELECT * FROM <database.table_name>

Now that we understand CTAS, we can apply transformation and compression in our queries. For example, we can run the following:

CREATE TABLE <table_name>
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/')
AS SELECT * 
FROM <database.table_name>;

Notice:

  1. Available compressions are: 4mc, snappy, lzo, lz4, bzip2, and gzip.
  2. Available storage formats: parquet, avro, and orc (see the sketch after this list).
  3. The table below compares the different file formats.
  4. More CTAS examples can be found in the AWS Athena documentation.
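For instance, a sketch of the same CTAS written as ORC with ZLIB compression (table and bucket names are placeholders):

CREATE TABLE <table_name>
WITH (
      format = 'ORC',
      orc_compression = 'ZLIB',
      external_location = 's3://bucket/folder/')
AS SELECT * 
FROM <database.table_name>;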

|                              | AVRO                                                                             | PARQUET                                                                                        | ORC                                                                                  |
| ---------------------------- | -------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| Schema evolution             | Most efficient, as the schema is stored as JSON with the file. Score: 10/10      | Expensive, as the schema needs to be read and merged across all the Parquet files. Score: 3/10 | Limited to adding new columns and a few cases of column type-widening. Score: 5/10   |
| Compression                  | Less efficient. Score: 5/10                                                      | Best with Snappy. Score: 7/10                                                                  | Best with ZLIB. Score: 5/10                                                          |
| Splittability                | Partial. Score: 7/10                                                             | Partial. Score: 7/10                                                                           | Full. Score: 10/10                                                                   |
| Row/column oriented          | Row                                                                              | Column                                                                                         | Column                                                                               |
| Optimized processing engines | Kafka                                                                            | Spark                                                                                          | Hive, Presto                                                                         |
| Optimized for read/write     | Write                                                                            | Read                                                                                           | Read                                                                                 |

Insert into

INSERT INTO <table_name>
SELECT * FROM <database.table_name>
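For example, a sketch that appends only a filtered slice of the source table (the date literal is a placeholder, and col3 refers to the date column from the earlier examples):

INSERT INTO <table_name>
SELECT *
FROM <database.table_name>
WHERE col3 = DATE '2021-01-01'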

Example of a partitioned table created by Athena:

CREATE TABLE <table_name>
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/',
      partitioned_by = ARRAY['col1', 'col2']
)
AS SELECT * 
FROM <database.table_name>;

NOTE: the partition columns must be the last columns in the SELECT list.
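Once the table is partitioned, filtering on the partition columns prunes what Athena scans, which is where the cost savings come from. A minimal sketch (the literal values are placeholders):

SELECT *
FROM <table_name>
WHERE col1 = 'some_value'
  AND col2 = 'another_value'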

See our post: AWS Athena – how to work with JSON

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with,

feel free to contact me via LinkedIn: