How to implement Apache Iceberg in AWS Athena
Author: Shneior Dicastro, 28.7.2022
Motivation
In this blog we are going to explain step by step how to implement Iceberg on AWS Athena.
What is Apache Iceberg?
Apache Iceberg is a new open table format that enables ACID operations on query engines that do not natively support them, such as AWS Athena.
What are ACID operations?
The ACID properties are the key guarantees a SQL database must provide to ensure that data is modified consistently, safely, and robustly. ACID is an acronym for the fundamental principles of a transactional system: Atomicity, Consistency, Isolation, and Durability.
Pros of AWS Athena & Iceberg combined:
- Simple table creation on top of S3.
- Full schema evolution, with simple UPDATE, DELETE, and ALTER TABLE functionality.
- Simple maintenance.
- Transactional consistency, with full read isolation and multiple concurrent writes.
- Time travel to verify changes between updates (see the example after this list).
- Partition evolution, enabling updates to partition schemes.
- Rollback to prior versions.
- Advanced planning and filtering capabilities for high performance.
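To make the time-travel bullet concrete, here is a hedged sketch of running an Iceberg time-travel query against the table created in Step 1 below, using the boto3 Athena client. The database name, S3 output location, and timestamp are placeholder assumptions, and the FOR TIMESTAMP AS OF clause assumes an Athena engine version that supports Iceberg time travel.

import boto3

# Hedged sketch: run an Iceberg time-travel query through the Athena API.
# Database name, output location, and timestamp are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT id, data, category
        FROM iceberg_table
        FOR TIMESTAMP AS OF TIMESTAMP '2022-07-27 00:00:00 UTC'
    """,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)

print("Started time-travel query:", response["QueryExecutionId"])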
Cons of AWS Athena & Iceberg combined:
- Each ACID operation creates a new snapshot, which slows your queries down.
- If snapshots are not cleared, the contents of the AWS S3 bucket will become inflated.
- Maintenance operations such as expiring snapshots require EMR and cannot be run directly from Athena.
ACID operations examples using Iceberg on AWS Athena
In the following examples we show how to use Iceberg with AWS Athena tables.
Step 1 : Create table example on AWS Athena using Iceberg
CREATE TABLE iceberg_table (id bigint, data string, category string)
PARTITIONED BY (category, bucket(16, id))
LOCATION 's3://bucket/iceberg_table/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

Insert a row:
INSERT INTO iceberg_table (id, data, category) VALUES (1, 'a', 'c1');
Step 2: Add column field example on AWS Athena using Iceberg
ALTER TABLE iceberg_table
ADD COLUMNS (points string);
Step 3: Delete column field example on AWS Athena using Iceberg
ALTER TABLE iceberg_table DROP COLUMN points;
Step 4: Update values example on AWS Athena using Iceberg
INSERT INTO iceberg_table (id, data, category) VALUES (1, 'a', 'c1');
UPDATE iceberg_table
SET data = 'b'
WHERE id = 1;
Step 5: Rename column field example on AWS Athena using Iceberg
ALTER TABLE iceberg_table CHANGE points points2 string;
Step 6: Change columns data type example on AWS Athena using Iceberg
ALTER TABLE iceberg_table ADD COLUMNS (points_int int);
ALTER TABLE iceberg_table CHANGE points_int points_int bigint;
Operational maintenance on AWS Athena using Iceberg
The following Iceberg command can be run directly from Athena. It rewrites the table's data files into a more optimized layout to improve query performance; however, it does not delete the historical snapshots, so they continue to accumulate.
Optimize
OPTIMIZE iceberg_table REWRITE DATA
USING BIN_PACK;
Why should we call the expire_snapshots procedure (requires EMR)?
Each write to an Iceberg table creates a new snapshot, or version, of a table. Snapshots can be used for time-travel queries, or the table can be rolled back to any valid snapshot.
Snapshots accumulate until they are expired by the expireSnapshots operation. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small.
How to integrate operational maintenance of AWS Athena and Iceberg via Airflow
In order to integrate Iceberg you need to periodically clean up the historical snapshots it creates automatically; otherwise your S3 bucket will become inflated. Below are examples of an Airflow DAG triggering an AWS EMR cluster and then using EMR steps to run a PySpark script.
Full Airflow DAG example that runs EMR and then runs PySpark code to clear Iceberg snapshots (note: Airflow 2 was used)
This Airflow DAG triggers an EMR cluster and runs the Iceberg PySpark script as a step to clear snapshots.
https://github.com/omidvd79/Big_Data_Demystified/blob/master/AWS/iceberg/full_airflow_example_clear_snapshots.py
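For orientation, here is a hedged sketch of what such a DAG can look like; it is not the DAG from the link above. The import paths assume a recent Airflow Amazon provider package, and the EMR release, instance types, IAM roles, and the S3 path of the PySpark script are placeholder assumptions.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder EMR cluster definition -- adapt to your account.
JOB_FLOW_OVERRIDES = {
    "Name": "iceberg-expire-snapshots",
    "ReleaseLabel": "emr-6.7.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "Market": "ON_DEMAND", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Placeholder EMR step that submits the snapshot-expiration PySpark script.
EXPIRE_SNAPSHOTS_STEP = [
    {
        "Name": "expire_iceberg_snapshots",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit",
                     "s3://bucket/scripts/iceberg_expire_snapshot_pyspark_example.py"],
        },
    }
]

with DAG(
    dag_id="iceberg_expire_snapshots",
    start_date=datetime(2022, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )

    add_step = EmrAddStepsOperator(
        task_id="add_expire_snapshots_step",
        job_flow_id=create_cluster.output,
        steps=EXPIRE_SNAPSHOTS_STEP,
    )

    wait_for_step = EmrStepSensor(
        task_id="wait_for_expire_snapshots_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_expire_snapshots_step')[0] }}",
    )

    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
        trigger_rule="all_done",
    )

    create_cluster >> add_step >> wait_for_step >> terminate_cluster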
Full example of PySpark code to clear Iceberg snapshots
This script expires all the historical snapshots of the Iceberg table. Note that it uses PySpark syntax. https://github.com/omidvd79/Big_Data_Demystified/blob/master/AWS/iceberg/iceberg_expire_snapshot_pyspark_example.py
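As a rough sketch of what such a job involves, the snippet below calls Iceberg's expire_snapshots stored procedure from PySpark. It assumes the Iceberg runtime jars are available on the EMR cluster and that the table lives in the AWS Glue Data Catalog; the catalog name, warehouse path, database, table, and retention values are placeholder assumptions, so refer to the linked script for the author's complete version.

from datetime import datetime, timedelta

from pyspark.sql import SparkSession

# Hedged sketch: expire old Iceberg snapshots via the expire_snapshots
# stored procedure. Catalog name, warehouse path, database, table, and
# retention values are placeholders -- adjust them to your environment.
spark = (
    SparkSession.builder
    .appName("iceberg_expire_snapshots")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://bucket/iceberg_warehouse/")
    .getOrCreate()
)

# Expire snapshots older than 7 days, keeping at least the 5 most recent ones.
cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL glue_catalog.system.expire_snapshots(
        table => 'my_database.iceberg_table',
        older_than => TIMESTAMP '{cutoff}',
        retain_last => 5
    )
""")

spark.stop()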
——————————————————————————————————————————
I put a lot of thought into these blogs, so that I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,
feel free to contact me via LinkedIn – Omid Vahdaty: