How save costs on AWS SQL Athena? Cost of using AWS SQL Athena is killing you?

consider the below

Did you switch to columnar? if not try the this link as reference: convert to columnar from raw based data.
Did you use parquet or orc? one of them take less space.
Did u use partitioning? did you use the correct partitioning for your query?
If using ORC consider using bucketing on top of partitioning, not sure if Athena supports this. confirming this. TBD.
If using highly nested data , perhaps AVRO will be your space saver, need to test this.
Did you compress via gzip? or something else? there more compressions supported, each have there own storage put print and scan put print….
Was your data spliced into chunks? if so try to change chunk size. more complicated but doable, again, could go either way – need to test this will your data.
Apply hints on the table may help on data scan in some cases. not sure if Athena supports this. confirming this. TBD.
If using multiple tables join, order of joins, may impact scanned data
Consider pre aggregating data if possible as part of your transformation/cleansing process. even if it is on each (using window table, each row will hold aggregation tables. )
Consider pre calculating table with heavy group by on raw-data. i.e have the data already calculated on s3, and have your production user/ end user query that table.
If your application permits this – using caching layer like elastic cache. remember Athena has it own caching as well (results are saved for 24 hours)
have a data engineer review each query, to make sure data scan is minimised. for example
1. Minimise the columns in the results set… a results set of longs strings maybe be very costly.
2. where possible switch strings to ints, this will minimise footprint on storage greatly.
3. if possible switch from bigint to tinyint. this will save some disk space as well. notice the list of supported data types: https://prestodb.io/docs/current/language/types.html
Consider EMR with presto and task group with spots + auto scaling – you could have a tighter control on max budget that will be used, and in some cases it may be faster running on AWS EMR.
Use bucketing ( each partition can be divided into smaller parts. official docs: https://docs.aws.amazon.com/athena/latest/ug/bucketing-vs-partitioning.html

16.Benchmarking different types of Hadoop / Spark / Hive / Presto storage and compressions types.

Using this blog on convert raw based data to columnar on tpch data , I created new destinations tables , and converted to different type storage and compression. For completeness I used popular ORC and parquet the following compressions types GZIP, GZIP, Deflate, LZO, Zlib. Afterwards I ran a simple preview query on each of the 5 compressions below:

SELECT * FROM “tpch_data”.”lineitem_compresation_name” limit 10

Results on data read via Athena on cold queries (data scanned only once, after 72 hours):

Parquet, GZIP: (Run time: 4.8 seconds, Data scanned: 84MB)
Parquet, BZIP: (Run time: 6.0 seconds, Data scanned: 242MB)
Parquet, Deflate: (Run time: 5.81 seconds, Data scanned: 242MB)
Parquet, LZO: (Run time: 9 seconds, Data scanned: 15GB)
ORC, Zlib : (Run time: 10.1 seconds, Data scanned: 11GB)
Text,GZIP: (Run time: 5.9 seconds, Data scanned: 49MB)

Results on data read via Athena on hot queries (data scanned several times):

Parquet, GZIP: (Run time: 6 seconds, Data scanned: 2.5GB)
Parquet, BZIP: (Run time: 6.39 seconds, Data scanned: 4.84GB)
Parquet, Deflate: (Run time: 4.88 seconds, Data scanned: 5.31GB)
Parquet, LZO: (Run time: 5.77 seconds, Data scanned: 8.97GB)
ORC, Zlib : (Run time: 6.5 seconds, Data scanned: 5.07GB)

The results are not surprising as there are many compressions, and each can behave differently on different data types (strings, ints, floats, rows, columns) . notice the scan time is more or less the same.

I ran a columnar test on a specific int column instead of all columns in this queries, and the results were the same, as all compressions read the same amount of giga scanning this specific column. again notice the different running time.

SELECT avg( l_orderkey) FROM “tpch_data”.”lineitem_parquet”

Parquet, GZIP: (Run time: 6.84 seconds, Data scanned: 3.77GB)
Parquet, BZIP: (Run time: 9.03 seconds, Data scanned: 3.77GB)
Parquet, Deflate: (Run time: 9 seconds, Data scanned: 3.77GB)
Parquet, LZO: (Run time: 9.94 seconds, Data scanned: 3.77GB)
ORC, Zlib : (Run time: 11.27 seconds, Data scanned: 3.77GB)

Again, This test should not convince you to prefer one storage type over the other, nor to prefer one compression type over the other. Test it on you data, and understand the differences in your use case.

and don’t forget monthly AWS s3 Storage costs….

Old data can be reduced to reduced availability.
how much data is read in your queries. if your invoice of Athena is 500$, this means you are reading 100 TB of compressed data. so… can you minimised the amount of data read in your query – or can your reduce the history from 2 years to 1 year ?

Conclusion on cost reduction using AWS SQL Athena

As you can see, you could be saving a 50% or more. easily on your AWS SQL Athena costs simply by changing to the correct compression.
Check the running time, be sure it is a non issues for your use case.
regarding the text vs parquet, be sure to understand the use-case, not always you need to extract all the rows, thus column based storage, will be more use-full.
if you have any more tips to reduce costs – please contact me if you are willing to share 🙂

Resources:

https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

https://docs.aws.amazon.com/athena/latest/ug/service-limits.html

https://docs.aws.amazon.com/athena/latest/ug/work-with-data.html

Need to learn more about aws big data?

Contact me via linked in Omid Vahdaty
website: https://amazon-aws-big-data-demystified.ninja/
Join our meetup, FB group and youtube channel
- Join our meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our facebook group https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- subscribe to our youtube channel https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

16 Tips to reduce costs on AWS SQL Athena

16.Benchmarking different types of Hadoop / Spark / Hive / Presto storage and compressions types.

Conclusion on cost reduction using AWS SQL Athena

Resources:

Need to learn more about aws big data?

3 thoughts on “16 Tips to reduce costs on AWS SQL Athena”

Leave a ReplyCancel reply

16.Benchmarking different types of Hadoop / Spark / Hive / Presto storage and compressions types.

Conclusion on cost reduction using AWS SQL Athena

Resources:

Need to learn more about aws big data?

3 thoughts on “16 Tips to reduce costs on AWS SQL Athena”

Leave a ReplyCancel reply

Discover more from Big Data Demystified