
Error: HIVE_PATH_ALREADY_EXISTS: Target directory for table | error while CTAS in AWS Athena

Author: Omid Vahdaty 11.2.2019

I got this error while working in Athena.

The good news: CTAS exists in Athena:
https://docs.aws.amazon.com/athena/latest/ug/ctas.html

The bad news: the error messages still need some work to become human readable:

"HIVE_PATH_ALREADY_EXISTS: Target directory for table 'sampledb.ctas_json_partitioned' already exists: s3://myBucket/."

To run the CTAS command in Athena, it is always preferable to use an explicit location. If you don't specify one, Athena uses a default location of the form:

s3://aws-athena-query-results-<account-id>-<region>/<query-name>/<year>/<month>/<date>/<query-id>/

For example, suppose we run the following CTAS query with the location 's3://mybucket/my_ctas_table/':

CREATE TABLE my_ctas_table WITH ( external_location = 's3://mybucket/my_ctas_table/' ) AS SELECT ...

In this example, we'd first want to make sure the location exists and is empty.

To check the location, use the CLI command:

aws s3 ls s3://aws-athena-query-results-783591744239-eu-west-1/Unsaved/2019/02/03/tables/ --recursive
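Putting the pieces together, a minimal sketch of a CTAS with an explicit location (the bucket, table, and column names here are hypothetical, not from a real setup):

```sql
-- Hypothetical end-to-end sketch: CTAS writing to an explicit, empty S3 prefix.
CREATE TABLE my_ctas_table
WITH (
  external_location = 's3://mybucket/my_ctas_table/',
  format = 'PARQUET'
) AS
SELECT col_a, col_b
FROM sampledb.source_table;
```

If this still fails with HIVE_PATH_ALREADY_EXISTS, empty the prefix first (for example with aws s3 rm --recursive) and rerun.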

——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me via LinkedIn:

https://www.linkedin.com/in/omid-vahdaty/


Gibberish instead of Hebrew in AWS Athena?

The problem is usually the original encoding of the source file.

Sometimes, though, it is a line-ending problem caused by a different OS; run dos2unix on the file, and try not to open the files on Windows systems.

brew install dos2unix

dos2unix filename.csv

 

Check the file encoding from the Linux command line:

file -I filename.csv

The result should look like the line below. An encoding such as iso-8859-8 or UTF-8 can represent Hebrew, so if you see anything else (like the iso-8859-1 here), your file is probably in the wrong encoding:

text/plain; charset=iso-8859-1

Then the challenge is the conversion.

Three ways to convert your gibberish text file into Hebrew:

  1. Microsoft Excel: rename the file to filename.txt and open it; a wizard will let you choose the encoding.
  2. Linux CMD:

    iconv -f iso-8859-1 -t utf-8 < file > file.new

  3. An online UTF-8 encoding converter I used: https://subtitletools.com/convert-text-files-to-utf8-online

 

Try testing the file locally: if you see Hebrew on your desktop, you should be fine in Athena.

Have fun!

—————————————————————————————————–







AWS Big Data Demystified – Part 2

The video (don't forget to subscribe to our YouTube channel to help our community):

 

Lecture slides:

Some follow-up questions from meetup attendees via email (answers may be inlined):

1. I did all the steps as you guided in the EMR lecture (#2), including installing Presto and ticking the Glue checkbox (for all: Presto, Spark and Hive). Then I created a schema, a table, and a parquet table (in Athena) and ran "insert overwrite" in Hive, exactly as you guided in the lecture. Now when I open Hue, I see the new DB (tpch_data) in Hive, but in Presto I don't see this DB (see the attachment). Moreover, Presto SQLs do not run (never finish) when I run Presto from Hue. Might the problem be that I created the DB from Athena?

If you create the table in Hive, you have to add the database name before the table name; otherwise it will be created in Hive's local (default) database. I have not used Presto except for testing (when I created the architecture); I believe it is a common mistake we make here in my team as well. If you have a schema created in Athena called sampleDB, then in Hive (or wherever you create a table) you must use:

create table sampleDB.tableName (...)

This should do the trick. Notice the list of databases on the left in Hue (default, tpchdata); open them to see where the table was created. If not, let me know; we can resolve this via screen sharing.
2. Zeppelin is not loading. When I click the Zeppelin link on the AWS EMR page, the page does not load. All the other apps load fine; only Zeppelin is stuck.

Did you open the security group for port 8890? Are you using a tunnel? Did you use FoxyProxy to enable access in your browser? Did you configure anything special for Zeppelin? Could it be a firewall issue on your office side? Try restarting Zeppelin: sudo stop zeppelin, then sudo start zeppelin. If all of the above fails, screen sharing will help 🙂
3. ARCHITECTURE
1. Can you share your retention policies for the data? According to your architecture you:
1. gzip text files and load them to dedicated S3 raw-data buckets (gzipped)
2. clean, transform to parquet (via Hive), and save the parquet files in another S3 bucket hierarchy
3. model, enrich, flatten, etc., and again I guess you have yet another S3 bucket hierarchy here

Bottom line, you have 3 logical layers of data:
a. original text
b. original, cleaned parquet, uncompressed
c. modeled parquet, gzipped

[Same same, but different. I keep one bucket per data source, with 3 "folders" inside, one per layer you mentioned above. This is due to different restrictions I have from the business side on each data source, for example different encryption / retention / access-management requirements.]

The questions are:

3.1. Do you apply the same retention policy to all layers?
Yes and no. No, because I treat raw data differently: I delete it after I finish the initial transformation to parquet and gzip (no cleansing or changes), and there are different encryption policies for each data source. Yes, in that once the data is transformed and modeled, I keep it for 120 days (the cookie life cycle is 90 days).

3.2. How long do you keep it?
120 days, per use case.

3.3. Do you use / plan to use AWS Glacier for old data?
No need for COLD data in my use case 🙂 Check the costs…

3.4. How do you handle GDPR? If someone asks to delete data that is 6 months old, you cannot delete from parquet, so the only way is to delete the entire file and regenerate it. For this you need some kind of index to find the files targeted for deletion (for all 3 layers). Do you have something like that?
Keep the data partitioned per day, put an S3 lifecycle policy on the bucket for 180 days (per specific path), run msck daily, and use Hive external tables + dynamic partitioning for inserts. This way, when you delete data at the bucket level and run msck, the table is "updated" with the removed partitions, and the jobs do not fail.

Note: there is no insert overwrite or dynamic partitioning in Presto…

As for GDPR: we started working on obfuscating the data and deleting the raw data, keeping the obfuscated data anonymized, adding opt-in and opt-out procedures, and checking whether the user is an EU citizen. We have much work to do. [See lecture 4…]
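A rough HiveQL sketch of the partition-per-day flow described above (the table, bucket, and column names are hypothetical):

```sql
-- Hypothetical external table, partitioned per day so that S3 lifecycle
-- deletions map cleanly onto dropped partitions.
CREATE EXTERNAL TABLE events (user_id STRING, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://mybucket/events/';

-- Dynamic-partitioning insert (Hive, not Presto).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT user_id, payload, dt FROM raw_events;

-- Run daily so the metastore stays in sync with what exists on S3
-- (per the flow described above).
MSCK REPAIR TABLE events;
```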
 
4. Presto+Zeppelin vs. Athena
I understand all the advantages of a managed service, but in this specific case I don't see many. Install a dedicated EMR cluster with Presto, Zeppelin and Ganglia, working with external tables, data in S3, no Hadoop, plus autoscaling. Why do I need Athena? What am I missing?

This is a good question. Yes, both are applicable; it is a matter of cost/performance ratio and more. If you have many queries per day, the cost of Athena will be high, as the cost is mainly per TB read per day. However, if you have the same query running over and over within 24 hours, the results are cached, so the cost will be lower. So at the end of the day it is a matter of use case, expected concurrency, budget, performance targets, and simplicity vs. flexibility (managed notebooks are not a feature of Athena, but for ad hoc queries or a minimal BI backend it should do the trick).

What will be cheaper? We would need to know how many TBs are read to answer that properly, as Athena uses on-demand resources while EMR can use spot instances and auto scaling. What will be faster? It could go either way depending on the cluster, as EMR resources are dedicated.

Here is the link (you are probably familiar with it) on how to integrate Presto with Zeppelin.
The crazy thing is that with Presto you can query Kafka and Elastic, so theoretically you can build more holistic solutions without Athena. Also, less cloud lock-in.

Sounds great, assuming you need to query Kafka and Elastic! It will take a bit longer to implement, as there will probably be some learning curve, but other than that it sounds great. I would not worry about vendor lock-in 🙂

Technical comments from the audience sent via email:

Hi Omid, in your #2 presentation, in the slides about Hive, you use 'org.openx.data.jsonserde.JsonSerDe'. This is a very old SerDe, and indeed you mentioned that in the slide. But the problem is that if you create a table with this SerDe in Hive, you can NOT work with that table in Presto. When you run any query in Presto on such a table, you get the exception:

deserializer does not exist: org.openx.data.jsonserde.JsonSerDe at com.facebook.presto.jdbc.PrestoResultSet.resultsException(PrestoResultSet.java

The correct SerDe to use is 'org.apache.hive.hcatalog.data.JsonSerDe'. It is in the Presto classpath, so Presto works properly when you switch to it.
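For illustration, a hypothetical table definition using the Presto-compatible SerDe (the table name, columns, and location are assumptions):

```sql
-- Hypothetical JSON table readable from both Hive and Presto.
CREATE EXTERNAL TABLE my_json_table (
  id BIGINT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://mybucket/json-data/';
```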

————————————————-

Special thanks to vlady for the feedback 🙂


——————————————————————————————————————————



16 Tips to reduce costs on AWS SQL Athena

How do you save costs on AWS SQL Athena? Is the cost of using AWS SQL Athena killing you?

Consider the tips below:

  1. Did you switch to columnar storage? If not, try this link as a reference: convert row-based data to columnar.
  2. Did you use Parquet or ORC? Depending on your data, one of them will take less space.
  3. Did you use partitioning? Did you use the correct partitioning for your queries?
  4. If using ORC, consider using bucketing on top of partitioning. Not sure if Athena supports this; still confirming. TBD.
  5. If using highly nested data, perhaps Avro will be your space saver; this needs testing.
  6. Did you compress via gzip, or something else? More compression types are supported, each with its own storage footprint and scan footprint.
  7. Was your data split into chunks? If so, try changing the chunk size. More complicated but doable; again, it could go either way, so test this with your data.
  8. Applying hints on the table may help reduce the data scanned in some cases. Not sure if Athena supports this; still confirming. TBD.
  9. If you join multiple tables, the order of the joins may impact the data scanned.
  10. Consider pre-aggregating data, if possible, as part of your transformation/cleansing process, even per row (using window functions, each row can hold aggregated values).
  11. Consider pre-calculating tables with heavy GROUP BYs on the raw data, i.e. have the data already calculated on S3, and have your production/end users query that table.
  12. If your application permits it, use a caching layer like ElastiCache. Remember that Athena has its own caching as well (results are saved for 24 hours).
  13. Have a data engineer review each query to make sure the data scanned is minimized. For example:
    1. Minimize the columns in the result set; a result set of long strings may be very costly.
    2. Where possible, switch strings to ints; this will greatly minimize the storage footprint.
    3. If possible, switch from bigint to tinyint; this will save some disk space as well. Notice the list of supported data types: https://prestodb.io/docs/current/language/types.html
  14. Consider EMR with Presto and a task group with spot instances + auto scaling; you get tighter control over the maximum budget used, and in some cases it may be faster to run on AWS EMR.
  15. Use bucketing (each partition can be divided into smaller parts). Official docs: https://docs.aws.amazon.com/athena/latest/ug/bucketing-vs-partitioning.html
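As a sketch of tips 3 and 15 combined, here is a hypothetical Athena CTAS that both partitions and buckets (all table, column, and bucket names are assumptions; see the bucketing docs linked above):

```sql
-- Hypothetical CTAS: partition by day, bucket each partition by customer.
CREATE TABLE sales_bucketed
WITH (
  external_location = 's3://mybucket/sales_bucketed/',
  format = 'PARQUET',
  partitioned_by = ARRAY['dt'],
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 16
) AS
SELECT customer_id, amount, dt  -- partition column must come last
FROM sampledb.sales;
```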

16. Benchmark different Hadoop / Spark / Hive / Presto storage and compression types.

Using this blog on converting row-based data to columnar on TPC-H data, I created new destination tables and converted them to different storage and compression types. For completeness I used the popular ORC and Parquet formats with the following compression types: GZIP, BZIP, Deflate, LZO, Zlib. Afterwards I ran a simple preview query on each of the compressions below:

SELECT * FROM "tpch_data"."lineitem_compresation_name" limit 10
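The converted tables themselves can be produced with a CTAS along these lines (a hedged sketch; the table names, location, and source table are assumptions):

```sql
-- Hypothetical conversion of a row-based table to Parquet + GZIP.
CREATE TABLE tpch_data.lineitem_parquet_gzip
WITH (
  external_location = 's3://mybucket/lineitem_parquet_gzip/',
  format = 'PARQUET',
  parquet_compression = 'GZIP'
) AS
SELECT * FROM tpch_data.lineitem;
```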

Results on data read via Athena on cold queries (data scanned only once, after 72 hours):

  • Parquet, GZIP: (Run time: 4.8 seconds, Data scanned: 84MB)
  • Parquet, BZIP: (Run time: 6.0 seconds, Data scanned: 242MB)
  • Parquet, Deflate: (Run time: 5.81 seconds, Data scanned: 242MB)
  • Parquet, LZO: (Run time: 9 seconds, Data scanned: 15GB)
  • ORC, Zlib : (Run time: 10.1 seconds, Data scanned: 11GB)
  • Text,GZIP:  (Run time: 5.9 seconds, Data scanned: 49MB)

Results on data read via Athena on hot queries (data scanned several times):

  1. Parquet, GZIP: (Run time: 6 seconds, Data scanned: 2.5GB)
  2. Parquet, BZIP: (Run time: 6.39 seconds, Data scanned: 4.84GB)
  3. Parquet, Deflate: (Run time: 4.88 seconds, Data scanned: 5.31GB)
  4. Parquet, LZO: (Run time: 5.77 seconds, Data scanned: 8.97GB)
  5. ORC, Zlib : (Run time: 6.5 seconds, Data scanned: 5.07GB)

The results are not surprising, as there are many compression types and each can behave differently on different data (strings, ints, floats, rows, columns). Notice that the run time is more or less the same.

I then ran a columnar test on a specific int column instead of all columns, and the results were the same across formats: all compressions read the same number of gigabytes when scanning this specific column. Again, notice the different running times.

SELECT avg(l_orderkey) FROM "tpch_data"."lineitem_parquet"

  • Parquet, GZIP:  (Run time: 6.84 seconds, Data scanned: 3.77GB)
  • Parquet, BZIP:  (Run time: 9.03 seconds, Data scanned: 3.77GB)
  • Parquet, Deflate: (Run time: 9 seconds, Data scanned: 3.77GB)
  • Parquet, LZO: (Run time: 9.94 seconds, Data scanned: 3.77GB)
  • ORC, Zlib : (Run time: 11.27 seconds, Data scanned: 3.77GB)

Again, this test should not convince you to prefer one storage type over another, nor one compression type over another. Test it on your data, and understand the differences in your use case.

And don't forget monthly AWS S3 storage costs:

  1. Old data can be moved to a reduced-availability storage class.
  2. Consider how much data your queries read. If your Athena invoice is $500, then at roughly $5 per TB scanned you are reading about 100 TB of compressed data. So, can you minimize the amount of data read by your queries, or reduce the history from 2 years to 1 year?
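To sanity-check that arithmetic, here is a tiny Python sketch (assuming a list price of $5 per TB scanned, which may vary by region; check current pricing):

```python
# Rough Athena cost estimate: pricing is per TB of data scanned.
ATHENA_PRICE_PER_TB_USD = 5.0  # assumption; verify against current regional pricing

def athena_cost_usd(tb_scanned: float) -> float:
    """Estimated cost in USD for a given amount of data scanned (in TB)."""
    return tb_scanned * ATHENA_PRICE_PER_TB_USD

# A $500 monthly invoice corresponds to roughly 100 TB scanned.
print(athena_cost_usd(100))  # → 500.0
```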

Conclusion on cost reduction using AWS SQL Athena

  1. As you can see, you could easily save 50% or more on your AWS SQL Athena costs simply by switching to the right compression.
  2. Check the running time, and be sure it is a non-issue for your use case.
  3. Regarding text vs. parquet, be sure to understand the use case: you don't always need to extract all the columns, and in those cases column-based storage will be more useful.
  4. If you have any more tips to reduce costs, please contact me if you are willing to share 🙂

Resources:

https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

https://docs.aws.amazon.com/athena/latest/ug/service-limits.html

https://docs.aws.amazon.com/athena/latest/ug/work-with-data.html

Need to learn more about AWS big data?



——————————————————————————————————————————





When should we use EMR and when should we use Redshift? EMR vs. Redshift

Use Redshift when

  1. Traditional data warehouse
  2. When you need the data relatively hot for analytics, such as BI
  3. When there is no data engineering team
  4. When your queries require joins
  5. When you need a cluster 24x7
  6. When your data types are simple, i.e. no Arrays or Structs
  7. When the data has no nested JSONs
  8. When you have a petabyte-scale database
  9. When you want to analyze massive amounts of data (Spectrum)
  10. When you need update/delete
  11. When you require an ACID DBMS

Use EMR (SparkSQL, Presto, Hive) when

  1. When you need a transient cluster, for nightly or hourly automation
  2. When compute elasticity is important (auto scaling on task nodes)
  3. When cost is important: spot instances
  4. When your data scales to a few hundred TBs
  5. When you want to decouple compute and storage (external tables + task nodes + auto scaling); this is a cloud-architecture best practice
  6. When you require more flexibility
    1. Complex partitions + dynamic partitioning + insert overwrite (click the link for an example)
    2. Complex data types
      1. Structs
      2. Arrays <–> nested JSON
    3. Orchestration built in, such as Oozie, although Airflow is more common
    4. Notebook built in: mix your code with SQL via Zeppelin

Watch this meetup video to understand in depth the Big Data architecture considerations in AWS.

Please check the Redshift-specific FAQ below:

Q: When would I use Amazon Redshift vs. Amazon EMR?
Q: Can Redshift Spectrum replace Amazon EMR?
Q: Can I use Redshift Spectrum to query data that I process using Amazon EMR?

— Reference : Redshift faq
https://aws.amazon.com/redshift/faqs/

Please check the EMR-specific FAQ below:

Q: What can I do with Amazon EMR?
Q: Who can use Amazon EMR?
Q: What can I do with Amazon EMR that I could not do before?
Q: What is the data processing engine behind Amazon EMR?
Q: What is Apache Spark?
Q: What is Presto?

— Reference : EMR faq
https://aws.amazon.com/emr/faqs/

** Point 2. I am listing other resources which can help you understand Redshift and EMR use cases better.

— Reference :
AWS redshift related case studies > Look for case study section :
https://aws.amazon.com/redshift/getting-started/
https://pages.awscloud.com/redshift-proof-of-concept-request.html

— Reference :
AWS EMR related case studies > Look for case study section :
https://aws.amazon.com/emr/
https://pages.awscloud.com/GLOBAL_OT_emr-poc_20170530.html

** Point 3. I have checked some AWS blogs which show how EMR and Redshift can be used together in specific use cases.

— How I built a data warehouse using Amazon Redshift and AWS services in record time
https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

— Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP
https://aws.amazon.com/blogs/big-data/build-a-healthcare-data-warehouse-using-amazon-emr-amazon-redshift-aws-lambda-and-omop/

— Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/

Hope this information helps in understanding EMR and Redshift use cases better.

Need to learn more about AWS big data?



——————————————————————————————————————————
