In Athena, perform the following steps:
- Create a schema.
- Create a table with a pipe ("|") field separator, with its location pointing to the public AWS S3 bucket that holds the TPC-H 1TB raw data. Notice the storage type is GZIP-compressed text.
- Test the created table by running any query with LIMIT 2 to verify data correctness.
- Create a destination table with Parquet storage, GZIP compression, and your own bucket path. Notice the slight change in data types from date to timestamp; this is due to Parquet.
- Count the number of rows in the source table via Athena and measure the running time.
Switch to Hive to run the following:
- Insert-overwrite the data from the source table into the destination table (this will take a few hours, depending on your cluster size). This query should run on Hive; if you are not using partitions, you can use SparkSQL as well.
- Notice: SparkSQL and Hive behave differently with partitions, so be advised (see the settings sketch after the INSERT OVERWRITE query below).
Switch back to Athena:
- Count the number of rows in the destination table via Athena.
- Measure the running time and compare it with the running time on the source table.
The Queries:
Creating the schema:
CREATE SCHEMA IF NOT EXISTS tpch_data;
Creating the source raw data table on the S3 bucket:
CREATE external TABLE tpch_data.lineitem (
l_orderkey int ,
l_partkey int ,
l_suppkey int ,
l_linenumber int ,
l_quantity double ,
l_extendedprice double ,
l_discount double ,
l_tax double ,
l_returnflag string ,
l_linestatus string ,
l_shipdate date ,
l_commitdate date ,
l_receiptdate date ,
l_shipinstruct string ,
l_shipmode string ,
l_comment string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://redshift-demo/tpc-h/1024/lineitem/';
Confirm the data is loaded correctly (if you don't see data, you probably missed something in the CREATE TABLE, such as the location, delimiter, etc.):
select * from tpch_data.lineitem limit 2;
select count(*) from tpch_data.lineitem;
Create the destination table in Parquet storage (notice the end of the CREATE statement: STORED AS, LOCATION, compression):
CREATE external TABLE tpch_data.lineitem_parquet (
l_orderkey int ,
l_partkey int ,
l_suppkey int ,
l_linenumber int ,
l_quantity double ,
l_extendedprice double ,
l_discount double ,
l_tax double ,
l_returnflag string ,
l_linestatus string ,
l_shipdate timestamp ,
l_commitdate timestamp ,
l_receiptdate timestamp ,
l_shipinstruct string ,
l_shipmode string ,
l_comment string
)
STORED AS PARQUET
LOCATION 's3://myBucket-demo/tpc-h/1024/lineitem/'
TBLPROPERTIES ("parquet.compression"="GZIP");
Convert to columnar format via Hive only(!):
INSERT OVERWRITE TABLE tpch_data.lineitem_parquet SELECT * FROM tpch_data.lineitem;
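The lineitem_parquet table above is not partitioned, so the plain INSERT OVERWRITE is all you need. If you do go with a partitioned destination table, the sketch below shows the session settings you would typically need before the insert; the setting names are standard Hive/Spark configuration, but the partitioned scenario itself is an assumption and not part of the walk-through above.
-- Hive: enable dynamic partitioning before the INSERT OVERWRITE (only needed for a partitioned target)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- SparkSQL: the same INSERT OVERWRITE works for the non-partitioned table above, but overwriting
-- a partitioned table is controlled by a different setting (Spark 2.3+):
SET spark.sql.sources.partitionOverwriteMode=dynamic;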
Confirm the target data on Athena again:
select count(*) from tpch_data.lineitem_parquet;
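To see the columnar benefit beyond a plain count, you can run the same ad hoc aggregation on both tables and compare the running time and the data scanned that Athena reports. The query below is only an illustrative example added here, not part of the original walk-through:
select l_returnflag, l_linestatus, count(*) as cnt, sum(l_extendedprice) as sum_price, avg(l_discount) as avg_disc
from tpch_data.lineitem
group by l_returnflag, l_linestatus;
-- the same aggregation on the columnar table should scan far less data:
select l_returnflag, l_linestatus, count(*) as cnt, sum(l_extendedprice) as sum_price, avg(l_discount) as avg_disc
from tpch_data.lineitem_parquet
group by l_returnflag, l_linestatus;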
Optional – read the data from Redshift Spectrum without any insert:
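A minimal sketch of what that could look like, assuming the table is registered in the Glue/Athena data catalog and you have an IAM role that Redshift can assume for Spectrum (the role ARN below is a placeholder):
CREATE EXTERNAL SCHEMA tpch_spectrum
FROM DATA CATALOG
DATABASE 'tpch_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole';

select count(*) from tpch_spectrum.lineitem_parquet;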
Need to learn more about AWS big data?
- Contact me via LinkedIn: Omid Vahdaty
- Website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our Facebook group: https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- Subscribe to our YouTube channel: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
——————————————————————————————————————————
I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:
https://www.linkedin.com/in/omid-vahdaty/