
Converting TPC-H data from row-based to columnar via Hive or Spark SQL, and running ad hoc queries via Athena on the columnar data

In Athena, perform the following steps:

  1. Create the schema.
  2. Create a table with a pipe separator and a location pointing to the AWS S3 public bucket holding the TPC-H 1 TB raw data. Notice the storage type is GZIP-compressed text.
  3. Test the table by running any query with LIMIT 2 to verify data correctness.
  4. Create a destination table with Parquet storage, GZIP compression, and your own bucket path. Notice the slight change in data types from date to timestamp; this is due to Parquet.
  5. Count the number of rows via Athena on the source table and measure the running time (a sample ad hoc query for timing follows this list).
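For the timing comparison, a plain count is a start, but an aggregation that touches several columns shows the columnar benefit more clearly. Below is a sketch of a TPC-H Q1-style query added for illustration (not part of the original walkthrough); run it against both the source and destination tables and compare runtimes:

-- Ad hoc timing query (TPC-H Q1 style); swap in tpch_data.lineitem_parquet
-- on the second run to compare against the columnar table.
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)                         AS sum_qty,
       SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       COUNT(*)                                AS count_order
FROM tpch_data.lineitem
WHERE l_shipdate <= DATE '1998-09-01'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;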

Switch to Hive and run the following:

  • Insert overwrite the data from the source table into the destination table (this will take a few hours, depending on your cluster size). This query should run on Hive; if you are not using partitions, you may use Spark SQL as well.
  • Notice: Spark SQL and Hive behave differently with partitions, so be advised (see the settings sketch after this list).
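To make the partitioning caveat concrete, here is a minimal sketch of the settings involved, assuming a destination table partitioned by some column (the tables in this post are unpartitioned, so treat this purely as a reference):

-- Hive: dynamic-partition inserts require these before INSERT OVERWRITE.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Spark SQL 2.3+: INSERT OVERWRITE replaces ALL partitions by default ("static" mode);
-- switch to dynamic mode to overwrite only the partitions produced by the query.
SET spark.sql.sources.partitionOverwriteMode=dynamic;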

Switch back to Athena:

  • Count the number of rows via Athena on the destination table.
  • Measure the running time and see the difference between running on the source table and on the destination table.

 

The Queries:

Creating the schema

CREATE SCHEMA IF NOT EXISTS tpch_data;

Creating the source raw-data table on the S3 bucket

-- The files are GZIP-compressed text; Athena detects the .gz extension automatically.
CREATE EXTERNAL TABLE tpch_data.lineitem (
 l_orderkey int,
 l_partkey int,
 l_suppkey int,
 l_linenumber int,
 l_quantity double,
 l_extendedprice double,
 l_discount double,
 l_tax double,
 l_returnflag string,
 l_linestatus string,
 l_shipdate date,
 l_commitdate date,
 l_receiptdate date,
 l_shipinstruct string,
 l_shipmode string,
 l_comment string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://redshift-demo/tpc-h/1024/lineitem/';

 

Confirm the data loaded correctly (if you don't see data, you probably missed something in the CREATE TABLE, such as the location or the delimiter):

select * from tpch_data.lineitem limit 2;

select count(*) from tpch_data.lineitem;
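If the SELECT comes back empty, one quick sanity check (an optional step added here; not in the original post) is to dump the registered definition and eyeball the location and delimiter:

SHOW CREATE TABLE tpch_data.lineitem;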

 

Create the destination table in Parquet storage (notice the end of the CREATE statement: STORED AS, LOCATION, and the compression property)

CREATE EXTERNAL TABLE tpch_data.lineitem_parquet (
 l_orderkey int,
 l_partkey int,
 l_suppkey int,
 l_linenumber int,
 l_quantity double,
 l_extendedprice double,
 l_discount double,
 l_tax double,
 l_returnflag string,
 l_linestatus string,
 l_shipdate timestamp,
 l_commitdate timestamp,
 l_receiptdate timestamp,
 l_shipinstruct string,
 l_shipmode string,
 l_comment string
)
STORED AS PARQUET
LOCATION 's3://myBucket-demo/tpc-h/1024/lineitem/'
TBLPROPERTIES ("parquet.compression"="GZIP");

 

Convert to columnar via Hive only (!)

INSERT OVERWRITE TABLE tpch_data.lineitem_parquet
SELECT * FROM tpch_data.lineitem;
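If your cluster has Spark and the tables are unpartitioned (the case covered by the Spark SQL option above), an equivalent sketch in the spark-sql shell might look like this. The codec setting is a belt-and-braces assumption: depending on the Spark version, the table's parquet.compression property may already be honored.

-- Spark SQL sketch (unpartitioned tables; see the partitioning caveat above).
SET spark.sql.parquet.compression.codec=gzip;
INSERT OVERWRITE TABLE tpch_data.lineitem_parquet
SELECT * FROM tpch_data.lineitem;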

Confirm the target data on Athena again:

select count(*) from tpch_data.lineitem_parquet;  

 

Optional: read the data from Redshift Spectrum without an insert:

https://amazon-aws-big-data-demystified.ninja/2018/05/15/want-to-select-data-on-redshift-spectrum-which-was-created-at-athena/
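For orientation, the Spectrum route boils down to mounting the same Glue/Athena database as an external schema in Redshift. A minimal sketch, with a hypothetical schema name and a placeholder IAM role ARN (see the linked post for the full walkthrough):

CREATE EXTERNAL SCHEMA tpch_spectrum
FROM DATA CATALOG
DATABASE 'tpch_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'  -- placeholder ARN
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Then query the Parquet table directly from Redshift, with no load step:
SELECT COUNT(*) FROM tpch_spectrum.lineitem_parquet;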

Need to learn more about AWS big data?



——————————————————————————————————————————

I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or if you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/


