Exporting data from Google BigQuery into AWS Athena
Author: Omid Vahdaty 27.5.2018
Prerequisites
- GCP basics – create a machine, or use Cloud Shell.
- BigQuery basics – exporting data.
- A GCP instance.
- gsutil – the Google Cloud Storage utility (cp/ls, etc.).
- An S3 bucket with a user (access key + secret key).
- Avro Tools (the avro-tools JAR).
- Java
The motivation
The goal is to export the data together with its schema, since the structure is highly nested and includes complex data types.
Steps to export the data from Google BigQuery into AWS S3 + EMR Hive or Athena:
- From the GCP machine, run the export from BigQuery to Google Cloud Storage (note the destination format):
bq --location=US extract --destination_format AVRO --compression SNAPPY 'gap:123.ga_sessions_20190129' 'gs://your-gs-bucket/file*.avro'
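The * in the destination URI lets BigQuery shard the export into multiple Avro files. Optionally, you can verify the exported shards with a listing (same placeholder bucket as above):
gsutil ls gs://your-gs-bucket/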
- From the GCP machine, configure gsutil with your AWS credentials as follows:
1. gcloud auth login
2. Edit the ~/.boto file and add your AWS access key and secret key (see the snippet below).
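A minimal sketch of the relevant ~/.boto section, with placeholder key values; gsutil reads the AWS keys from its [Credentials] section when it accesses s3:// URLs:
# AWS keys used by gsutil for s3:// URLs – placeholder values
[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY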
3. From the GCP machine, run gsutil to copy the data from Google Cloud Storage to AWS S3:
gsutil rsync -d -r gs://your-gs-bucket s3://your-aws-bucket
4. On the AWS side, run the following command to extract the embedded schema from one of the Avro files:
java -jar ~/avro-tools-1.7.7.jar getschema myAvro_20180526.avro > myAvro_20180526.avsc
5. Copy the myAvro_20180526.avsc file to a bucket on AWS S3.
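For example, assuming the AWS CLI is configured on the machine, you can copy it to the same path used by the CREATE TABLE statement below (bucket and key names are placeholders):
aws s3 cp myAvro_20180526.avsc s3://your-aws-bucket/avsc/g.avsc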
6. In Hive on AWS (EMR), create the table as follows:
CREATE EXTERNAL TABLE g
STORED AS AVRO
LOCATION 's3://your-aws-bucket/'
TBLPROPERTIES ('avro.schema.url'='s3://your-aws-bucket/avsc/g.avsc');
7. In Hive on AWS, run a query to confirm the data:
select * from g limit 1;
8. In Hive on AWS, run the following to see the table attributes:
show create table g;
Alternative method to create a table on top of the Avro data:
- Make sure the AWS bucket name is included in the name of your table.
- Use the AWS Glue crawler wizard to crawl the data and infer the schema on the fly, then query it via Athena.
Why you should not try to create the table yourself (manually):
Here is the original spec of the schema: https://support.google.com/analytics/answer/3437719?hl=en
As you can see, there are multiple duplicate column names, which makes it extremely hard to parse the data manually. Look at the hit-level data and the user-level data: both have similar column names.
Traversing the data, assuming a complex array data structure (non-trivial):
Example of traversing the Google Analytics ga_sessions table in AWS Athena:
Athena has a ROW data type, which is the equivalent of STRUCT in BigQuery.
select hits_page.hostname from
(
SELECT
visitnumber,
--TRY(hits[2]) AS hit_row2,
--TRY(hits[3]) AS hit_row3,
hits[1].page as page,
hits[1].transaction as transaction,
CAST(hits[1].page AS ROW(pagePath VARCHAR,
hostname VARCHAR,
pageTitle VARCHAR,
searchKeyword VARCHAR,
searchcategory VARCHAR,
pagePathLevel1 VARCHAR,
pagePathLevel2 VARCHAR,
pagePathLevel3 VARCHAR,
pagePathLevel4 VARCHAR
)) AS hits_page
FROM ga_session_table)
where hits_page.hostname !=''
limit 100
Another way to query nested data in Athena:
SELECT hit.page.pagepathlevel1 FROM omid_ga_session_test CROSS JOIN UNNEST(hits) AS t (hit);
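If you need several nested fields per hit, the same UNNEST pattern extends naturally. The query below is only a sketch against the same table as above, and it assumes the hit.page record exposes hostname and pagepath fields (as in the GA export schema):
SELECT visitnumber,
       hit.page.hostname,
       hit.page.pagepath
FROM omid_ga_session_test
CROSS JOIN UNNEST(hits) AS t (hit)
WHERE hit.page.hostname <> ''   -- keep only hits that have a hostname
LIMIT 100;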
——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,
feel free to contact me via LinkedIn: