Reasons to work with AVRO
- good for nested data structures, even in analytics use cases
- good for loading data into Google BigQuery
- good as a Google BigQuery external data source when you need a non-CSV format
Reasons NOT to work with AVRO
- Row-based, so generally not ideal for analytics workloads.
- Doesn't support GZIP; Avro container files use their own codecs, such as deflate and snappy.
Several ways to create an AVRO table
CREATE TABLE doctors
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    { "name": "number", "type": "int", "doc": "Order of playing the role" },
    { "name": "first_name", "type": "string", "doc": "first name of actor playing role" },
    { "name": "last_name", "type": "string", "doc": "last name of actor playing role" },
    { "name": "extra_field", "type": "string", "doc": "an extra field not in the original file", "default": "fishfingers and custard" }
  ]
}');
Another way, using an external schema file:
CREATE TABLE my_avro_table (notused INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='file:///tmp/schema.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
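For reference, the file behind avro.schema.url (file:///tmp/schema.avsc above) is a plain JSON Avro schema. A minimal sketch matching a table with a single int column might look like this (the namespace is made up for illustration):

```json
{
  "namespace": "com.example",
  "name": "my_avro_table",
  "type": "record",
  "fields": [
    { "name": "notused", "type": "int" }
  ]
}
```

Note that when avro.schema.url (or avro.schema.literal) is set, the AvroSerDe derives the table's columns from the Avro schema, not from the column list in the DDL; that is why the column above is literally named notused.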
AVRO with compression and partitions example:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
Be sure to test the size of the output to see if it matches your needs: snappy is faster, deflate is slightly more compact. 🙂
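Putting those two settings together, a typical write flow looks like the sketch below (my_avro_table and my_source_table are placeholder names, not from the example above):

```sql
-- enable compressed output for this session
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;  -- or deflate for slightly smaller files

-- write compressed Avro files into the target table
INSERT OVERWRITE TABLE my_avro_table
SELECT * FROM my_source_table;
```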
CREATE EXTERNAL TABLE dfp_test_partitions_avro (
  `DATE` date,
  `ADVERTISER_NAME` string,
  `ORDER_ID` int,
  `ORDER_NAME` string,
  `AD_UNIT_NAME` string,
  `COUNTRY_NAME` string,
  `LINE_ITEM_NAME` string,
  `CREATIVE_SIZE` string,
  `CREATIVE_SIZE_DELIVERED` string,
  `LINE_ITEM_TYPE` string,
  `DAY` int,
  `ADVERTISER_ID` int,
  `AD_UNIT_ID` int,
  `COUNTRY_CRITERIA_ID` int,
  `LINE_ITEM_ID` int,
  `LINE_ITEM_GOAL_QUANTITY` string,
  `LINE_ITEM_LIFETIME_IMPRESSIONS` string,
  `TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS` int,
  `TOTAL_LINE_ITEM_LEVEL_CLICKS` int,
  `TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE` decimal
)
PARTITIONED BY (dt string)
STORED AS AVRO
LOCATION 'gs://omid-eu-west-1/dfp_gzip_avro_partition/';
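One caveat for the external partitioned table above: Hive does not discover partitions on its own, so after files land under the table location you have to register them. A sketch, with an example dt value:

```sql
-- register a single partition explicitly
ALTER TABLE dfp_test_partitions_avro
  ADD IF NOT EXISTS PARTITION (dt='2019-01-01')
  LOCATION 'gs://omid-eu-west-1/dfp_gzip_avro_partition/dt=2019-01-01/';

-- or let Hive scan the table location and add all missing partitions
MSCK REPAIR TABLE dfp_test_partitions_avro;
```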
Contact me for more details or questions (I would love to help):
- Omid Vahdaty
- +972–54–2384178
Want more big data quality content? Join our meetups and subscribe to our YouTube channels:
- https://www.meetup.com/AWS-Big-Data-Demystified/
- https://www.meetup.com/Big-Data-Demystified/
- Big Data Demystified YouTube
- AWS Big Data Demystified YouTube
- https://big-data-demystified.ninja/