Reasons to work with AVRO
- good for nested data structures, even in analytics use cases
- good for importing into Google BigQuery
- good as a Google BigQuery external data source that isn't CSV
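For example, loading Avro files from Cloud Storage into BigQuery is a one-liner with the bq CLI; since Avro files carry their own schema, no schema argument is needed (dataset, table, and bucket names below are placeholders):

```shell
# Load an Avro file from GCS into a BigQuery table.
# The schema is read from the Avro file itself.
bq load --source_format=AVRO my_dataset.my_table gs://my-bucket/path/data.avro
```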
Reasons NOT to work with AVRO
- Row-based, so generally not a good fit for analytics workloads.
- Does not support GZIP directly (its deflate codec uses the same underlying algorithm).
Several ways to create an AVRO table
CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    {
      "name": "number",
      "type": "int",
      "doc": "Order of playing the role"
    },
    {
      "name": "first_name",
      "type": "string",
      "doc": "first name of actor playing role"
    },
    {
      "name": "last_name",
      "type": "string",
      "doc": "last name of actor playing role"
    },
    {
      "name": "extra_field",
      "type": "string",
      "doc": "an extra field not in the original file",
      "default": "fishfingers and custard"
    }
  ]
}');
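Once created, the table behaves like any other Hive table. A hedged usage sketch (the data path is a placeholder): rows written before extra_field existed come back with its declared default, which is the point of schema evolution.

```sql
-- Load existing Avro files into the table, then query it.
LOAD DATA INPATH '/user/hive/avro/doctors.avro' INTO TABLE doctors;

-- Records lacking extra_field return "fishfingers and custard".
SELECT number, first_name, last_name, extra_field FROM doctors;
```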
Another way, pointing Hive at an external schema file:
CREATE TABLE my_avro_table(notused INT)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='file:///tmp/schema.avsc')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
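The file referenced by avro.schema.url is a plain Avro JSON schema. A minimal /tmp/schema.avsc might look like this (the namespace and field names here are illustrative, not from the original post):

```json
{
  "namespace": "example.avro",
  "name": "my_record",
  "type": "record",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
```

Note that the dummy column list in the CREATE TABLE (notused INT) is effectively ignored: the AvroSerDe derives the table's columns from the Avro schema.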
AVRO with compression and partitions example:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
Be sure to test the size of the output to see if it matches your needs 🙂 Snappy is faster; deflate is slightly more compact. 🙂
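If you prefer deflate, the codec is set the same way; as far as I understand the Avro MapReduce properties, the compression level can also be tuned (treat the property name as an assumption worth verifying on your Hive version):

```sql
SET hive.exec.compress.output=true;
SET avro.output.codec=deflate;
-- Deflate level 1-9: higher means smaller output but slower writes.
SET avro.mapred.deflate.level=9;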
CREATE EXTERNAL TABLE dfp_test_partitions_avro
(
`DATE` date ,
`ADVERTISER_NAME` STRING ,
`ORDER_ID` int ,
`ORDER_NAME` STRING ,
`AD_UNIT_NAME` STRING ,
`COUNTRY_NAME` STRING ,
`LINE_ITEM_NAME` STRING ,
`CREATIVE_SIZE` STRING ,
`CREATIVE_SIZE_DELIVERED` STRING ,
`LINE_ITEM_TYPE` STRING ,
`DAY` int ,
`ADVERTISER_ID` int ,
`AD_UNIT_ID` int ,
`COUNTRY_CRITERIA_ID` int ,
`LINE_ITEM_ID` int ,
`LINE_ITEM_GOAL_QUANTITY` STRING ,
`LINE_ITEM_LIFETIME_IMPRESSIONS` STRING ,
`TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS` int ,
`TOTAL_LINE_ITEM_LEVEL_CLICKS` int ,
`TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE` decimal
)
PARTITIONED BY (dt string)
STORED AS AVRO
LOCATION 'gs://omid-eu-west-1/dfp_gzip_avro_partition/';
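Since this is an external partitioned table, Hive won't see data in a partition directory until that partition is registered. A sketch (the dt value below is a placeholder):

```sql
-- Register a single partition directory explicitly...
ALTER TABLE dfp_test_partitions_avro
  ADD PARTITION (dt='2019-01-01')
  LOCATION 'gs://omid-eu-west-1/dfp_gzip_avro_partition/dt=2019-01-01/';

-- ...or discover all directories that follow the dt=... naming convention.
MSCK REPAIR TABLE dfp_test_partitions_avro;
```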
Contact me for more details or questions (I would love to help):
- Omid Vahdaty
- +972–54–2384178
Want more quality big data content? Join our meetups and subscribe to our YouTube channels:
- https://www.meetup.com/AWS-Big-Data-Demystified/
- https://www.meetup.com/Big-Data-Demystified/
- Big Data Demystified YouTube
- AWS Big Data Demystified YouTube
- https://big-data-demystified.ninja/
——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:
