Hive

Working with Avro on Hadoop Hive SQL

Reasons to work with Avro

  1. Good for nested data structures, even in analytics use cases.
  2. Good for importing data into Google BigQuery.
  3. Good as a non-CSV external data source for Google BigQuery (see the sketch after this list).
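
For example, here is a minimal sketch of exposing Avro files on Cloud Storage as a BigQuery external table; the dataset, table, and bucket names are illustrative assumptions, not from this post:

CREATE EXTERNAL TABLE mydataset.dfp_external_avro
OPTIONS (
  -- BigQuery reads the schema directly from the Avro files
  format = 'AVRO',
  uris = ['gs://my-bucket/dfp_avro/*.avro']
);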

Reasons NOT to work with Avro

  1. Row-based, so generally not a good fit for analytical workloads that scan only a few columns.
  2. Does not support GZIP; Avro's common codecs are Deflate and Snappy.

 

Several ways to create an Avro table

CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    {
      "name":"number",
      "type":"int",
      "doc":"Order of playing the role"
    },
    {
      "name":"first_name",
      "type":"string",
      "doc":"first name of actor playing role"
    },
    {
      "name":"last_name",
      "type":"string",
      "doc":"last name of actor playing role"
    },
    {
      "name":"extra_field",
      "type":"string",
      "doc":"an extra field not in the original file",
      "default":"fishfingers and custard"
    }
  ]
}');
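
Once the table exists, a quick sanity check (no data required) is to confirm that Hive derived the columns from the Avro schema; the output sketched below assumes the AvroSerDe's usual mapping of "doc" fields to column comments:

DESCRIBE doctors;
-- number        int     Order of playing the role
-- first_name    string  first name of actor playing role
-- last_name     string  last name of actor playing role
-- extra_field   string  an extra field not in the original file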

Another way, pointing Hive at an external schema file instead of an inline literal:

CREATE TABLE my_avro_table(notused INT)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
    'avro.schema.url'='file:///tmp/schema.avsc')
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
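
Note that the dummy column list (notused INT) is ignored once avro.schema.url is set; the schema file defines the actual columns. For reference, /tmp/schema.avsc is plain Avro JSON; a minimal example that would work here (the field name is an illustrative assumption) looks like:

{
  "namespace": "testing.hive.avro.serde",
  "name": "my_avro_table",
  "type": "record",
  "fields": [
    {"name": "example_field", "type": "string"}
  ]
}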

 

Avro with compression and partitions example:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

Be sure to test the size of the output to see if it matches your needs 🙂 Snappy is faster; Deflate is slightly more compact.
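
Once data has been written, one way to check file sizes without leaving the Hive CLI is the dfs command (the bucket path is the example one from the table below, and this assumes the GCS connector is configured):

dfs -du -h gs://omid-eu-west-1/dfp_gzip_avro_partition/ ;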

CREATE EXTERNAL TABLE dfp_test_partitions_avro
(
`DATE` date ,
`ADVERTISER_NAME` STRING ,
`ORDER_ID` int ,
`ORDER_NAME` STRING ,
`AD_UNIT_NAME` STRING ,
`COUNTRY_NAME` STRING ,
`LINE_ITEM_NAME` STRING ,
`CREATIVE_SIZE` STRING ,
`CREATIVE_SIZE_DELIVERED` STRING ,
`LINE_ITEM_TYPE` STRING ,
`DAY` int ,
`ADVERTISER_ID` int ,
`AD_UNIT_ID` int ,
`COUNTRY_CRITERIA_ID` int ,
`LINE_ITEM_ID` int ,
`LINE_ITEM_GOAL_QUANTITY` STRING ,
`LINE_ITEM_LIFETIME_IMPRESSIONS` STRING ,
`TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS` int ,
`TOTAL_LINE_ITEM_LEVEL_CLICKS` int ,
`TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE` decimal )

PARTITIONED BY (dt string)
STORED AS AVRO
LOCATION 'gs://omid-eu-west-1/dfp_gzip_avro_partition/';
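
After creating an external partitioned table, the partitions still have to be populated or registered. Two common options, sketched below with a hypothetical source table name:

-- Option 1: load from another table using dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE dfp_test_partitions_avro PARTITION (dt)
SELECT *, CAST(`date` AS STRING) AS dt FROM dfp_test_source;

-- Option 2: if Avro files already sit under .../dt=YYYY-MM-DD/ directories,
-- just register the existing partitions
MSCK REPAIR TABLE dfp_test_partitions_avro;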

 

Contact me for more details or questions (I'd love to help).

Want more quality big data content? Join our meetup and subscribe to our YouTube channels.

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/
