Hive

Working with Avro on Hadoop Hive SQL

Reasons to work with  AVRO

  1. good for nested data structures even in analytics use cases
  2. good for Google Big Query, for import
  3. good for Google Big Query as an external source which is not CSV.

Reasons NOT to work with AVRO

  1. Row based , generally, not good for analytics.
  2. Doesn’t not support GZIP.

 

Several ways to create AVRO table

CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    {
      "name":"number",
      "type":"int",
      "doc":"Order of playing the role"
    },
    {
      "name":"first_name",
      "type":"string",
      "doc":"first name of actor playing role"
    },
    {
      "name":"last_name",
      "type":"string",
      "doc":"last name of actor playing role"
    },
    {
      "name":"extra_field",
      "type":"string",
      "doc:":"an extra field not in the original file",
      "default":"fishfingers and custard"
    }
  ]
}');

Another way

CREATE TABLE my_avro_table(notused INT)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
    'avro.schema.url'='file:///tmp/schema.avsc')
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

 

AVRO with compression and partitions example:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

be sure to test the size of the output to see if matches your needs 🙂 snappy is faster, deflate is slightly more compact. 🙂

CREATE external TABLE dfp_test_partitions_avro
(
`DATE` date ,
`ADVERTISER_NAME` STRING ,
`ORDER_ID` int ,
`ORDER_NAME` STRING ,
`AD_UNIT_NAME` STRING ,
`COUNTRY_NAME` STRING ,
`LINE_ITEM_NAME` STRING ,
`CREATIVE_SIZE` STRING ,
`CREATIVE_SIZE_DELIVERED` STRING ,
`LINE_ITEM_TYPE` STRING ,
`DAY` int ,
`ADVERTISER_ID` int ,
`AD_UNIT_ID` int ,
`COUNTRY_CRITERIA_ID` int ,
`LINE_ITEM_ID` int ,
`LINE_ITEM_GOAL_QUANTITY` STRING ,
`LINE_ITEM_LIFETIME_IMPRESSIONS` STRING ,
`TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS` int ,
`TOTAL_LINE_ITEM_LEVEL_CLICKS` int ,
`TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE` decimal )

STORED AS AVRO
PARTITIONED BY (dt string)
LOCATION ‘gs://omid-eu-west-1/dfp_gzip_avro_partition/

 

contact me for more details or questions (would love to help)

Want more big data quality content? Join our meetup, subscribe to youtube channels

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s