AWS, AWS Athena, AWS Aurora, AWS Big Data Demystified, AWS EMR, AWS Lambda, AWS Redshift, Hive, meetup, Uncategorised

200KM/h overview on Big Data in AWS | Part 2

In this lecture we cover the AWS Big Data PaaS technologies used to model and visualize data, walk through a suggested architecture, and review some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS, AWS Athena, AWS Aurora, AWS Big Data Demystified, AWS EMR, AWS Lambda, AWS Redshift, Hive

200KM/h overview on Big Data in AWS | Part 1

In this lecture we cover the AWS Big Data PaaS technologies used to ingest and transform data. We also demonstrate a business use case, a suggested architecture, and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS EMR, Hive

Cherry pick source files in Hive external table example

A cool way to filter which files in your bucket are picked up by a Hive external table:

CREATE EXTERNAL TABLE mytable1 ( a string, b string, c string )
STORED AS TEXTFILE
LOCATION 's3://my.bucket/' 'folder/2009.*\.bz2$';
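Note that stock Hive expects LOCATION to point at a single directory, so the two-string pattern above may not work on every Hive/EMR release. A portable alternative, sketched here with the same table and bucket names, is to point the table at the folder and filter rows on the INPUT__FILE__NAME virtual column:

CREATE EXTERNAL TABLE mytable1 (a string, b string, c string)
STORED AS TEXTFILE
LOCATION 's3://my.bucket/folder/';

-- keep only rows that came from the 2009 bzip2 files
SELECT a, b, c
FROM mytable1
WHERE INPUT__FILE__NAME RLIKE '2009.*\\.bz2$';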

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

Hive

Working with Avro on Hadoop Hive SQL

Reasons to work with AVRO

  1. Good for nested data structures, even in analytics use cases.
  2. Good as an import format for Google BigQuery.
  3. Good as a Google BigQuery external source when the data is not CSV.

Reasons NOT to work with AVRO

  1. Row-based format, so generally not a good fit for analytics.
  2. Does not support GZIP (codecs such as deflate and Snappy are used instead).

 

Several ways to create an AVRO table

CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    {
      "name":"number",
      "type":"int",
      "doc":"Order of playing the role"
    },
    {
      "name":"first_name",
      "type":"string",
      "doc":"first name of actor playing role"
    },
    {
      "name":"last_name",
      "type":"string",
      "doc":"last name of actor playing role"
    },
    {
      "name":"extra_field",
      "type":"string",
      "doc:":"an extra field not in the original file",
      "default":"fishfingers and custard"
    }
  ]
}');

Another way

CREATE TABLE my_avro_table(notused INT)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
    'avro.schema.url'='file:///tmp/schema.avsc')
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
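On Hive 0.14 and later there is also a shorter form: declare the columns as usual and let Hive derive the Avro schema for you. The table and column names below are just an illustration:

CREATE TABLE my_avro_table_short (
  number INT,
  first_name STRING,
  last_name STRING)
STORED AS AVRO;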

 

AVRO with compression and partitions example:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

Be sure to test the size of the output to see if it matches your needs 🙂 Snappy is faster; deflate is slightly more compact.

CREATE EXTERNAL TABLE dfp_test_partitions_avro
(
`DATE` date ,
`ADVERTISER_NAME` STRING ,
`ORDER_ID` int ,
`ORDER_NAME` STRING ,
`AD_UNIT_NAME` STRING ,
`COUNTRY_NAME` STRING ,
`LINE_ITEM_NAME` STRING ,
`CREATIVE_SIZE` STRING ,
`CREATIVE_SIZE_DELIVERED` STRING ,
`LINE_ITEM_TYPE` STRING ,
`DAY` int ,
`ADVERTISER_ID` int ,
`AD_UNIT_ID` int ,
`COUNTRY_CRITERIA_ID` int ,
`LINE_ITEM_ID` int ,
`LINE_ITEM_GOAL_QUANTITY` STRING ,
`LINE_ITEM_LIFETIME_IMPRESSIONS` STRING ,
`TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS` int ,
`TOTAL_LINE_ITEM_LEVEL_CLICKS` int ,
`TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE` decimal )

PARTITIONED BY (dt string)
STORED AS AVRO
LOCATION 'gs://omid-eu-west-1/dfp_gzip_avro_partition/';
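To see the codec settings and the dt partition working together, here is a minimal population sketch. It assumes a staging table named dfp_test_staging with the same columns plus dt as its last column; that table name is an illustration, not part of the original post:

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

-- dt must be the last column selected so Hive can resolve the dynamic partition
INSERT OVERWRITE TABLE dfp_test_partitions_avro PARTITION (dt)
SELECT *
FROM dfp_test_staging;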

 

Contact me for more details or questions (I would love to help).

Want more quality big data content? Join our meetup and subscribe to our YouTube channels.

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

Hive

AWS EMR Hive CREATE EXTERNAL TABLE with dynamic partitioning: a transformation job example in SQL

This example shows the power of dynamic partitioning in Hive.

set hive.exec.dynamic.partition.mode=nonstrict;
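Depending on the Hive release, dynamic partitioning itself may also need to be enabled explicitly before the nonstrict mode takes effect:

set hive.exec.dynamic.partition=true;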

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.myTable(
  GAID string,
  leave_timestamp string,
  latitude string,
  longitude string,
  stay_time string,
  country string,
  city string,
  Street string,
  house string,
  Home_Country string,
  Home_City string,
  Home_Neighborhood string,
  Home_Zip_Code string,
  Office_Country string,
  Office_City string,
  Office_Neighborhood string,
  Office_Zip_Code string,
  Zip_of_the_location string,
  POI_Name string,
  POI_Type string,
  POI_Tag_Value string,
  Altitude string,
  Server_Version string,
  Ver_No string)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3://myBucket/production/rawdata/online/';

MSCK REPAIR TABLE sampledb.myTable;

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.myTable_parquet(
  GAID string,
  leave_timestamp string,
  latitude string,
  longitude string,
  stay_time string,
  country string,
  city string,
  Street string,
  house string,
  Home_Country string,
  Home_City string,
  Home_Neighborhood string,
  Home_Zip_Code string,
  Office_Country string,
  Office_City string,
  Office_Neighborhood string,
  Office_Zip_Code string,
  Zip_of_the_location string,
  POI_Name string,
  POI_Type string,
  POI_Tag_Value string,
  Altitude string,
  Server_Version string,
  Ver_No string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://myBucket/production/parquetdata/online_new/';

MSCK REPAIR TABLE sampledb.myTable_parquet;

INSERT OVERWRITE TABLE sampledb.myTable_parquet PARTITION (dt)
SELECT
  regexp_replace(gaid, '"', ''),
  regexp_replace(leave_timestamp, '"', ''),
  regexp_replace(latitude, '"', ''),
  regexp_replace(longitude, '"', ''),
  regexp_replace(stay_time, '"', ''),
  regexp_replace(country, '"', ''),
  regexp_replace(regexp_replace(city, '"', ''), "'", ""),
  regexp_replace(street, '"', ''),
  regexp_replace(house, '"', ''),
  regexp_replace(Home_Country, '"', ''),
  regexp_replace(regexp_replace(home_city, '"', ''), "'", ""),
  regexp_replace(Home_Neighborhood, '"', ''),
  regexp_replace(Home_Zip_Code, '"', ''),
  regexp_replace(Office_Country, '"', ''),
  regexp_replace(regexp_replace(office_city, '"', ''), "'", ""),
  regexp_replace(Office_Neighborhood, '"', ''),
  regexp_replace(Office_Zip_Code, '"', ''),
  regexp_replace(Zip_of_the_location, '"', ''),
  regexp_replace(POI_Name, '"', ''),
  regexp_replace(POI_Type, '"', ''),
  regexp_replace(POI_Tag_Value, '"', ''),
  regexp_replace(Altitude, '"', ''),
  regexp_replace(server_version, '"', ''),
  regexp_replace(ver_no, '"', ''),
  dt
FROM sampledb.myTable
WHERE dt >= current_date() - interval '3' day;
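After the INSERT completes, a quick sanity check (a sketch, adjust to your naming) confirms that the last three days of partitions landed in the Parquet table:

SHOW PARTITIONS sampledb.myTable_parquet;

-- row count per freshly rewritten partition
SELECT dt, count(*) AS row_count
FROM sampledb.myTable_parquet
WHERE dt >= current_date() - interval '3' day
GROUP BY dt;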

Other good Hive CREATE EXTERNAL TABLE examples:

https://community.hortonworks.com/questions/28856/hive-table-format-and-compression.html

——————————————————————————————————————————

I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/