
When to use BigQuery? When not to use BigQuery?

When to use BigQuery?

  1. Roughly 80% of all data use cases are a good fit for BigQuery.
  2. If you are not sure, start with BigQuery.
  3. If you are using Google Analytics or Firebase.
  4. When you have a direct connector that loads the data into BigQuery for you.
  5. When your production environment is in GCP.

When not to use BigQuery [note that the cases below are the remaining 20% or less; extreme cases that are not common]:

  1. When you are processing more than one 5 GB compressed file per hour (BigQuery's per-file load limit for compressed files). Try decompressing a 6 GB compressed file and splitting it into smaller files (see the sketch after this list): your one-hour window will be gone before you know it, as the file uncompresses into something huge of around 20 GB.
  2. When you hit the BigQuery error saying your query is consuming too many resources on a weekly basis, and there is nothing you can do about it.
  3. When you need to self-join billions of records on a regular basis.
  4. When you are using complex window functions. You are likely to get an error saying too many resources are being used by your query, and there is nothing you can do except rewrite it.
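If you do hit the per-file compressed load limit, one workaround is to decompress the export and re-split it into smaller gzipped chunks before loading. A minimal sketch, assuming a line-based CSV export and hypothetical file names:

import gzip

# Hypothetical file names: split one huge gzipped CSV export into smaller
# gzipped chunks so each part stays under the per-file compressed load limit.
SRC = "huge_export.csv.gz"
LINES_PER_CHUNK = 1_000_000

with gzip.open(SRC, "rt") as src:
    header = src.readline()
    out, chunk, lines = None, 0, 0
    for line in src:
        if out is None or lines >= LINES_PER_CHUNK:
            if out:
                out.close()
            out = gzip.open(f"part_{chunk:04d}.csv.gz", "wt")
            out.write(header)  # repeat the header so every chunk loads on its own
            chunk, lines = chunk + 1, 0
        out.write(line)
        lines += 1
    if out:
        out.close()

Each chunk can then be loaded in parallel and stays comfortably under the limit.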

So what are the alternatives to BigQuery?

  1. Hadoop ecosystem: Dataproc / Cloudera
  2. SQream DB, a database designed to handle huge files and massive joins with surprising speed, simplicity, and cost effectiveness.


AWS EMR and Hadoop Demystified – Comprehensive training program suggestion for Data Engineers in 200KM/h

This blog assumes prior knowledge; it is meant to help the reader design a training program for newbies on AWS EMR and Hadoop. Naturally, my big data perspective is applied here. This blog is FAR FROM BEING PERFECT.

Learn the following in ascending order of importance (in my humble opinion).

Quick introduction to big data in 200 KM/h

Beyond the basics….

Hive vs Presto Demystified

Hive Demystified

EMR Zeppelin & Zeppelin

EMR Yarn Demystified

EMR Spark Demystified

EMR Livy Demystified

EMR Spark and Zeppelin Demystified

RStudio and SparkR Demystified

EMR Spark Application Logging

EMR Monitoring Demystified | EMR Ganglia

EMR Spark Tuning Demystified

EMR Oozie Demystified (not common, use Airflow instead)

——————————————————————————————————————————

I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/


AWS Demystified – Comprehensive training program suggestion for newbies in 200KM/h

This blog assumes prior knowledge; it is meant to help the reader design a training program for newbies on AWS. Naturally, my big data perspective is applied here.

Learn the following in ascending order of importance (in my humble opinion).

General lightweight introduction to AWS & AWS Big Data:

  1. Create an AWS user on an AWS account

Start with this; get the obvious out of the way.

  2. AWS S3 (the GCP equivalent is GCS), AWS S3 CLI

Be sure to understand the basics: upload, download, copy, move, and sync (rsync-style), from both the GUI and the AWS CLI; a boto3 sketch follows below. Only then move on to advanced features such as lifecycle policies, storage tiers, encryption, etc.
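For scripting the same basics outside the console, a minimal boto3 sketch (the bucket names and keys below are hypothetical):

import boto3

s3 = boto3.client("s3")

# Upload, download and copy an object; roughly `aws s3 cp`.
s3.upload_file("local_report.csv", "my-training-bucket", "raw/report.csv")
s3.download_file("my-training-bucket", "raw/report.csv", "report_copy.csv")
s3.copy({"Bucket": "my-training-bucket", "Key": "raw/report.csv"},
        "my-archive-bucket", "archive/report.csv")

# List what landed under the prefix; roughly `aws s3 ls s3://my-training-bucket/raw/`.
for obj in s3.list_objects_v2(Bucket="my-training-bucket", Prefix="raw/").get("Contents", []):
    print(obj["Key"], obj["Size"])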

  3. AWS EC2 (Elastic Compute Cloud): how to create a machine, how to connect via SSH

Be sure to use a T-type instance to play around, and choose Amazon Linux or Ubuntu. Notice the different OS user names required to SSH into each machine.

Be sure to understand what the following are:

  1. SSH
  2. SSH tunnel
  4. AWS security groups: how to add an IP and a port

Without this section you won't be able to access web/SSH machines.
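As a concrete example, adding an IP and a port to an existing security group via boto3 might look like this (the group ID and CIDR below are hypothetical):

import boto3

ec2 = boto3.client("ec2")

# Open SSH (port 22) from a single office IP on an existing security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.10/32", "Description": "office SSH"}],
    }],
)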

  5. AWS VPC (Virtual Private Cloud): only if you feel comfortable with network architecture; otherwise skip this topic.
  6. AWS RDS (MySQL, Aurora)

Create a MySQL instance, log in, create a table, load data from S3, and export data back to S3 (a minimal sketch of the basics follows below).

Understand the difference between AWS RDS Aurora and AWS RDS MySQL.
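A minimal sketch of the create/insert part with pymysql; the endpoint, credentials, and table below are hypothetical. On Aurora MySQL, loading directly from S3 is typically done with the LOAD DATA FROM S3 statement, given the right IAM role.

import pymysql

# Connect to the RDS MySQL endpoint (placeholder host and credentials).
conn = pymysql.connect(
    host="mydb.abc123xyz.us-east-1.rds.amazonaws.com",
    user="admin",
    password="********",
    database="training",
)
with conn.cursor() as cur:
    # Create a small table and insert a row, just to validate connectivity.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id INT PRIMARY KEY AUTO_INCREMENT,
            event_name VARCHAR(64),
            event_ts DATETIME
        )
    """)
    cur.execute(
        "INSERT INTO events (event_name, event_ts) VALUES (%s, NOW())",
        ("signup",))
conn.commit()
conn.close()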

  7. AWS Redshift: learn how to create a cluster, connect to it, and run a query; understand the basic architecture.
  8. AWS Athena: how to create an external table, how to insert data, partitions, MSCK REPAIR TABLE, Parquet, Avro, querying nested data.
  9. AWS Glue

Be sure to understand the two roles of AWS Glue: a shared metastore and auto-ETL features.

  10. AWS Kinesis (Data Streams, Firehose, Analytics): you need to understand messaging and streaming. I also covered off-topic subjects here, such as Flume and Kafka.
  11. AWS EMR

This is HARD CORE material: highly advanced content for enterprise-grade lectures.

Be sure to understand when to use Hive, when to use Spark SQL, and when to use Spark Core. Moreover, understand the difference between the node types: master node, core node, and task node.
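To make the Spark SQL vs. Spark Core distinction concrete, here is a minimal PySpark sketch; it assumes an EMR cluster with Hive support and a hypothetical metastore table called events:

from pyspark.sql import SparkSession

# Same aggregation expressed twice, to contrast Spark SQL with Spark Core (RDDs).
spark = SparkSession.builder.appName("sparksql-vs-core-demo").enableHiveSupport().getOrCreate()

# Spark SQL: declarative, goes through the Catalyst optimizer, best for most ETL.
daily_counts_df = spark.sql(
    "SELECT event_date, COUNT(*) AS cnt FROM events GROUP BY event_date")

# Spark Core: the same logic hand-written on RDDs; full control, no optimizer.
daily_counts_rdd = (
    spark.table("events").rdd
    .map(lambda row: (row["event_date"], 1))
    .reduceByKey(lambda a, b: a + b))

daily_counts_df.show(5)
print(daily_counts_rdd.take(5))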

  12. AWS IAM: groups, users, role-based architecture, encryption at rest and in motion, resource-based policies, identity-based policies. By now you should have a basic understanding of what IAM is about: identity, authentication, authorization. There are still many fine-grained security issues you need to understand. At the very minimum, be sure to understand the difference between a role and a user, how to write a custom policy, and what a resource-based policy is vs. an identity-based policy (see the policy sketch after this list).
  13. AWS Cloud best practices, lightweight
  14. AWS ELB, ALB, Auto Scaling [advanced]
  15. AWS Route 53, TBD
  16. AWS security
  17. AWS Lambda
  18. AWS SageMaker
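For the IAM item above, a minimal sketch of an identity-based policy attached to a user via boto3 (the user name and bucket ARN are hypothetical). A resource-based policy, by contrast, is attached to the resource itself, for example an S3 bucket policy that names the allowed principals.

import json
import boto3

iam = boto3.client("iam")

# An identity-based policy: attached to a principal (here a user) and
# scoping what that principal may do on which resources.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-training-bucket",      # hypothetical bucket
            "arn:aws:s3:::my-training-bucket/*",
        ],
    }],
}

iam.put_user_policy(
    UserName="data-engineer-trainee",               # hypothetical user
    PolicyName="s3-read-only",
    PolicyDocument=json.dumps(read_only_policy),
)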

TBD subjects:

API Gateway

AWS Cognito.


——————————————————————————————————————————

I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/


DFP Data Transfer Files Use Case | BigQuery 93% Cost Reduction Demystified

DFP Data Transfer Files Use Case:

DFP data transfer delivers impression-level data files to a GCS bucket. The amount of raw data is highly correlated with your web traffic. A big data challenge!

The big data problems with the DFP data transfer files as delivered:

  1. Data transfer file names contain a timestamp in the server's time zone instead of your local time.
  2. Your time zone is in a column called Time. The column contains an unsupported datetime format: '2019-11-12-20:15:17'; notice the "-" between the date and the time.
  3. There are several file types per table: Network and Network Backfill, each with both impressions and clicks, i.e., 4 file names per hour in the same bucket (not in different folders).
  4. All the files of all the tables are supplied in the same bucket.

So… What is the problem with DFP data transfer files?

  1. You can't simply load the data, because there is no date column to partition it by. Loading all the data nightly into an unpartitioned table and then repartitioning it into another table is doable, BUT VERY COSTLY.
  2. You can't transform the data at the GCS level (we are talking about a serious amount of TBs).
  3. How can you orchestrate it? How will you make the process incremental, instead of loading EVERYTHING each night?

So what is the solution for loading DFP data transfer files?

BigQuery sharded table loading via an Airflow dynamic workflow, template_fields, and a loop:

  1. Load incrementally (every day): load each day's 24 files, which share the same date, into a per-day table shard (similar to the ga_sessions_* tables). Base the shard date on the date in the file name.
  2. Use Airflow to load the data via a dynamic workflow.
  3. Once you have the tables, create a view on top of the sharded table with the correct dates (a view sketch is shown after the query below).
  4. Transform the data incrementally.

Airflow dynamic workflow with a loop over operators and templated fields

Also committed in our Git:

import datetime
from airflow import models
from airflow.contrib.operators import gcs_to_bq

# Path to the GCS bucket. No need to add gs://
DST_BUCKET = 'MyBucket'

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # Retries are disabled here; retry_delay applies only if you enable them.
    'retries': 0,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

# Prefix of the destination shard tables; the per-day date suffix is appended below.
network_table = 'MyDataSet.table_shard_'

def daterange(start_date, end_date):
    """Yield every date from start_date (inclusive) to end_date (exclusive)."""
    for n in range(int((end_date - start_date).days)):
        yield start_date + datetime.timedelta(n)

# Start & end date = the delta period to (re)load, e.g. the last 3 days.
delta = -3
start_date = datetime.date.today() + datetime.timedelta(delta)
end_date = datetime.date.today()

with models.DAG('BigQueryShardsLoading',
                schedule_interval='@once',
                default_args=default_dag_args) as dag:
    for single_date in daterange(start_date, end_date):
        load_to_bq_from_gcs_NetworkImpressions = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
            # Each dynamically generated task needs its own unique task_id.
            task_id='load_to_bq_from_gcs_' + single_date.strftime("%Y%m%d"),
            # All of a day's files share the same date prefix in a single bucket.
            source_objects=[
                'my_file_prefix_' + single_date.strftime("%Y%m%d") + '*'
            ],
            skip_leading_rows=1,
            write_disposition='WRITE_TRUNCATE',  # overwrite the day's shard on rerun
            create_disposition='CREATE_IF_NEEDED',
            bucket=DST_BUCKET,
            # One destination table shard per day, e.g. MyDataSet.table_shard_20191112
            destination_project_dataset_table=network_table + single_date.strftime("%Y%m%d"),
            autodetect=True)

Notice the loop: it saves you from writing a duplicate operator per day and lets you dynamically increase or decrease the number of days to overwrite each night.

for single_date in daterange(start_date, end_date):

Also notice the task_id: as the Airflow docs suggest, each dynamically generated task requires its own unique name.

task_id='load_to_bq_from_gcs_'+single_date.strftime("%Y%m%d"),

Also notice source_objects, one of the operator's template_fields as suggested in the Airflow docs; this saves you from repartitioning the files into different folders to distinguish between the 4 different file names.

source_objects=[
    'my_file_prefix_' + single_date.strftime("%Y%m%d") + '*'
],

Also notice destination_project_dataset_table, another of the template_fields suggested in the Airflow docs; this allows you to create a table per day based on the file name, i.e., a shard.

destination_project_dataset_table=network_table + single_date.strftime("%Y%m%d")

From there, parsing the date in the next layer of your DAG should be as simple as:

SELECT
  _TABLE_SUFFIX AS dfp_date,
  CAST(SUBSTR(Time, 0, 10) AS DATE) AS impression_date,
  CAST(SUBSTR(Time, 12, 8) AS TIME) AS impression_time,
  CAST(CONCAT(SUBSTR(Time, 0, 10), 'T', SUBSTR(Time, 12, 8)) AS DATETIME) AS impression_datetime,
  *
FROM
  `MyDataSet.table_shard_*`
LIMIT 1
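And for step 3 above (the view on top of the sharded table), a minimal sketch using the BigQuery Python client; the project ID and view name below are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project id

view_sql = """
CREATE OR REPLACE VIEW `MyDataSet.network_impressions_view` AS
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS dfp_date,
  CAST(SUBSTR(Time, 0, 10) AS DATE) AS impression_date,
  CAST(CONCAT(SUBSTR(Time, 0, 10), 'T', SUBSTR(Time, 12, 8)) AS DATETIME) AS impression_datetime,
  *
FROM `MyDataSet.table_shard_*`
"""

# Run the DDL once; downstream queries can then hit the view instead of the raw shards.
client.query(view_sql).result()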

Just to give a clear perspective:

  1. This job used to cost us $140 per day.
  2. Now it costs about $20 per day.
  3. We have more ideas for decreasing it to $10 using the superQuery Airflow connector; we will disclose them once we implement them.

 

——————————————————————————————————————————

I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn:

https://www.linkedin.com/in/omid-vahdaty/


80% Cost Reduction in Google Cloud BigQuery | Tips and Tricks | Big Query Demystified | GCP Big Data Demystified #2

The second in the GCP Big Data Demystified lecture series. In this lecture I share how I saved 80% of the monthly BigQuery bill of investing.com. Lecture slides:

Videos from the meetup:

Link to previous lecture GCP Big Data Demystified #1

——————————————————————————————————————————

I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

To learn more about superQuery: