
80% Cost Reduction in Google Cloud BigQuery | Tips and Tricks | Big Query Demystified | GCP Big Data Demystified #2

The second lecture in the GCP Big Data Demystified series. In this lecture I will share how I saved 80% of our monthly BigQuery bill. Lecture slides:

Videos from the meetup:

Link to previous lecture GCP Big Data Demystified #1


I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

To learn more about superQuery:


When to use BigQuery? When not to use BigQuery?

When to use BigQuery?

  1. For roughly 80% of all data use cases, BigQuery is a good fit.
  2. If you are not sure, start with BigQuery.
  3. If you are using Google Analytics or Firebase.
  4. When you have a direct connector that loads the data into BigQuery for you.
  5. When your production is in GCP.

When not to use BigQuery:

  1. When you are processing more than one 5GB compressed file per hour (the per-file load limit of GCP BigQuery). Try opening a 6GB compressed file and splitting it into smaller files… your one-hour window of time will be gone, as the file uncompresses into something huge of ±20 GB.
  2. When you are hitting an error in BigQuery that says your query is consuming too many resources, on a weekly basis, and there is nothing you can do about it.
  3. When you need to self-join billions of records on a regular basis.
  4. When you are using complex window functions. You are likely to get an error saying too many resources are being used for your query, and there is nothing you can do except rewrite your query.

So what are the alternatives to BigQuery?

  1. Hadoop ecosystem: Data Proc / Cloudera
  2. SQream DB, a database designed to handle huge files and massive joins with surprising speed, simplicity and cost effectiveness.


AWS EMR and Hadoop Demystified – Comprehensive training program suggestion for Data Engineers in 200KM/h

This blog assumes prior knowledge; it is meant to help the reader design a training program for newbies on AWS EMR and Hadoop. Naturally, my big data perspective is applied here. This blog is FAR FROM BEING PERFECT.

Learn the following in rising order of importance (in my humble opinion).

Quick introduction to big data in 200 KM/h

Beyond the basics….

Hive vs presto Demystified

Hive Demystified

EMR Zeppelin & Zeppelin

EMR Yarn Demystified

EMR Spark Demystified

EMR Livy demystified

EMR Spark and Zeppelin demystified

Rstudio and SparkR demystified

EMR spark Application logging

EMR Monitoring Demystified | EMR Ganglia

EMR spark tuning demystified

EMR Oozie demystified (not common, use airflow instead)


Airflow Use Case: Improving success rate of API calls

Sometimes your ETL needs to call 3rd-party APIs, and success is not guaranteed; e.g. calling the API with too much parallelism results in failures.

First thing to notice, default arguments:

default_dag_args = {
    'start_date': yesterday,
    # Wait exponentially longer between retries, capped at 20 minutes
    'retry_exponential_backoff': True,
    'max_retry_delay': datetime.timedelta(minutes=20),
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project'),
}

You can also override these parameters per operator:

run_report_remotly_status = BashOperator(
    task_id='run_report_remotly_' + temp_date,
    retries=2,
    retry_delay=datetime.timedelta(seconds=30),
    retry_exponential_backoff=True,
    max_retry_delay=datetime.timedelta(minutes=20),
    bash_command=bash_run_report_remotly_cmd,
    trigger_rule="all_done")

The full example is committed in our GITHUB.
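For intuition, with retry_exponential_backoff the wait between retries roughly doubles on each attempt and is capped at max_retry_delay. A simplified sketch of that behavior (an approximation for illustration only; Airflow's actual implementation also adds jitter):

```python
import datetime

def backoff_delay(retry_delay, max_retry_delay, try_number):
    """Approximate wait before retry number `try_number` (1-based)."""
    # Double the base delay on every subsequent retry...
    delay = retry_delay * (2 ** (try_number - 1))
    # ...but never wait longer than the configured cap
    return min(delay, max_retry_delay)

# With retry_delay=5 min and max_retry_delay=20 min the waits are:
# retry 1 -> 5 min, retry 2 -> 10 min, retry 3 onward -> capped at 20 min
```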


AWS Demystified – Comprehensive training program suggestion for newbies in 200KM/h

This blog assumes prior knowledge; it is meant to help the reader design a training program for newbies on AWS. Naturally, my big data perspective is applied here.

Learn the following in rising order of importance (in my humble opinion).

General lightweight introduction to AWS & AWS Big Data:

  1. Create an AWS user on an AWS account

Start with this; get the obvious out of the way.

  2. AWS S3 (GCS), AWS S3 CLI

Be sure to understand the basics: upload, download, copy, move, rsync, from both the GUI and the AWS CLI. Only then move on to advanced features such as lifecycle policies, storage tiers, encryption, etc.

  3. AWS EC2 (Elastic Compute): how to create a machine, how to connect via SSH

Be sure to use a T instance to play around; choose Amazon Linux or Ubuntu. Notice the different OS user names required to SSH into each machine.

Be sure to understand what is:

  1. SSH
  2. SSH tunnel

  4. AWS security groups: how to add an IP and port

Without this section you won't be able to access web/SSH machines.

  5. AWS VPC (virtual private cloud), only if you feel comfortable around network architecture; otherwise skip this topic.
  6. AWS RDS (MySQL, Aurora)

Create a MySQL instance, log in, create a table, insert data from S3, export data to S3.

Understand the difference between AWS RDS Aurora and AWS RDS MySQL.

  7. AWS Redshift: learn how to create a cluster, connect to the cluster, and run a query; understand the basic architecture.
  8. AWS Athena: how to create an external table, how to insert data, partitions, MSCK REPAIR TABLE, Parquet, AVRO, querying nested data.
  9. AWS Glue

Be sure to understand the two roles of AWS Glue: shared metastore and auto-ETL features.

  10. AWS Kinesis: Streams, Firehose, Analytics. You need to understand messaging and streaming. I covered off-topic subjects here such as Flume and Kafka.
  11. AWS EMR (need to talk about with omid).

This is HARD CORE material: highly advanced material for enterprise-grade lectures.

Be sure to understand when to use Hive, when to use SparkSQL, and when to use Spark Core. Moreover, understand the difference between the different node types: core node, task node, master node.

  12. AWS IAM: groups, users, role-based architecture, encryption at rest and in transit, resource-based policy, identity-based policy. By now you should have a basic understanding of what IAM is about: identity, authentication, authorization. However, there are many fine-grained security issues you need to understand. At the very minimum, be sure to understand the difference between a role and a user, how to write a custom policy, and what a resource-based policy is vs an identity-based policy.
  13. AWS Cloud best practices, lightweight
  14. AWS ELB, ALB, Auto Scaling [Advanced]
  15. AWS Route53, TBD
  16. AWS security
  17. AWS Lambda
  18. AWS SageMaker
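To make the resource-based vs identity-based policy distinction concrete: an identity-based policy is attached to a user, group, or role and has no Principal element, while a resource-based policy is attached to the resource itself and must name a Principal. A minimal illustrative S3 bucket policy (the account ID, user name, and bucket name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowGetFromOtherAccount",
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::123456789012:user/example-user" },
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::example-bucket/*"
  }]
}
```

The equivalent identity-based policy would drop the Principal element and be attached to example-user instead of the bucket.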

TBD subjects:

API Gateway

AWS Cognito


Appsflyer Data Locker Data Pipeline use case – How to copy from an AWS S3 bucket to a GCS bucket?

The business use case:

We wanted the data from AppsFlyer, called Data Locker, which is essentially just an AWS S3 bucket. The idea was to sync the data from S3 to our GCS, and from there to BigQuery. You need a dedicated machine with a strong network for rsync (a slow operation that may take even 40 minutes). We split each folder and synced it separately, in parallel, via Airflow.

Set up gsutil authentication to read from AWS S3:

  1. Run in your GCE instance (separate from Airflow):
gsutil config -a

2. Go to the GCP Storage settings:

Select your project.

Select “Interoperability”.

Under “User account HMAC”, create a key.

Copy the access key / secret key into the “gsutil config -a” prompts when asked.

This will create a boto file.


3. Configure the S3 access key and secret key in the boto file (additional configuration) under [Credentials]:

  aws_access_key_id = xxx
  aws_secret_access_key = yyy

4. You should now be able to run any copy/move/sync command with gsutil against your AWS S3 bucket.
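After these steps, the credentials section of the boto file should look roughly like this (all key values are placeholders):

```ini
[Credentials]
# GCS HMAC keys written by "gsutil config -a"
gs_access_key_id = GOOG1EXAMPLE
gs_secret_access_key = zzz
# AWS keys added by hand in step 3
aws_access_key_id = xxx
aws_secret_access_key = yyy
```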

An Airflow example that runs these gsutil commands on a remote machine (it is not healthy to run them on the Airflow machine itself), also committed in our GITHUB:

# this assumes you have configured HMAC authentication in the boto file to access AWS S3 via gsutil

rsync_uninstalls_cmd=			'gcloud beta compute --project 	gap---all-sites-1245 ssh apps-flyer	--internal-ip --zone us-central1-c --command "sudo -u omid gsutil -m rsync  -r -x \".*_SUCCESS.*$\"  s3://af-ext-reports/6abc-acc-SuFd4CoB/data-locker-hourly/t=uninstalls/  gs://data_lake_ingestion_us/apps_flyer/t=uninstalls/"'
rsync_installs_cmd=  			'gcloud beta compute --project 	gap---all-sites-1245 ssh apps-flyer	--internal-ip --zone us-central1-c --command "sudo -u omid gsutil -m rsync  -r -x \".*_SUCCESS.*$\"  s3://af-ext-reports/6abc-acc-SuFd4CoB/data-locker-hourly/t=installs/  gs://data_lake_ingestion_us/apps_flyer/t=installs/"'
rsync_organic_uninstall_cmd=  	'gcloud beta compute --project 	gap---all-sites-1245 ssh apps-flyer	--internal-ip --zone us-central1-c --command "sudo -u omid gsutil -m rsync  -r -x \".*_SUCCESS.*$\"  s3://af-ext-reports/6abc-acc-SuFd4CoB/data-locker-hourly/t=organic_uninstalls/  gs://data_lake_ingestion_us/apps_flyer/t=organic_uninstalls/"'

with models.DAG(
        'appsflyer_data_locker_rsync',  # dag_id; the name here is illustrative
        # Continue to run DAG once per day
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    # dummy operators - proceed only if success
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')
    rsync_uninstalls = BashOperator(task_id='rsync_uninstalls', bash_command=rsync_uninstalls_cmd)
    rsync_installs = BashOperator(task_id='rsync_installs', bash_command=rsync_installs_cmd)
    rsync_organic_uninstall = BashOperator(task_id='organic_uninstall', bash_command=rsync_organic_uninstall_cmd)

    # run the three syncs in parallel
    start >> rsync_uninstalls >> end
    start >> rsync_installs >> end
    start >> rsync_organic_uninstall >> end