
How to create your own SaaS


Author: Galit Koka Elad, 13.9.2020

How to build your own SaaS: what is special about SaaS, and some principles to consider from the perspectives of product management, security, scale, availability, cost and revenue, and more.

Video and slides


——————————————————————————————————————————

Digital Transformation Demystified


Author: Omid Vahdaty, 7.9.2020

What is data-driven decision making, and how do I get there?
What is Cloud Native, and is my product cloud ready?
What is Digital Transformation, and is it relevant to my company?

Video and slides


——————————————————————————————————————————

Guidelines for creating Data Architecture


Author: Omid Vahdaty, 30.8.2020

In this blog I will share my methodology for creating a data architecture.

High-level guidelines

  1. Product discovery: Talk to the management team; understand the product, the types of users, and the types of features.
  2. Data mapping: Map all the operational data tables, their granularity levels, and their foreign keys to other tables. Assume there will be 3rd-party data sources such as GA (free or 360), Firebase Analytics, etc.
  3. Hypothesize: Generate a list of desired granularity, dimensions, metrics, measures, and attributes, based on business questions created by the product and marketing teams.
  4. Create a draft data architecture that is layered (ingestion, transformation, modeling, presentation). Read this blog about Big Data Architecture; a small sketch of the layers follows this list.
  5. Gap analysis: Confirm with the stakeholders that the data needed to build the above architecture actually exists.
  6. Implement the data architecture.
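To make step 4 a little more concrete, here is a minimal sketch of how the four layers might map to datasets in a warehouse such as BigQuery. The dataset names are my own illustration, not a prescribed convention from the methodology above.

LAYERS = {
    'ingestion':      ['raw_ga', 'raw_firebase', 'raw_operational'],  # data landed as-is from each source
    'transformation': ['staging'],                                    # cleaned, deduplicated, joined tables
    'modeling':       ['dwh'],                                        # facts and dimensions at the agreed granularity
    'presentation':   ['reporting'],                                  # aggregated tables / views for BI tools
}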

 


——————————————————————————————————————————

Airflow guidelines and data pipelines best practices


Author: Omid Vahdaty, 30.8.2020

In this blog I am going to share my guidelines for writing good DAGs in Airflow.
I am assuming you are using BigQuery or similar SQL-based data pipelines.

 

A recurring job is a good job

Each DAG should be written in such a way that if you trigger it twice, the resulting data in the table will be the same.
Meaning, no data will be added twice to the same table, i.e. no duplicate rows will be created.
For example, if you are adding data to BigQuery using the "WRITE_APPEND" disposition,
you are much safer if you delete the date range you are about to add before adding the data,
as in the sketch below. This way, the end result will always be the same, even if the job runs twice by mistake instead of just once.
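A minimal sketch of that delete-then-append pattern, assuming Airflow 1.10's contrib BigQueryOperator; the project, dataset and table names are illustrative.

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.utils.dates import days_ago

with DAG('idempotent_daily_load', start_date=days_ago(1), schedule_interval='@daily') as dag:

    # Step 1: remove anything already loaded for this execution date, so a
    # second run of the same date does not duplicate rows.
    delete_day = BigQueryOperator(
        task_id='delete_day',
        sql="DELETE FROM `my_project.my_dataset.fact_events` "
            "WHERE event_date = '{{ ds }}'",
        use_legacy_sql=False,
    )

    # Step 2: append the fresh data for exactly the same date range.
    append_day = BigQueryOperator(
        task_id='append_day',
        sql="SELECT * FROM `my_project.my_dataset.staging_events` "
            "WHERE event_date = '{{ ds }}'",
        destination_dataset_table='my_project.my_dataset.fact_events',
        write_disposition='WRITE_APPEND',
        use_legacy_sql=False,
    )

    delete_day >> append_day

Because the delete and the append cover exactly the same date range, triggering the DAG twice leaves the table in the same state.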

Debuggable job

Each data pipeline should take debugging into account. As opposed to standard development practices with IDEs and debuggers,
data pipelines usually have no debugger, making it much harder to debug a failed pipeline.
Therefore, consider the following:

    1. Keep the temporary files you create during the DAG until the next run of the same DAG.
      By starting the DAG with a cleanup process, you reserve the option to debug a failure in the data
      pipeline using the temporary files (see the sketch after this list).
    2. Have the queries / commands properly printed in the logs, so you can retrace the steps
      of the data pipeline easily from the Airflow GUI and logs.
    3. As for the temporary files, be sure to keep consistent filenames to avoid
      duplications in case of recurring jobs.
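A minimal sketch of these three ideas, assuming the temporary files live in a GCS bucket and Airflow 1.10-style imports; the bucket, paths and query are illustrative.

import logging

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

# A consistent, date-stamped temp location: re-running the same date reuses
# the same path instead of piling up duplicate files.
TMP_PREFIX = 'gs://my-tmp-bucket/daily_load/{{ ds }}'

with DAG('debuggable_daily_load', start_date=days_ago(1), schedule_interval='@daily') as dag:

    # Clean up at the START of the DAG (not the end), so the temporary files
    # of a failed run stay available for debugging until the next run.
    cleanup_previous_tmp = BashOperator(
        task_id='cleanup_previous_tmp',
        bash_command='gsutil -m rm -r -f ' + TMP_PREFIX + ' || true',
    )

    def run_extract(**context):
        # Print the exact query so it can be copied from the task log and
        # replayed manually in BigQuery.
        sql = ("SELECT * FROM `my_project.my_dataset.staging_events` "
               "WHERE event_date = '{ds}'".format(ds=context['ds']))
        logging.info('Running query:\n%s', sql)
        # ...hand `sql` to your BigQuery client or operator of choice here.

    extract = PythonOperator(task_id='extract', python_callable=run_extract,
                             provide_context=True)

    cleanup_previous_tmp >> extract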

 

Write after Delete

When you delete data from a table, insert the new data immediately after. Don't put an Airflow DummyOperator between the delete and the insert (write). The idea is that your data pipeline may sometimes be queued due to lack of resources in your Airflow cluster, leaving the write operator in "Queued" status waiting for resources to be freed. In this scenario you may end up with missing days in your data for a couple of hours or more.

 

Assume the best, prepare for the worst

Assume your DAG will run daily for a long time, but prepare for the worst by thinking about the impact of failed operators in your pipeline. By using Airflow trigger rules and the Airflow BranchPythonOperator, you can prepare the data pipeline logic for the standard success path of the data transfer, plus a fallback plan in case of failure, as sketched below.
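A minimal sketch of that pattern, assuming Airflow 1.10-style imports; the task names and the check inside choose_path are illustrative.

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.utils.dates import days_ago
from airflow.utils.trigger_rule import TriggerRule


def choose_path(**context):
    # Decide between the normal load and the fallback, e.g. based on whether
    # the source data for this execution date actually arrived.
    source_arrived = True  # replace with a real check
    return 'load_data' if source_arrived else 'fallback_alert'


with DAG('branching_example', start_date=days_ago(1), schedule_interval='@daily') as dag:
    branch = BranchPythonOperator(task_id='branch', python_callable=choose_path,
                                  provide_context=True)
    load_data = DummyOperator(task_id='load_data')            # the standard success path
    fallback_alert = DummyOperator(task_id='fallback_alert')  # the fallback plan

    # NONE_FAILED lets the final task run even though one of the two
    # upstream branches was skipped by the BranchPythonOperator.
    finish = DummyOperator(task_id='finish', trigger_rule=TriggerRule.NONE_FAILED)

    branch >> [load_data, fallback_alert]
    [load_data, fallback_alert] >> finish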

"Create if needed"

"CREATE_IF_NEEDED" is a BigQuery create disposition flag that allows the Airflow BigQueryOperator to create a table in BigQuery if it does not exist. The problem is that it will not create a partitioned table. Only use this option if you have no other choice. The preferred way is to create the table yourself via Airflow; this way you can customize the partitioning and clustering. Consider using "CREATE_NEVER" instead, as in the sketch below.
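A minimal sketch of managing the table yourself and then loading with "CREATE_NEVER", assuming Airflow 1.10's contrib BigQueryOperator; the project, dataset and column names are illustrative.

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.utils.dates import days_ago

with DAG('partitioned_table_load', start_date=days_ago(1), schedule_interval='@daily') as dag:

    # Create the target table explicitly, with the partitioning and clustering
    # you actually want (something CREATE_IF_NEEDED will not do for you).
    create_table = BigQueryOperator(
        task_id='create_table_if_missing',
        sql="""
            CREATE TABLE IF NOT EXISTS `my_project.my_dataset.fact_events`
            (
              event_date DATE,
              user_id    STRING,
              revenue    FLOAT64
            )
            PARTITION BY event_date
            CLUSTER BY user_id
        """,
        use_legacy_sql=False,
    )

    # The load itself never creates the table; if it is missing, fail loudly
    # instead of silently creating an unpartitioned one.
    append_day = BigQueryOperator(
        task_id='append_day',
        sql="SELECT * FROM `my_project.my_dataset.staging_events` WHERE event_date = '{{ ds }}'",
        destination_dataset_table='my_project.my_dataset.fact_events',
        write_disposition='WRITE_APPEND',
        create_disposition='CREATE_NEVER',
        use_legacy_sql=False,
    )

    create_table >> append_day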

"Go back in time"

The ability to "go back in time" is super important. Keep the raw data protected with "no delete" policies. Consequently, you will be able to re-model the data as you require in the future.

Data Replay

Sometimes you need to re-run your data pipeline all over again against a specific period in the data lake. Airflow suggests its own approach for this (backfill), which I personally avoid. I will share what I usually do:

    • Have the dates in the "where" clause of the SQL query taken from a parameter. This way you can put the dates parameter in an Airflow Variable and simply change it from the GUI.
    • If you made your job debuggable, you can easily copy-paste the queries from the log, change the "where" clause to what you need, and run them in BigQuery manually.
    • Where applicable (e.g. small tables), use BigQuery views instead of "real" tables with tedious data transfers, or have the DAG written with a replay variable: if the replay variable is set to true, run the full-history rebuild query; otherwise make the query incremental (a sketch follows below).
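A minimal sketch of the replay-variable idea, assuming Airflow Variables named "replay", "replay_start_date" and "replay_end_date"; all names are illustrative.

from airflow.models import Variable


def build_extract_query(**context):
    """Return the day's incremental query, or a full-history rebuild in replay mode."""
    replay = Variable.get('replay', default_var='false').lower() == 'true'
    if replay:
        # Replay: the date range is taken from Variables, editable in the Airflow GUI.
        start_date = Variable.get('replay_start_date')
        end_date = Variable.get('replay_end_date')
    else:
        # Normal incremental run: just the execution date of this DAG run.
        start_date = end_date = context['ds']
    return (
        "SELECT * FROM `my_project.my_dataset.staging_events` "
        "WHERE event_date BETWEEN '{start}' AND '{end}'"
        .format(start=start_date, end=end_date)
    )

The returned query can then be handed to your BigQuery operator of choice, for example via a templated field or XCom.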


——————————————————————————————————————————

Google Analytics Full ETL


Author: Ariel Yosef, 20.8.2020

In this guide I will show you my full ETL for Google Analytics.
First, we will extract our data from the Google Analytics platform via an API request using Python.
In order to make the API request, we first need to enable a few things
(see the official documentation of Google Analytics for reference).

Part 1: Authentication

  1. We need to enable the API in the Google Cloud project we are working on.
    (Make sure the email you are using for Google Analytics is the same email as the project.)
    If you are not familiar with the Google Cloud platform, Google offers a quick and easy tool to enable the API.
    If you are familiar with it, go to "APIs & Services" >> "Library" >> "Google Analytics Reporting API"
    and press "Enable".
    That will redirect you to the API home screen.
    On the left side of the home screen, you will see the "Credentials" tab – press it.
    Press "Create credentials" and choose "Service account".
    Give it a name and press "Create".
    Now we have created an account that has access to the GA API.
  2. You should see the account under the "Service Accounts" tab. Press the newly created account,
    scroll down a bit, and under the "Keys" tab create a key, choosing "JSON".
    After you create the key, it will automatically download as a JSON file.
    (Make sure you don't lose it; it's important for our API call.)

Part 2 : Python

  1. Take the JSON file that we downloaded and copy it to the Python project folder.
  2. Take your "view id" from here and save it; we will use it very soon.
  3. In order to make the call, we need to install a few Python packages:
pip install --upgrade google-api-python-client
pip install --upgrade oauth2client
pip install pandas

  4. Copy this code:

import pandas as pd
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import datetime
import argparse
import sys


argparser = argparse.ArgumentParser(add_help=False)
argparser.add_argument('--start_date', type=str,
                      help=('Start date of the requested date range in '
                            'YYYY-MM-DD format.'))
argparser.add_argument('--end_date', type=str,
                      help=('End date of the requested date range in '
                            'YYYY-MM-DD format.'))
args = argparser.parse_args()

data = []
columns = []
start_date = args.start_date
end_date = args.end_date
SCOPES = 'https://www.googleapis.com/auth/analytics.readonly'
KEY_FILE_LOCATION = 'path to the json file'
VIEW_ID = 'paste here the view id number'
body = {
   'reportRequests': [
       {
           'viewId': VIEW_ID,
           'dateRanges': [{'startDate': start_date, 'endDate': end_date}],
           'metrics': [{'expression': 'your metrics1'},{'expression': 'your metrics2'}],
           'dimensions': [{'name': 'your dimensions1'},{'name': 'your dimensions2'}]
       }]
}
metrics_len = len(body.get('reportRequests')[0].get('metrics'))
dimensions_len = len(body.get('reportRequests')[0].get('dimensions'))





def initialize_analyticsreporting():
   """Initializes an Analytics Reporting API V4 service object.

   Returns:
     An authorized Analytics Reporting API V4 service object.
   """
   credentials = ServiceAccountCredentials.from_json_keyfile_name(
       KEY_FILE_LOCATION, SCOPES)

   # Build the service object.
   analytics = build('analyticsreporting', 'v4', credentials=credentials)

   return analytics


def get_report(analytics):
   """Queries the Analytics Reporting API V4.

   Args:
     analytics: An authorized Analytics Reporting API V4 service object.
   Returns:
     The Analytics Reporting API V4 response.
   """
   return analytics.reports().batchGet(body=body).execute()


def data_extract(response):
   """Flattens the API response into the global `columns` and `data` lists."""
   for report in response.get('reports'):
       columnHeader = report.get('columnHeader', {})
       dimensionHeaders = columnHeader.get('dimensions', [])
       metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
       # Build the column headers once: dimension names first, then metric names.
       if not columns:
           for number in range(dimensions_len):
               columns.append(dimensionHeaders[number])
           for number in range(metrics_len):
               columns.append(metricHeaders[number].get('name'))
       # Flatten each row: dimension values followed by the metric values
       # of the (single) requested date range.
       for row in report.get('data', {}).get('rows', []):
           temp = []
           for number in range(dimensions_len):
               temp.append(row.get('dimensions')[number])
           for number in range(metrics_len):
               temp.append(row.get('metrics')[0].get('values')[number])
           data.append(temp)


def create_csv_file(data, columns):
   df = pd.DataFrame(data=data, columns=columns)
   df.to_csv(start_date + '_' + end_date + ".csv", index=False)


def main():
   analytics = initialize_analyticsreporting()
   response = get_report(analytics)
   data_extract(response)
   create_csv_file(data, columns)


if __name__ == '__main__':
   main()

  5. Edit the placeholder values (KEY_FILE_LOCATION, VIEW_ID, and the metrics and dimensions) with your own options.
In order to find the available metrics and dimensions, go here.

  • Example for dimensions and metrics:
'metrics': [{'expression': 'ga:pageviews'}, {'expression': 'ga:sessionDuration'},
            {'expression': 'ga:users'}, {'expression': 'ga:sessions'}, {'expression': 'ga:bounceRate'}],
'dimensions': [{'name': 'ga:date'}, {'name': 'ga:pagePath'}, {'name': 'ga:source'}, {'name': 'ga:country'}]

Part 3: Test yourself

Lastly, run the Python file with the start and end date arguments like this:

python your_python_file_name.py --start_date 2020-08-08 --end_date 2020-08-08

All done!

Check your project folder; you should see a CSV file created by the script.
The general idea behind this Python script is to combine it with Airflow and send
dynamic dates to the script (this is why I use argparse), as sketched below.
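A minimal sketch of that wiring, assuming the script above is saved as ga_extract.py; the path and DAG name are illustrative.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG('ga_daily_extract', start_date=days_ago(1), schedule_interval='@daily') as dag:
    # "{{ ds }}" renders to the run's execution date, so every daily run
    # extracts exactly its own day from Google Analytics.
    extract_ga = BashOperator(
        task_id='extract_ga',
        bash_command=(
            'python /home/airflow/gcs/dags/ga_extract.py '
            '--start_date {{ ds }} --end_date {{ ds }}'
        ),
    )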


——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,
feel free to contact me via LinkedIn: