Blog

architecture, Big Query, cost reduction, GCP Big Data Demystified, superQuery

80% Cost Reduction in Google Cloud BigQuery | Tips and Tricks | Big Query Demystified | GCP Big Data Demystified #2

The second in the series of lectures GCP Big Data Demystified. In this lecture I will share how I saved 80% of the monthly BigQuery billing of investing.com. Lecture slides:

Videos from the meetup:

Link to previous lecture GCP Big Data Demystified #1

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

To learn more about superQuery:

Uncategorised

How to connect to Google Ad Manager via the Python API | GAM Demystified

  1. To simplify authentication, you need a G Suite Gmail user with admin permission to the GAM console.
  2. Make sure API access is enabled in Google Ad Manager.
  3. You need a network from Google Ad Manager.
  4. Install a GCE machine with Debian. Note that you need Python 3.6 for any version higher than v201911:
sudo apt update
sudo apt -y install python-pip python3-venv python3-pip
sudo apt-get -y install git
git clone https://github.com/googleads/googleads-python-lib.git
pip3 install googleads
pip3 install google-auth-oauthlib

and run an example:

cd ~/googleads-python-lib/examples/ad_manager/v201911/report_service
python3 run_reach_report.py

This will produce an authentication error:

  File "/home/analyticsynet/.local/lib/python3.5/site-packages/googleads/common.py", line 248, in LoadFromStorage
    'Given yaml file, %s, could not be opened.' % path)
googleads.errors.GoogleAdsValueError: Given yaml file, /home/omid/googleads.yaml, could not be opened.

You need to create an OAuth client JSON and get the client ID and client secret; follow these instructions:

https://github.com/googleads/googleads-python-lib#getting-started

Notice that the OAuth consent fields should be left blank and the application type is “Other”, as stated here.

Then test authentication via:

cd ~/googleads-python-lib/examples/ad_manager/authentication

python3 generate_refresh_token.py --client_id INSERT_CLIENT_ID --client_secret INSERT_CLIENT_SECRET

If you are getting an error like the one below, it means you have not chosen application type “Other” in OAuth:

Issue:
The redirect URI in the request, urn:ietf:wg:oauth:2.0:oob, can only be used by a Client ID for native application. It is not allowed for the WEB client type. You can create a Client ID for native application at

Once you do that – you can now authenticate via the googleads.yaml file:

  1. Copy the googleads.yaml file to your home directory.
  2. It will be used to store credentials and other settings that can be loaded to initialize a client.
  3. Update the client ID and client secret inside the YAML, as in the sketch below.
  4. You also need to get your network code from GAM.
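For reference, here is a minimal googleads.yaml sketch for the OAuth flow (the placeholder values are yours to fill in; the refresh token is the one printed by generate_refresh_token.py):

ad_manager:
  application_name: INSERT_APPLICATION_NAME_HERE
  network_code: INSERT_NETWORK_CODE_HERE
  client_id: INSERT_CLIENT_ID_HERE
  client_secret: INSERT_CLIENT_SECRET_HERE
  refresh_token: INSERT_REFRESH_TOKEN_HERE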

If this does not work – try the service account option. This is the high-level process:

  1. Create a service account in https://console.developers.google.com/ . Don’t forget to choose the JSON key type and download the JSON private key – this can only happen once. This will create an email like: omid-test1@myproject.iam.gserviceaccount.com
  2. Add the service account in the Google Ad Manager console, using the email from the step above.
  3. Confirm the user is in Active status before continuing.
  4. Copy the JSON private key to the machine home folder. I called it my_gcp_private_key_service_account.json
  5. Set up the googleads.yaml file:
ad_manager:
  application_name: INSERT_APPLICATION_NAME_HERE
  network_code: INSERT_NETWORK_CODE_HERE
  path_to_private_key_file: INSERT_PATH_TO_FILE_HERE

You can test quickly by adding the JSON path of the private key of the above service account. Update KEY_FILE (your JSON key) and APPLICATION_NAME (your application name):

cd ~/googleads-python-lib/examples/ad_manager/authentication
nano create_ad_manager_client_with_service_account.py

Run the script:

python3 create_ad_manager_client_with_service_account.py 

Expected output:

This library is being run by an unsupported Python version (3.5.3). In order to benefit from important security improvements and ensure compatibility with this library, upgrade to Python 3.6 or higher.
Network with network code "1234" and display name "myRealAppName" was found.

A sample report example with this service account connection is in our Git:

#!/usr/bin/env python
#
# Copyright 2014 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Initializes a AdManagerClient using a Service Account."""


from googleads import ad_manager
from googleads import oauth2
from googleads import errors
import tempfile


# OAuth2 credential information. In a real application, you'd probably be
# pulling these values from a credential storage.
KEY_FILE = '/home/omid/111bd7ea6ae8534.json'

# Ad Manager API information.
APPLICATION_NAME = 'myApp'
NETWORK_CODE='6690'

def main(key_file, application_name):
  oauth2_client = oauth2.GoogleServiceAccountClient(
      key_file, oauth2.GetAPIScope('ad_manager'))

  client = ad_manager.AdManagerClient(
      oauth2_client, application_name, NETWORK_CODE)

  #networks = client.GetService('NetworkService').getAllNetworks()
  #for network in networks:
  #  print('Network with network code "%s" and display name "%s" was found.'
  #        % (network['networkCode'], network['displayName']))

  # Initialize a DataDownloader.
  report_downloader = client.GetDataDownloader(version='v201911')

  # Create report job.
  report_job = {
      'reportQuery': {
          'dimensions': ['DATE', 'AD_UNIT_NAME'],
          'adUnitView': 'HIERARCHICAL',
          'columns': ['AD_SERVER_IMPRESSIONS', 'AD_SERVER_CLICKS',
                      'ADSENSE_LINE_ITEM_LEVEL_IMPRESSIONS',
                      'ADSENSE_LINE_ITEM_LEVEL_CLICKS',
                      'TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS',
                      'TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE'],
          'dateRangeType': 'LAST_WEEK'
      }
  }


  try:
    # Run the report and wait for it to finish.
    report_job_id = report_downloader.WaitForReport(report_job)
  except errors.AdManagerReportError as e:
    print('Failed to generate report. Error was: %s' % e)
    return  # without this, report_job_id below would be unbound on failure

  # Change to your preferred export format.
  export_format = 'CSV_DUMP'

  report_file = tempfile.NamedTemporaryFile(suffix='.csv.gz', delete=False)

  # Download report data.
  report_downloader.DownloadReportToFile(
      report_job_id, export_format, report_file)

  report_file.close()
  
  # Display results.
  print('Report job with id "%s" downloaded to:\n%s' % (
      report_job_id, report_file.name))


if __name__ == '__main__':
  main(KEY_FILE, APPLICATION_NAME)

You could use the date range enums in ReportService.DateRangeType to change the date range, but that would not fit into your Airflow plan. Think dynamic operators, like we did in the SimilarWeb Airflow blog; see the Airflow sketch after the full example below. You may also need to overwrite date partitions.

Let’s say you now wish to add start date and end date command line arguments to the Python script. A full example of Google Ad Manager Python with a date range is committed to our Git and is also below:

import sys, getopt

from datetime import datetime
from datetime import timedelta

from googleads import ad_manager
from googleads import oauth2
from googleads import errors
import tempfile


# OAuth2 credential information. In a real application, you'd probably be
# pulling these values from a credential storage.
KEY_FILE = '/home/omid/my_gcp_private_key_service_account.json'

# Ad Manager API information.
APPLICATION_NAME = 'jutomate'
NETWORK_CODE='6690'

def report(key_file, application_name,startDate,endDate):
  oauth2_client = oauth2.GoogleServiceAccountClient(
      key_file, oauth2.GetAPIScope('ad_manager'))

  client = ad_manager.AdManagerClient(
      oauth2_client, application_name, NETWORK_CODE)

  #networks = client.GetService('NetworkService').getAllNetworks()
  #for network in networks:
  #  print('Network with network code "%s" and display name "%s" was found.'
  #        % (network['networkCode'], network['displayName']))

  # Initialize a DataDownloader.
  report_downloader = client.GetDataDownloader(version='v201911')
  # Set the start and end dates of the report to run (change to what you need).
  start_date = datetime.strptime(startDate, "%Y-%m-%d").date()

  end_date = datetime.strptime(endDate, "%Y-%m-%d").date()

  print('start_date: ', start_date)
  print('end_date: ', end_date)
  
  report_filename_prefix='report_example_using_service_account_with_date_range'
  
  # Create report job.
  report_job = {
      'reportQuery': {
          'dimensions': ['DATE', 'AD_UNIT_NAME'],
          'adUnitView': 'HIERARCHICAL',
          'columns': ['AD_SERVER_IMPRESSIONS', 'AD_SERVER_CLICKS',
                      'ADSENSE_LINE_ITEM_LEVEL_IMPRESSIONS',
                      'ADSENSE_LINE_ITEM_LEVEL_CLICKS',
                      'TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS',
                      'TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE'],
          'dateRangeType': 'CUSTOM_DATE',
          'startDate': start_date,
          'endDate': end_date
      }
  }


  try:
    # Run the report and wait for it to finish.
    report_job_id = report_downloader.WaitForReport(report_job)
  except errors.AdManagerReportError as e:
    print('Failed to generate report. Error was: %s' % e)
    return  # without this, report_job_id below would be unbound on failure

  # Change to your preferred export format.
  export_format = 'CSV_DUMP'

  report_file = tempfile.NamedTemporaryFile(suffix='_'+report_filename_prefix+'_'+startDate+'__'+endDate+'.csv.gz', delete=False)

  # Download report data.
  report_downloader.DownloadReportToFile(
      report_job_id, export_format, report_file)

  report_file.close()
  
  # Display results.
  print('Report job with id "%s" downloaded to:\n%s' % (
      report_job_id, report_file.name))

def main(argv):
   startDate = ''
   endDate = ''
   try:
      opts, args = getopt.getopt(argv,"hs:e:",["start=","end="])
   except getopt.GetoptError:
      print ('example_python_command_line_arguments.py -s <startDate> -e <endDate>')
      sys.exit(2)
   for opt, arg in opts:
      if opt == '-h':
         print ('example_python_command_line_arguments.py -s <startDate> -e <endDate>')
         sys.exit()
      elif opt in ("-s", "--start"):
         startDate = arg
      elif opt in ("-e", "--end"):
         endDate = arg
   print ('start date is ', startDate)
   print ('end   date is ', endDate)
   report(KEY_FILE, APPLICATION_NAME,startDate,endDate)
   
if __name__ == '__main__':
  main(sys.argv[1:])
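To schedule this in Airflow, here is a minimal sketch (the DAG name, task name, and script path are hypothetical; it assumes Airflow 1.10-style imports). Passing the execution date as both start and end lets a backfill re-run a single day and overwrite its date partition:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('gam_report_demystified',
          schedule_interval='@daily',
          start_date=datetime(2020, 1, 1))

# Run the report script for the execution date only, so each DAG run
# produces (and can overwrite) exactly one date partition.
run_gam_report = BashOperator(
    task_id='run_gam_report',
    bash_command='python3 /home/omid/gam_report_with_date_range.py '
                 '--start {{ ds }} --end {{ ds }}',
    dag=dag)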

BI

How to refresh a Tableau extract via the Python API?

In this blog, I am going to share with you how to refresh an extract using API calls.

  1. Install the Tableau Python API packages.
  2. Also via pip: “pip install tableauserverclient”
  3. Use the example Python below.
  4. You are going to need: user, password, Tableau Server IP address, and workbook ID.
  5. The workbook ID is the luid (Workbooks) – this was the hardest thing to get. The idea is to enable a read-only repository user via the tsm command (see Resources below), connect to the Postgres DB of Tableau, and get the IDs; a lookup sketch follows the example below. Credit goes to Tomer Ben Tovim.

The Python package install:

pip install tableauserverclient

The full Tableau API call example is committed in our Git:

# Only tableauserverclient is needed for the refresh call; the tableausdk
# (Extract API) imports are not required for this example.
import tableauserverclient as TSC

user = 'omid'
password = '123'
tableau_auth = TSC.TableauAuth(user, password)
server = TSC.Server('http://10.1.0.1')
server.version = '3.6'  # REST API version
with server.auth.sign_in(tableau_auth):
    print('connection made')
    print(server.version)
    #resource = server.workbooks.get_by_id(resource_id)
    # Trigger an extract refresh by the workbook luid; the sign-in context
    # manager signs out automatically on exit.
    server.workbooks.refresh('37A13D3E-64D9-4F9B-ACD5-2FCB0291BF24')
print('connection closed')
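If you need to look up the workbook luid, here is a minimal sketch querying the Tableau Server Postgres repository directly (hypothetical host and password; it assumes the readonly repository user was enabled via the tsm command in the resources below, the default repository port 8060, and the workgroup database):

# Requires: pip install psycopg2-binary
import psycopg2

# Connect to the Tableau Server repository with the readonly user.
conn = psycopg2.connect(host='10.1.0.1', port=8060, dbname='workgroup',
                        user='readonly', password='INSERT_PASSWORD')
cur = conn.cursor()
# The workbooks table holds both the name and the luid.
cur.execute('SELECT name, luid FROM workbooks')
for name, luid in cur.fetchall():
    print(name, luid)
cur.close()
conn.close()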

Resources:

https://help.tableau.com/current/pro/desktop/en-us/examples_postgresql.htm

Enable the repository:
https://help.tableau.com/current/server/en-us/perf_collect_server_repo.htm
Command: tsm data-access repository-access enable --repository-username readonly --repository-password

https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_concepts_versions.htm

https://community.tableau.com/thread/287767

https://help.tableau.com/current/api/extract_api/en-us/Extract/extract_api_installing.htm

https://help.tableau.com/current/api/extract_api/en-us/Extract/extract_api_using_python.htm

Some other useful Tableau-related blogs:

How to install tableau on GCP

Tableau Demystified | Quick introduction in 10 minutes

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS, AWS athena, AWS Aurora, AWS Big Data Demystified, AWS EMR, AWS Lambda, AWS Redshift, Hive, meetup, Uncategorised

200KM/h overview on Big Data in AWS | Part 2

In this lecture we are going to cover AWS Big Data PaaS technologies used to model and visualize data, using a suggested architecture and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS, AWS athena, AWS Aurora, AWS Big Data Demystified, AWS EMR, AWS Lambda, AWS Redshift, Hive

200KM/h overview on Big Data in AWS | Part 1

In this lecture we are going to cover AWS Big Data PaaS technologies used to ingest and transform data. Moreover, we are going to demonstrate a business use case, a suggested architecture, and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

BI, GCP

Tableau Demystified | Install Tableau Server on GCP

This is a straightforward “cut the bullshit and give me what I need” manual for installing Tableau on a Windows machine on GCP. Naturally, I added my personal tips.

  1. Be sure to install Windows with Desktop (I may try later with Linux and more complex HW environments, stay tuned).
  2. Machine type – minimum 8 cores (to maximize network bandwidth), 30 GB RAM, 1024 GB SSD to maximize IO. Be sure to install the machine in a region close to your GCP BigQuery dataset region. Assuming public IP 1.2.3.4.
  3. Assign a password to the user.
  4. Download the Chrome RDP plugin for GCP (optional).
  5. Make sure you added port 3389 to your network ingress ports.
  6. RDP to the machine and turn off Windows Firewall.
  7. Download the Tableau Server installer and run it.
  8. Logging in to TSM requires a Windows user in the admin group. I created a new user with an easier password for this installation.
  9. You need a license for Tableau Server; usually, you can use it once for PROD and twice for NON PROD. Note there is a 14-day trial option. There is also an offline license activation option for air-gapped environments.
  10. Once installation is done, configure the admin user and open port 80 to the instance. Access the server from your desktop using a web browser and your server IP, e.g. http://1.2.3.4.
  11. From Tableau Desktop, sign in to Tableau Server using the credentials from step 10. You should be able to publish now.
  12. Notice that when you create a scheduled extract, expected metrics are 40% CPU utilization(!) and about 15 GB RAM. Network will unlikely be a bottleneck, but will oscillate from 0 to 5 Mbps (about 10,000 per second). This blog is about extract optimization. Also log in to TSM and add more process instances based on the number of CPU cores.
  13. Notice that after a restart, you need to log in to TSM and start Tableau Server.

(Image: example of CPU core allocation for Tableau.)

Video Blog on getting started on Tableau Desktop:

Tableau Demystified | Quick introduction in 10 minutes

Resources to improve performance of Tableau extracts:

https://help.tableau.com/current/server/en-us/install_config_top.htm

https://www.tableau.com/about/blog/2018/4/zulilys-top-10-tips-self-service-analytics-google-bigquery-and-tableau-84969

https://community.tableau.com/docs/DOC-23150

https://community.tableau.com/docs/DOC-23161

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/