Google Analytics Full ETL

Author: Ariel Yosef, 20.8.2020

In this guide I will show you my full ETL for Google Analytics.
First, we will extract our data from the Google Analytics platform via an API request using Python.
In order to make an API request, we first need to enable a few things
(see the official Google Analytics Reporting API documentation).

Part 1: Authentication

  1. We need to enable the API for the Google Cloud project we are working on.
    (Make sure the email you are using for Google Analytics is the same email as the project.)
    If you are not familiar with the Google Cloud platform, it offers a quick and easy tool to enable the API.
    If you are familiar with it, go to “APIs & Services” >> “Library” >> “Google Analytics Reporting API”
    and press “Enable”.
    That will redirect you to the API home screen.
    On the left side of the home screen, you will see the “Credentials” tab – press it.
    Press “Create credentials” and choose “Service account”.
    Give it a name and press “Create”.
    Now we have created an account that has access to the GA API.
  2. You should see the account under the “Service Accounts” tab. Press the newly created account,
    scroll down a bit, and under the “Keys” tab press “Create key”, then choose “JSON”.
    After you create the key, it will automatically be downloaded as a JSON file.
    (Make sure you don’t lose it – it’s important for our API call.)
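Before moving on, you can sanity-check the downloaded key. This is a minimal sketch (the field names below are part of the standard service-account key format, but the example values are made up); note that the key's client_email is the account that needs read access to your Analytics view.

```python
import json

def check_key(key_json):
    """Sanity-check a service-account key loaded from the downloaded JSON file."""
    key = json.loads(key_json)
    # Every service-account key identifies itself with type 'service_account'
    # and carries the account's email plus a private key for signing requests.
    assert key.get('type') == 'service_account', 'not a service-account key'
    assert 'private_key' in key, 'key file is missing the private key'
    return key['client_email']

# Example with a stubbed-down key (a real file has more fields):
email = check_key('{"type": "service_account", '
                  '"client_email": "etl@my-project.iam.gserviceaccount.com", '
                  '"private_key": "-----BEGIN PRIVATE KEY-----..."}')
print(email)  # etl@my-project.iam.gserviceaccount.com
```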

Part 2: Python

  1. Take the JSON file we downloaded and copy it into the Python project folder.
  2. Take your view ID from the Google Analytics admin (Admin >> View Settings >> View ID) and save it; we will use it very soon.
  3. In order to make the call, we need to install a few Python packages:
pip install --upgrade google-api-python-client
pip install --upgrade oauth2client
pip install pandas

  4. Copy this code:

import pandas as pd
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import argparse

argparser = argparse.ArgumentParser(add_help=False)
argparser.add_argument('--start_date', type=str,
                       help=('Start date of the requested date range in '
                             'YYYY-MM-DD format.'))
argparser.add_argument('--end_date', type=str,
                       help=('End date of the requested date range in '
                             'YYYY-MM-DD format.'))
args = argparser.parse_args()

data = []
columns = []
start_date = args.start_date
end_date = args.end_date
SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = 'path to the json file'
VIEW_ID = 'paste here the view id number'
body = {
    'reportRequests': [{
        'viewId': VIEW_ID,
        'dateRanges': [{'startDate': start_date, 'endDate': end_date}],
        'metrics': [{'expression': 'your metrics1'}, {'expression': 'your metrics2'}],
        'dimensions': [{'name': 'your dimensions1'}, {'name': 'your dimensions2'}]
    }]
}
metrics_len = len(body.get('reportRequests')[0].get('metrics'))
dimensions_len = len(body.get('reportRequests')[0].get('dimensions'))


def initialize_analyticsreporting():
    """Initializes an Analytics Reporting API V4 service object.

    Returns:
      An authorized Analytics Reporting API V4 service object.
    """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)

    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)

    return analytics


def get_report(analytics):
    """Queries the Analytics Reporting API V4.

    Args:
      analytics: An authorized Analytics Reporting API V4 service object.
    Returns:
      The Analytics Reporting API V4 response.
    """
    return analytics.reports().batchGet(body=body).execute()


def data_extract(response):
    """Fills the global `columns` and `data` lists from the API response."""
    for report in response.get('reports'):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('dimensions', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries')
        # Column names: dimension headers first, then metric names.
        for number in range(dimensions_len):
            columns.append(dimensionHeaders[number])
        for number in range(metrics_len):
            columns.append(metricHeaders[number].get('name'))
    for report in response.get('reports'):
        for rows in report.get('data').get('rows'):
            for dimensions, metrics in zip([rows.get('dimensions')], rows.get('metrics')):
                temp = []
                # One output row: dimension values followed by metric values.
                for number in range(dimensions_len):
                    temp.append(dimensions[number])
                for number in range(metrics_len):
                    temp.append(metrics.get('values')[number])
                data.append(temp)


def create_csv_file(data, columns):
    df = pd.DataFrame(data=data, columns=columns)
    df.to_csv(start_date + '_' + end_date + ".csv", index=False)


def main():
    analytics = initialize_analyticsreporting()
    response = get_report(analytics)
    data_extract(response)
    create_csv_file(data, columns)


if __name__ == '__main__':
    main()

  5. Edit the placeholder values (the key file path, the view ID, and the metrics and dimensions) with your own options.
To find the available metrics and dimensions, use the Dimensions & Metrics Explorer in the official documentation.

  • Example for dimension and metrics:
'metrics': [{'expression': 'ga:pageviews'}, {'expression': 'ga:sessionDuration'},
               {'expression': 'ga:users'}, {'expression': 'ga:sessions'}, {'expression': 'ga:bounceRate'}],
   'dimensions': [{'name': 'ga:date'}, {'name': 'ga:pagePath'}, {'name': 'ga:source'}, {'name': 'ga:country'}]
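To see how such a report maps to CSV rows, here is a small mocked response in the Reporting API V4 shape (the field names match the real API; the values are made up), parsed the same way the script's data_extract does:

```python
# A minimal mocked Reporting API V4 response: two dimensions, one metric.
response = {
    'reports': [{
        'columnHeader': {
            'dimensions': ['ga:date', 'ga:country'],
            'metricHeader': {'metricHeaderEntries': [{'name': 'ga:users'}]},
        },
        'data': {'rows': [
            {'dimensions': ['20200808', 'Israel'],
             'metrics': [{'values': ['42']}]},
        ]},
    }]
}

report = response['reports'][0]
header = report['columnHeader']
# Column names: dimension headers first, then metric names.
columns = (header['dimensions'] +
           [m['name'] for m in header['metricHeader']['metricHeaderEntries']])
# Each row: dimension values followed by the metric values of the
# first (and only) date range.
rows = [r['dimensions'] + r['metrics'][0]['values']
        for r in report['data']['rows']]
print(columns)  # ['ga:date', 'ga:country', 'ga:users']
print(rows)     # [['20200808', 'Israel', '42']]
```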

Part 3: Test yourself

Lastly, run the Python file with the start and end date arguments, like this (your_script.py stands for whatever you named the file):

python your_script.py --start_date 2020-08-08 --end_date 2020-08-08

All done!

Check your project folder – you should see a CSV file created by the script.
The general idea behind this Python script is to combine it with Airflow and send
dynamic dates to the script. (This is why I use argparse.)
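For example, a scheduler task could compute the run date and invoke the script with it. The sketch below only builds the command line (ga_etl.py is a hypothetical name for the script above; in a real Airflow DAG you would pass the templated execution date instead of a hard-coded one):

```python
import datetime

def build_etl_command(run_date):
    """Build the command line for a one-day GA extract on run_date."""
    day = run_date.strftime('%Y-%m-%d')
    # Same start and end date: one CSV per day, named <day>_<day>.csv.
    return ['python', 'ga_etl.py', '--start_date', day, '--end_date', day]

cmd = build_etl_command(datetime.date(2020, 8, 8))
print(' '.join(cmd))  # python ga_etl.py --start_date 2020-08-08 --end_date 2020-08-08
```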

I put a lot of thought into these blogs, so I can share the information in a clear and useful way.
If you have any comments, thoughts, or questions, or you need someone to consult with,
feel free to contact me via LinkedIn.
