Google Analytics Full ETL
Author: Ariel Yosef 20.8.2020
In this guide I will show you my full ETL for Google Analytics.
First, we will extract our data from the Google Analytics platform via an API request using Python.
In order to make an API request, we first need to enable a few things
(see the official documentation of Google Analytics).
Part 1: Authentication
- We need to enable the API for the Google project we are working on.
(Make sure the email you use for Google Analytics is the same email as the one on the project.)
If you are not familiar with the Google platform, they offer a quick and easy tool to enable the API.
If you are familiar with the Google platform, go to “APIs & Services” >> “Library” >> “Google Analytics Reporting API”
Press “enable”
That will redirect you to the API home screen.
On the left side of the home screen, you will see the “Credentials” tab – press it.
Press “Create credentials” and choose “Service account”.
Give it a name and press “create”.
Now we have created an account that has access to the GA API.
- You should see the account under the “Service Accounts” tab. Press the newly created account,
scroll down a bit, and under the “Keys” tab press “Create key”, then choose “JSON”.
After you create the key, it will automatically be downloaded as a JSON file.
(Make sure you don’t lose it, it’s important for our API call.)
Part 2: Python
- Take the JSON file we downloaded and copy it into the Python project folder.
- Take your “view ID” from here and save it; we will use it very soon.
- In order to make the call, we will need to install a few Python packages:
pip install --upgrade google-api-python-client
pip install --upgrade oauth2client
pip install pandas
- Copy the following code:
import pandas as pd
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import datetime
import argparse
import sys

argparser = argparse.ArgumentParser(add_help=False)
argparser.add_argument('--start_date', type=str,
                       help=('Start date of the requested date range in '
                             'YYYY-MM-DD format.'))
argparser.add_argument('--end_date', type=str,
                       help=('End date of the requested date range in '
                             'YYYY-MM-DD format.'))
args = argparser.parse_args()

data = []
columns = []
start_date = args.start_date
end_date = args.end_date

SCOPES = 'https://www.googleapis.com/auth/analytics.readonly'
KEY_FILE_LOCATION = 'path to the json file'
VIEW_ID = 'paste here the view id number'

body = {
    'reportRequests': [
        {
            'viewId': VIEW_ID,
            'dateRanges': [{'startDate': start_date, 'endDate': end_date}],
            'metrics': [{'expression': 'your metrics1'}, {'expression': 'your metrics2'}],
            'dimensions': [{'name': 'your dimensions1'}, {'name': 'your dimensions2'}]
        }]
}

metrics_len = len(body.get('reportRequests')[0].get('metrics'))
dimensions_len = len(body.get('reportRequests')[0].get('dimensions'))


def initialize_analyticsreporting():
    """Initializes an Analytics Reporting API V4 service object.

    Returns:
        An authorized Analytics Reporting API V4 service object.
    """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)
    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)
    return analytics


def get_report(analytics):
    """Queries the Analytics Reporting API V4.

    Args:
        analytics: An authorized Analytics Reporting API V4 service object.
    Returns:
        The Analytics Reporting API V4 response.
    """
    return analytics.reports().batchGet(body=body).execute()


def data_extract(response):
    """Flattens the API response into the global `columns` and `data` lists."""
    # Build the column headers: dimension names first, then metric names.
    for report in response.get('reports'):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('dimensions', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries')
        for number in range(dimensions_len):
            columns.append(dimensionHeaders[number])
        for number in range(metrics_len):
            columns.append(metricHeaders[number].get('name'))
    # Build the rows: each row pairs its dimension values with the metric
    # values of the (single) requested date range.
    for report in response.get('reports'):
        for rows in report.get('data').get('rows'):
            for dimensions, metrics in zip([rows.get('dimensions')], rows.get('metrics')):
                temp = []
                for number in range(dimensions_len):
                    temp.append(dimensions[number])
                for number in range(metrics_len):
                    temp.append(metrics.get('values')[number])
                data.append(temp)


def create_csv_file(data, columns):
    df = pd.DataFrame(data=data, columns=columns)
    df.to_csv(start_date + '_' + end_date + ".csv", index=False)


def main():
    analytics = initialize_analyticsreporting()
    response = get_report(analytics)
    data_extract(response)
    create_csv_file(data, columns)


if __name__ == '__main__':
    main()
- Edit the placeholder values (the key file path, view ID, metrics, and dimensions) with your own options.
To find the available metrics and dimensions, go here.
- Example for dimensions and metrics:
'metrics': [{'expression': 'ga:pageviews'}, {'expression': 'ga:sessionDuration'},
            {'expression': 'ga:users'}, {'expression': 'ga:sessions'}, {'expression': 'ga:bounceRate'}],
'dimensions': [{'name': 'ga:date'}, {'name': 'ga:pagePath'}, {'name': 'ga:source'}, {'name': 'ga:country'}]
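Plugged into the script, the request body from the code above would then look like this (a sketch with the example values filled in; VIEW_ID, start_date and end_date stay as defined earlier):

body = {
    'reportRequests': [
        {
            'viewId': VIEW_ID,
            'dateRanges': [{'startDate': start_date, 'endDate': end_date}],
            'metrics': [{'expression': 'ga:pageviews'}, {'expression': 'ga:sessionDuration'},
                        {'expression': 'ga:users'}, {'expression': 'ga:sessions'},
                        {'expression': 'ga:bounceRate'}],
            'dimensions': [{'name': 'ga:date'}, {'name': 'ga:pagePath'},
                           {'name': 'ga:source'}, {'name': 'ga:country'}]
        }]
}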
Part 3: Test yourself
Lastly, run the Python file with the start and end date arguments, like this:
python your_python_file_name.py --start_date 2020-08-08 --end_date 2020-08-08
All done!
Check your project folder; you should see a CSV file created by the script.
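For reference, the script builds the header from the dimension names followed by the metric names, so with the example metrics and dimensions above the first line of the CSV should look roughly like:
ga:date,ga:pagePath,ga:source,ga:country,ga:pageviews,ga:sessionDuration,ga:users,ga:sessions,ga:bounceRate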
The general idea behind this Python script is to combine it with Airflow and send
dynamic dates to the script (this is why I use argparse), as sketched below.
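As an illustration, here is a minimal Airflow DAG sketch (not part of the original script). The DAG id, schedule, and script path are assumptions on my part; Airflow's built-in {{ ds }} macro passes each run's execution date as both the start and end date:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical DAG: runs the extraction script once a day for that day's date.
with DAG(dag_id='ga_daily_extract',
         start_date=datetime(2020, 8, 1),
         schedule_interval='@daily',
         catchup=False) as dag:
    extract_ga = BashOperator(
        task_id='extract_ga_report',
        # {{ ds }} is rendered by Airflow as the execution date in YYYY-MM-DD format.
        bash_command=('python /path/to/your_python_file_name.py '
                      '--start_date {{ ds }} --end_date {{ ds }}'))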
——————————————————————————————————————————
I put a lot of thought into these blog posts, so I can share the information in a clear and useful way.
If you have any comments, thoughts, or questions, or you need someone to consult with,
feel free to contact me via LinkedIn: