Google Search Console API via Airflow
Author: Omid Vahdaty 22.3.2020
- Start here, by Google.
- A verified Gmail account in Google Search Console (your email, say).
- Get API credentials using the setup tool from the link above.
Select your GCP project.
MyProjectName
Which API are you using?
Google Search Console API
Where will you be calling the API from?
Application data
Are you planning to use this API with App Engine or Compute Engine?
Yes.
My answers led to the conclusion that I don't need special authentication – just use your own Gmail credentials. I chose an OAuth client ID and created a service account anyway. You need the client ID and client secret for the Python script.
Other UI and user data? –> service account
Options for getting started with the Google Search Console API via Python:
- Notice: if that fails, you can always use the Data Studio connector to Search Console; it works like a charm.
Once you get the list of websites, it means you passed authentication. If the script returns no errors and no websites, try another Gmail user.
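As a quick authentication check, here is a minimal sketch (not the official sample) that uses the OAuth client created above, together with the client_secrets.json described in the steps below, to list your verified sites:

import httplib2
from googleapiclient.discovery import build
from oauth2client import tools
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage

SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'

# Cache the OAuth token locally, similar to the webmasters.dat file the quickstart creates.
storage = Storage('webmasters.dat')
credentials = storage.get()
if credentials is None or credentials.invalid:
    flow = flow_from_clientsecrets('client_secrets.json', scope=SCOPE)
    credentials = tools.run_flow(flow, storage)  # opens a browser window for consent

service = build('webmasters', 'v3', http=credentials.authorize(httplib2.Http()))

# If this prints nothing (and no error), try another Gmail user.
site_list = service.sites().list().execute()
for entry in site_list.get('siteEntry', []):
    print(entry['siteUrl'], entry['permissionLevel'])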
Once you get through the authentication issues, follow the manual below for detailed data report options:
- Install the Python package:
pip install --upgrade google-api-python-client
- Open the samples repository and copy both client_secrets.json and search_analytics_api_sample.py to your local directory.
- Edit client_secrets.json, which you copied earlier, replacing client_id and client_secret with your own values from the previous section (a rough example of the file layout appears after this list).
- Run:
python search_analytics_api_sample.py https://www.example.com 2015-05-01 2015-05-30
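For reference, a client_secrets.json for an OAuth client of the installed-application type typically looks roughly like this (the ID and secret values are placeholders to replace with your own):

{
  "installed": {
    "client_id": "YOUR_CLIENT_ID.apps.googleusercontent.com",
    "client_secret": "YOUR_CLIENT_SECRET",
    "redirect_uris": ["urn:ietf:wg:oauth:2.0:oob", "http://localhost"],
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://accounts.google.com/o/oauth2/token"
  }
}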
Notice: if you used
python webmasters-quickstart.py
then an authentication file was created:
webmasters.dat
Caveats and common errors while connecting to the Google Search Console API:
- If you are still not authorized, try another Gmail user in Chrome incognito mode.
- You may want to consider using the Gmail user with property-owner permissions.
- Pay attention to the --noauth_local_webserver flag in search_analytics_api_sample.py; it might help with authentication errors. Try using a Chrome incognito window and log in with the authorized user for this (an example invocation appears after this list).
- Notice: the URL must be exact, in the form https://www.example.com, or you will get an authentication error.
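For example, appending the flag to the sample invocation from earlier (same placeholder site and dates) should print an authorization URL to open manually instead of launching a local browser:

python search_analytics_api_sample.py https://www.example.com 2015-05-01 2015-05-30 --noauth_local_webserver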
Once you get everything up and running, you are going to need the dimensions and filters documentation:
https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query
Our Python gets metrics for the dimensions (country, device, page, query).
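A minimal sketch of such a query, assuming the authenticated service object from the earlier snippet (the property URL and dates are placeholders):

request = {
    'startDate': '2020-02-01',
    'endDate': '2020-02-02',
    'dimensions': ['country', 'device', 'page', 'query'],
    'rowLimit': 25000,
}
response = service.searchanalytics().query(
    siteUrl='https://www.example.com', body=request).execute()

# Each row carries the dimension values in 'keys' plus the four metrics.
for row in response.get('rows', []):
    print(row['keys'], row['clicks'], row['impressions'], row['ctr'], row['position'])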
Notice: if you are using a domain property name instead of a URL, use the below:
https://support.google.com/webmasters/thread/15194304?hl=en
Sites that are of type "Domain" are now prefixed with "sc-domain:", for example "sc-domain:sitename.com":
python search_analytics_api_sample.py sc-domain:example.com 2020-02-01 2020-02-02
Notice 2: when you want all the data (there is a limit of 25,000 rows per call), see:
https://developers.google.com/webmaster-tools/search-console-api-original/v3/how-tos/all-your-data
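A rough sketch of that loop, assuming the same authenticated service object as above: startRow advances by the number of rows returned, and the loop stops when a call comes back empty.

def fetch_all_rows(service, site_url, start_date, end_date, dimensions):
    """Page through the Search Analytics API until no more rows are returned."""
    all_rows = []
    start_row = 0
    while True:
        request = {
            'startDate': start_date,
            'endDate': end_date,
            'dimensions': dimensions,
            'rowLimit': 25000,   # the API maximum per call
            'startRow': start_row,
        }
        response = service.searchanalytics().query(siteUrl=site_url, body=request).execute()
        rows = response.get('rows', [])
        if not rows:
            break
        all_rows.extend(rows)
        start_row += len(rows)
    return all_rows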
Notice 3: the Search Console API limits the number of queries per day, TBD.
Notice 4: the API keeps data going back 16 months.
Notice 5: use the MozCast report to track changes in the Google algorithm.
Notice 6: the data brought back from the API will be 100% accurate.
Notice 7:
- Make sure to load all rows: loop with startRow and rowLimit (25000) until no more rows are returned (see the pagination sketch above): https://developers.google.com/webmaster-tools/search-console-api-original/v3/how-tos/all-your-data
  Important: identify errors to avoid partial data loading.
- You need to go over each search_type: image / video / web (a sketch follows this list). The search type should be a field in the table schema, but it has to be requested from the API in a separate call per type.
- Pay attention to quota: https://developers.google.com/webmaster-tools/search-console-api-original/v3/limits
- Pay attention to the data aggregation method (page vs. property).
- Initial tables to use:
  * Aggregated data by device and country – to get accurate high-level numbers.
  * Detailed – date, query, page, country, device – to drill down on specific queries/pages (I'm not specifying all fields).
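A sketch of those per-search-type calls (the property, dates, and dimensions are placeholders); each row is tagged with its search type so it can be stored as a column in the table:

for search_type in ['web', 'image', 'video']:
    request = {
        'startDate': '2020-02-01',
        'endDate': '2020-02-02',
        'dimensions': ['date', 'query', 'page', 'country', 'device'],
        'searchType': search_type,
        'rowLimit': 25000,
    }
    response = service.searchanalytics().query(
        siteUrl='sc-domain:example.com', body=request).execute()
    for row in response.get('rows', []):
        row['searchType'] = search_type  # keep the search type as a field in the output schema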
For an explanation of quotas and limits in Search Console, read this blog.
Full Airflow DAG that helps call the Search Console API with concurrency, retries, and quotas (also committed in our git):
import datetime
import os
import logging
from datetime import timedelta, date

from airflow import DAG
from airflow import models
from airflow.contrib.operators import bigquery_to_gcs
from airflow.contrib.operators import gcs_to_bq
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators import gcs_to_gcs
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.utils import trigger_rule
from google.cloud import storage
#from airflow.utils import trigger_rule

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'concurrency': 12,
    'max_active_runs': 2,
    'catchup': False,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}


def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)


### start & end date = delta period.
## -3 days?
delta = -440
start_date = datetime.date.today() + datetime.timedelta(delta)
end_date = datetime.date.today()

bash_run_report_remotly_cmd = 'gcloud beta compute --project gap---all-sites-1245 ssh search-console --internal-ip --zone us-central1-c --command "sudo -u omid python /home/omid/search_analytics_api_sample.py"'

### init variables
bucket_name2 = 'data_lake_ingestion_us'


def get_alphanumeric_task_id(a_string):
    isalnum = a_string.isalnum()
    #print('Is String Alphanumeric :', isalnum)
    alphanumeric_filter = filter(str.isalnum, a_string)
    alphanumeric_string = "".join(alphanumeric_filter)
    #remove / from file path
    return alphanumeric_string.replace("/", "__").replace(".", "_")


with models.DAG(
        'search_console_with_quata',
        # Continue to run DAG once per day
        schedule_interval=None,
        default_args=default_dag_args) as dag:

    #dummy - proceed only if success
    start = DummyOperator(task_id='start')
    wait = DummyOperator(task_id='wait')
    end = DummyOperator(task_id='end')

    # One remote API call per day in the range; each task retries with exponential backoff.
    for single_date in daterange(start_date, end_date):
        temp_date = single_date.strftime("%Y-%m-%d")
        day_after_single_date = single_date + datetime.timedelta(days=1)
        day_after_single_date = day_after_single_date.strftime("%Y-%m-%d")
        ##notice trigger_rule="all_done"
        bash_run_report_remotly_cmd = 'gcloud beta compute --project gap---all-sites-1245 ssh search-console --internal-ip --zone us-central1-c --command "sudo -u omid python /home/omid/search_analytics_api_sample.py sc-domain:investing.com ' + temp_date + " " + day_after_single_date + '"'
        run_report_remotly = BashOperator(
            task_id='run_report_remotly_' + temp_date,
            retries=2,
            retry_delay=datetime.timedelta(minutes=15),
            retry_exponential_backoff=True,
            max_retry_delay=datetime.timedelta(hours=48),
            bash_command=bash_run_report_remotly_cmd,
            trigger_rule="all_done")
        start.set_downstream(run_report_remotly)
        run_report_remotly.set_downstream(wait)

    mv_to_data_lake = BashOperator(
        task_id='mv_to_data_lake',
        bash_command='gcloud beta compute --project gap---all-sites-1245 ssh search-console --internal-ip --zone us-central1-c --command "sudo -u omid gsutil -m mv -r /tmp/search* gs://data_lake_ingestion_us/search_console/"',
        dag=dag)

    load = """bq --location US load --source_format CSV --replace=true --skip_leading_rows 1 --allow_quoted_newlines DATA_LAKE_INGESTION_US.search_console_partition gs://data_lake_ingestion_us/search_console/*"""

    load_to_data_lake = BashOperator(
        task_id='load_to_data_lake',
        bash_command=load,
        dag=dag)

    wait >> mv_to_data_lake >> load_to_data_lake >> end
——————————————————————————————————————————
I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn: