
airflow, AWS, Cloud SQL, Data Engineering, GCP

Airflow Demystified | Everything you need to know about installing a DIY LocalExecutor Airflow cluster backed by Cloud SQL for MySQL

Whether you are using Google’s Composer or you want to experiment with Airflow on your own, there are many reasons to start using Airflow:

  1. First and foremost – full orchestration visibility and management.
  2. Second – it is cloud vendor agnostic – like any other open source technology, you can use it on any cloud.
  3. A large community.
  4. A massive number of built-in connectors.

Reasons to use a DIY Airflow cluster:

  1. High customization options, such as choosing between the different executor types.
  2. Cost control – a GCP Composer environment starts with a minimum of 3 nodes – about $300 monthly. Compare that with a DIY cluster – starting at about $5 monthly for a Sequential Executor Airflow server, or about $40 for a LocalExecutor Airflow cluster backed by Cloud SQL for MySQL (with 1 CPU and 4 GB RAM).
  3. Our motto in the Big Data Demystified community is: Faster, Cheaper, Simpler. It is easier to connect to the data in a DIY cluster in order to build custom analytical reports, such as cost per DAG (if you can’t wait for the blog post about this, start by reading about cost per query per user in BigQuery).

Getting started with Airflow? Want to learn Airflow by example?

Start playing with GCP Composer and get to know the basic functionality of Airflow.

We already have a good Airflow blog post for you, with examples and slides.

How to install a Sequential Executor Airflow server (standalone, no concurrency)?

We already have a basic Airflow installation blog post for you.

Installing a DIY Airflow cluster in LocalExecutor mode?

Tips for a DIY cluster

  1. My first tip would be RTFM… read the Airflow docs.
  2. Generally speaking – get yourself very familiar with airflow.cfg. If you get lost in the documentation… here is a working example of an airflow.cfg configuration file.
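
For reference, here is a minimal sketch of the airflow.cfg settings that matter most for a LocalExecutor cluster backed by Cloud SQL for MySQL. The host, credentials and values below are placeholders, not taken from our working configuration file, and in newer Airflow versions some of these keys live under [database] and [logging] instead of [core]:

[core]
# LocalExecutor requires a real database backend (MySQL/PostgreSQL), not SQLite
executor = LocalExecutor
# connection string to the Cloud SQL MySQL instance
# (the driver prefix depends on which MySQL client library you installed)
sql_alchemy_conn = mysql+pymysql://<DB_USER>:<DB_PASSWORD>@<CLOUD_SQL_IP>:3306/airflow
# how many task instances may run in parallel on this machine
parallelism = 8
# skip the example DAGs on a real cluster
load_examples = False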

There are several things you need to consider when deploying a DIY Airflow cluster:

  1. You must have a backend database for concurrency when using LocalExecutor.
  2. The backend DB in GCP should be Cloud SQL (a cheap instance will do), and then you can connect Cloud SQL to BigQuery via federated queries. In AWS, consider using AWS Aurora for maximum flexibility in billing. To configure Cloud SQL, read our blog post.
  3. If you are using GCP Composer and you want to connect to the backend DB, it is not going to be straightforward: you will need the Cloud SQL proxy or SQLAlchemy, and query federation to BigQuery will not be supported.
  4. You are going to need to create an OS-level service to start/stop Airflow, or a basic start/stop Airflow script if you are lazy like me – just don’t forget to stop/start the FUSE mount from item 5 (see the sketch after this list).
  5. For security at rest and high availability of data, consider using FUSE on top of GCS or AWS S3 for the DAGs and logs folders.
  6. For security in motion, consider adding HTTPS and configuring your HTTPS certificate in your Airflow cluster.
  7. For security in access, consider using Airflow Gmail authentication or any other solution such as LDAP.
  8. For basic Airflow monitoring at the ETL level, consider using the Airflow Slack integration as explained in our blog post, or email integration. There are more options; we will cover them in the future.
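
To illustrate item 4, here is a minimal sketch of such start/stop scripts. The paths, port and the use of nohup/pkill are assumptions – adapt them to your own setup – and note how the FUSE mount from item 5 is handled in both:

#!/bin/bash
# start_airflow.sh – a rough sketch, not a hardened OS service
gcsfuse airflow-fuse-bucket ~/airflow/dags            # mount the DAGs bucket first (item 5)
nohup airflow scheduler > ~/airflow/scheduler.log 2>&1 &
nohup airflow webserver -p 8080 > ~/airflow/webserver.log 2>&1 &

#!/bin/bash
# stop_airflow.sh – stop the Airflow processes, then release the FUSE mount
pkill -f "airflow scheduler"
pkill -f "airflow webserver"
fusermount -u ~/airflow/dags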

airflow, Cloud SQL, GCP

How to connect to the GCP Composer Cloud SQL database and migrate data to BigQuery? Is it possible? What are my options?

In this blog post I will connect to the Airflow Cloud SQL database, which is in the tenant project of the Composer environment [1].

You can do that by installing the SQLAlchemy toolkit and following these instructions [2].
Alternatively, to remotely authorize access to your Cloud SQL database from an application, client, or other GCP service, Cloud Composer provides the Cloud SQL proxy in the GKE cluster [3].
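
For illustration, the proxy route looks roughly like the sketch below from a machine with kubectl access to the environment’s GKE cluster. The service name airflow-sqlproxy-service and the credentials are assumptions based on typical Composer environments – verify them with kubectl get svc and reference [3]:

# fetch credentials for the Composer GKE cluster (names are placeholders)
gcloud container clusters get-credentials <COMPOSER_GKE_CLUSTER> --zone <ZONE>
# locate the Cloud SQL proxy service running inside the cluster
kubectl get svc --all-namespaces | grep sqlproxy
# forward the MySQL port locally and connect with any MySQL client
kubectl port-forward service/airflow-sqlproxy-service 3306:3306 &
mysql --host=127.0.0.1 --port=3306 --user=<AIRFLOW_DB_USER> --password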

[1] https://cloud.google.com/composer/docs/concepts/overview#architecture
[2] https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies#installing_sqlalchemy_to_access_the_airflow_database
[3] https://cloud.google.com/sql/docs/mysql/connect-kubernetes-engine

How to migrate data from GCP Composer to BigQuery?

After researching your question, I found this documentation link [1], which tells you how to access the Cloud SQL database of Composer from another product, and this other one [2], which explains how to integrate BigQuery and Cloud SQL.

[1]: https://cloud.google.com/composer/docs/concepts/overview#cloud-sql
[2]: http://googlecloudsql.tobigquery.com/

BigQuery and federated queries against the Composer backend Cloud SQL database?

Using BigQuery federated queries to access the Cloud SQL instance that runs within the GKE cluster of a Composer environment is not possible. The only way to access the SQL instance data is by using the Cloud SQL proxy from the GKE cluster.

Deep Learning, Machine Learning

Is Deep Learning better than Machine Learning?

Here are some reasons to prefer deep learning:

  • Deep learning models reduce the need for feature engineering and data preprocessing. This is particularly true in computer vision and natural language–related domains.
  • Deep learning models are more robust in the presence of noise. They can adapt to unique problems and are less affected by messy data.
  • In some cases, deep learning delivers higher accuracy than other techniques, particularly when data from a variety of sources must be combined to address a problem.
  • Deep learning is powerful when we have access to lots of training data, or when the data has many dimensions or features (where manual feature extraction is time consuming), for example images, video, and audio.

airflow, GCP

Mounting GCS as FUSE for Airflow

Full instructions can be found here; however, this blog post aims to make things simpler 🙂

  • Add the gcsfuse distribution URL as a package source and import its public key:
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
  • Update the list of packages available and install gcsfuse.
sudo apt-get update 
sudo apt-get install gcsfuse 
  • Create a directory:
mkdir -p ~/airflow/dags
  • Create the target bucket in GCS (using the GUI is OK), e.g. bucket name: gs://airflow-fuse-bucket
  • Use Cloud Storage FUSE to mount the bucket:
gcsfuse airflow-fuse-bucket ~/airflow/dags
ls ~/airflow/dags
  • Don’t forget to add the mounts to your automation for starting Airflow.
  • Also don’t forget to change airflow.cfg to point to the correct paths for logs and DAGs.
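
For example, if both folders are mounted under ~/airflow via gcsfuse, the relevant airflow.cfg entries would look roughly like this (the user name is a placeholder, and in newer Airflow versions base_log_folder lives under [logging] rather than [core]):

[core]
dags_folder = /home/<YOUR_USER>/airflow/dags
base_log_folder = /home/<YOUR_USER>/airflow/logs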

Big Query, Cloud SQL, Data Engineering, GCP

How can I connect BigQuery and GCP Cloud SQL? | GCP Cloud SQL federated queries

In order to run a query from GCP BigQuery on a table in GCP Cloud SQL, you need to perform the following steps:

To run a query in BigQuery using data from a database in a Cloud SQL instance, you must first create a connection from BigQuery to the Cloud SQL instance [1].
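
As an illustration, the connection resource can also be created from the command line with something like the sketch below. Project, region, instance, credentials and connection ID are placeholders – double-check the exact flags against [1]:

bq mk --connection \
  --connection_type='CLOUD_SQL' \
  --properties='{"instanceId":"<PROJECT>:<REGION>:<INSTANCE>","database":"<DATABASE>","type":"MYSQL"}' \
  --connection_credential='{"username":"<DB_USER>","password":"<DB_PASSWORD>"}' \
  --project_id=<PROJECT> \
  --location=<LOCATION> \
  <CONNECTION_ID>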

Once the connection to the Cloud SQL instance has been created, you can run queries in BigQuery that retrieve data from a database in the Cloud SQL instance [2].

Bear in mind that the “connection ID” obtained in the first step is what you enter in the query in the second step.

Example of a federated query:

  SELECT * FROM EXTERNAL_QUERY("projects/<YOUR PROJECT ID>/locations/<YOUR LOCATION>/connections/<YOUR CONNECTION ID>", "SELECT * FROM DATASET.TABLE")

Notice that “YOUR CONNECTION ID” is the ID you supplied when creating the external connection in BigQuery.

[1] https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#create_a_connection_resource
[2] https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#federated_query_syntax

Common errors:

Common errors: connections

“Invalid table-valued function EXTERNAL_QUERY Failed to connect to MySQL database. Error: MysqlErrorCode(2013): Lost connection to MySQL server during query at”

Common errors: Regions

The BigQuery dataset and the Cloud SQL instance must be in the same region, or in the same location if the dataset is in a multi-region location such as US or EU. Check that your region is supported.
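
A quick way to compare the two locations from the command line (instance, project and dataset names are placeholders):

# location of the BigQuery dataset
bq show --format=prettyjson <PROJECT>:<DATASET> | grep location
# region of the Cloud SQL instance
gcloud sql instances describe <INSTANCE> --format='value(region)'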

Common errors: Public IP

The Cloud SQL instance must have a public IP.
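
If your instance currently has only a private IP, a public IP can be added with something like the following (the instance name is a placeholder):

gcloud sql instances patch <INSTANCE> --assign-ip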