Whether you are using Google’s Cloud Composer or you want to experiment with a self-managed deployment, there are many reasons to start using Airflow:
- First and foremost – full orchestration visibility and management.
- Cloud vendor agnostic – like any other open source technology, you can use it in any cloud.
- A large community.
- A massive number of built-in connectors.
Reasons to use a DIY Airflow cluster:
- High customization options, such as choosing between the several types of Executors.
- Cost control – a GCP Composer environment starts with a minimum of 3 nodes, about $300 monthly, compared with a DIY cluster that starts at about $5 monthly for a Sequential Executor Airflow server, or about $40 for a Local Executor Airflow cluster backed by Cloud SQL for MySQL (with 1 CPU and 4 GB RAM).
- Our motto in the Big Data Demystified community is: Faster, Cheaper, Simpler. It is easier to connect to the data in a DIY cluster in order to build custom analytical reports, such as cost per DAG (if you can’t wait for the blog about this, start by reading about cost per query per user in BigQuery).
Getting started with Airflow? Want to learn Airflow by example?
Start playing with GCP Composer to get to know the basic functionality of Airflow.
We already have a good Airflow blog for you, with examples and slides.
How to install a Sequential Executor Airflow server (standalone, no concurrency)?
We already have a basic Airflow installation blog for you.
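As a rough sketch of what that installation boils down to (the version pin, home directory, and port are assumptions based on the Airflow 1.10-era CLI – see the installation blog for the full walkthrough):

```shell
# Sketch: standalone Airflow with the default SequentialExecutor.
# AIRFLOW_HOME and the version pin are assumptions -- adjust to your environment.
export AIRFLOW_HOME=~/airflow
pip install "apache-airflow==1.10.15"
airflow initdb                 # creates a local SQLite metadata DB (SequentialExecutor default)
airflow webserver -p 8080 -D   # UI on port 8080, daemonized
airflow scheduler -D
```

With SQLite as the metadata DB, only one task runs at a time – which is exactly why it fits on a $5/month machine.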
Installing a DIY Airflow cluster in LocalExecutor mode?
Tips for a DIY cluster:
- My first tip would be RTFM… read the Airflow docs.
- Generally speaking – get yourself very familiar with airflow.cfg; if you get lost in the documentation, here is a working example of an airflow.cfg configuration file.
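For orientation, the handful of settings you will touch most often look roughly like this (the values shown are illustrative assumptions, not the linked working file):

```ini
# Illustrative airflow.cfg excerpt -- values are assumptions, adjust to your cluster.
[core]
dags_folder = /home/airflow/dags
executor = LocalExecutor
parallelism = 32
load_examples = False

[webserver]
base_url = http://localhost:8080
web_server_port = 8080
```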
There are several things you need to consider when deploying a DIY Airflow cluster:
- You must have a backend database for concurrency in LocalExecutor mode.
- The backend DB in GCP should be Cloud SQL (a cheap instance will do), and you can then connect Cloud SQL to BigQuery via federated queries. In AWS, consider AWS Aurora for maximum billing flexibility. To configure Cloud SQL, read our blog.
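Assuming a Cloud SQL for MySQL instance reachable from the Airflow machine (the host, user, and password below are placeholders), the relevant airflow.cfg change is a single connection string:

```ini
# Point Airflow's metadata DB at Cloud SQL for MySQL -- host and credentials are placeholders.
[core]
executor = LocalExecutor
sql_alchemy_conn = mysql://airflow:YOUR_PASSWORD@10.0.0.5:3306/airflow
```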
- If you are using GCP Composer and you want to connect to the backend DB, it is not going to be straightforward: you will need the Cloud SQL Proxy or SQLAlchemy, and federated queries to BigQuery will not be supported.
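A sketch of the proxy route (the instance connection name below is an assumption – look yours up in the Cloud SQL console):

```shell
# Sketch: tunnel to the backing Cloud SQL instance with the Cloud SQL Proxy,
# then connect locally. The instance connection name is a placeholder.
./cloud_sql_proxy -instances=my-project:us-central1:my-airflow-db=tcp:3306 &
mysql -h 127.0.0.1 -P 3306 -u airflow -p
```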
- You are going to need to create a service at the OS level to start/stop Airflow, or a basic start/stop Airflow script if you are lazy like me – just don’t forget to stop/start the FUSE mount from the security-at-rest tip below.
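If you go the lazy-script route, a minimal start/stop wrapper might look like this (the daemon flags and PID-file paths are assumptions based on Airflow 1.10 defaults):

```shell
#!/usr/bin/env bash
# Sketch of a lazy start/stop script -- PID-file paths are assumptions.
AIRFLOW_HOME=${AIRFLOW_HOME:-$HOME/airflow}

start_airflow() {
  # remember to mount your FUSE DAGs/logs folders here too
  airflow webserver -p 8080 -D   # -D daemonizes and drops a PID file in $AIRFLOW_HOME
  airflow scheduler -D
}

stop_airflow() {
  # kill the daemons via their PID files, then unmount FUSE folders
  for pid_file in "$AIRFLOW_HOME"/airflow-webserver.pid "$AIRFLOW_HOME"/airflow-scheduler.pid; do
    [ -f "$pid_file" ] && kill "$(cat "$pid_file")"
  done
}

case "$1" in
  start) start_airflow ;;
  stop)  stop_airflow ;;
  *)     echo "usage: $0 {start|stop}" ;;
esac
```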
- For security at rest and high availability of data, consider using FUSE on top of GCS or AWS S3 for the DAGs and logs folders.
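On GCS, a sketch with gcsfuse (the bucket and mount-point names are assumptions):

```shell
# Sketch: back the DAGs folder with a GCS bucket via gcsfuse.
# Bucket and mount-point names are placeholders -- replace with your own.
gcsfuse my-airflow-bucket ~/airflow/dags

# ...and unmount before stopping the server:
fusermount -u ~/airflow/dags
```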
- For security in motion, consider adding HTTPS and configuring your HTTPS certificate in your Airflow cluster.
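In airflow.cfg this comes down to two settings (the certificate paths are assumptions):

```ini
# Serve the Airflow UI over HTTPS -- certificate paths are placeholders.
[webserver]
web_server_ssl_cert = /etc/ssl/certs/airflow.crt
web_server_ssl_key = /etc/ssl/private/airflow.key
```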
- For security in access, consider using Airflow’s Google (Gmail) authentication or another solution such as LDAP.
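For the Airflow 1.x Google auth backend, the configuration looks roughly like this (the client ID, secret, and domain are placeholders you get from your own Google OAuth credentials):

```ini
# Airflow 1.x Google (Gmail) OAuth login -- client id/secret and domain are placeholders.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.google_auth

[google]
client_id = YOUR_OAUTH_CLIENT_ID
client_secret = YOUR_OAUTH_CLIENT_SECRET
oauth_callback_route = /oauth2callback
domain = example.com
```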
- For basic Airflow monitoring at the ETL level, consider using the Airflow Slack integration as explained in our blog, or email integration. There are more options; we shall cover them in the future.
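As a minimal sketch of the Slack side (the webhook URL and message format are assumptions, not the blog’s own setup), an alert can be posted to a Slack incoming webhook with nothing more than curl:

```shell
# Sketch: post an Airflow alert to a Slack incoming webhook.
# WEBHOOK_URL is a placeholder -- use your own Slack app's webhook URL.
WEBHOOK_URL="https://hooks.slack.com/services/T000/B000/XXXX"

# Build the JSON payload for a DAG status message.
slack_payload() {
  printf '{"text":"Airflow alert: DAG %s is %s"}' "$1" "$2"
}

# Send it (commented out so the sketch is safe to run without a real webhook):
# curl -s -X POST -H 'Content-type: application/json' \
#      --data "$(slack_payload my_dag failed)" "$WEBHOOK_URL"
```

The same payload-building idea carries over to an on-failure callback inside a DAG, which is what the Slack integration blog walks through.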