airflow, AWS, Cloud SQL, Data Engineering, GCP

Airflow Demystified | Everything you need to know about installing a DIY LocalExecutor Airflow cluster backed by Cloud SQL (MySQL)

Whether you are using Google's Composer or you want to experiment with Airflow on your own, there are many reasons to start using Airflow:

  1. First and foremost – full orchestration visibility and management.
  2. Cloud vendor agnostic – you can use it in any cloud, like any other open source technology.
  3. A large community.
  4. A massive number of built-in connectors.

Reasons to use a DIY Airflow cluster:

  1. High customization options, such as choosing between the several types of Executors.
  2. Cost control – a GCP Composer environment starts with a minimum of 3 nodes, about 300$ monthly, compared with a DIY cluster that starts at about 5$ monthly for a SequentialExecutor Airflow server, or about 40$ for a LocalExecutor Airflow cluster backed by Cloud SQL for MySQL (with 1 CPU and 4 GB RAM).
  3. Our motto in the Big Data Demystified community is: Faster, Cheaper, Simpler. It is also easier to connect to the data in a DIY cluster in order to perform custom analytical reports, such as cost per DAG (until that blog is published, start by reading about cost per query per user in BigQuery).

Getting started with Airflow? Want to learn Airflow by example?

Start by playing with GCP Composer and get to know the basic functionality of Airflow.

We already have a good Airflow blog for you, with examples and slides.

How to install a SequentialExecutor Airflow server (standalone, no concurrency)?

We already have a basic Airflow installation blog for you.

Installing a DIY Airflow cluster in LocalExecutor mode?

Tips for a DIY cluster

  1. My first tip would be RTFM… read the Airflow docs.
  2. Generally speaking, get yourself very familiar with airflow.cfg. If you get lost in the documentation, here is a working example of an airflow.cfg configuration file.

There are several things you need to consider when deploying a DIY Airflow cluster:

  1. You must have a backend DB for concurrency in LocalExecutor mode.
  2. In GCP, the backend DB should be Cloud SQL (a cheap instance will do), and you can then connect Cloud SQL to BigQuery via federated queries. In AWS, consider using Aurora for maximum billing flexibility. To configure Cloud SQL, read our blog.
  3. If you are using GCP Composer and you want to connect to its backend DB, it is not going to be straightforward: you will need the Cloud SQL Proxy or SQLAlchemy, and federated queries to BigQuery will not be supported.
  4. You are going to need an OS-level service to start/stop Airflow, or a basic start/stop script if you are lazy like me. Just don't forget to stop/start the FUSE mount from item 5.
  5. For security at rest and high availability of data, consider using FUSE on top of GCS or AWS S3 for the DAGs and logs folders.
  6. For security in motion, consider adding HTTPS and configuring an HTTPS certificate on your Airflow cluster.
  7. For access security, consider using Airflow Gmail authentication or another solution such as LDAP.
  8. For basic Airflow monitoring at the ETL level, consider the Airflow Slack integration as explained in our blog, or email integration (a minimal email example is sketched right after this list). There are more options; we shall cover them in the future.
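
As referenced in item 8, the email route only needs default_args on the DAG. Here is a minimal sketch, assuming Airflow 1.10.x with SMTP already configured under [smtp] in airflow.cfg; the address, DAG and task names below are placeholders, not code from this blog:

# Minimal email-on-failure sketch (placeholder names; requires [smtp] configured in airflow.cfg)
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "email": ["alerts@example.com"],   # placeholder address
    "email_on_failure": True,
    "email_on_retry": False,
}

dag = DAG(
    dag_id="monitored_example",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

work = BashOperator(task_id="do_work", bash_command="echo hello", dag=dag)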

Some more of our blogs about Airflow

Another installation manual for SequentialExecutor:

Airflow SequentialExecutor Installation manual and basic commands

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, AWS, GCP, Security

Airflow – setup of SSL Certificate – HTTPS example

In this example we are going to use self-signed certificates created via the commands below. It turns out the original Apache Airflow instructions assume prior knowledge:

https://airflow.apache.org/security.html#ssl

To create the self-signed certificates, run the following commands and follow the prompts that each one prints:

# Generate a 2048-bit RSA private key
openssl genrsa -out private.pem 2048
# Create a self-signed certificate valid for 3 years (you will be prompted for the certificate fields)
openssl req -new -x509 -key private.pem -out cacert.pem -days 1095


In your airflow.cfg under [webserver] change the following keys:

web_server_ssl_cert = path/to/cacert.pem
web_server_ssl_key = path/to/private.pem

This should now let you browse Airflow via https://localhost:8080.
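
If you prefer to check from a script rather than the browser, here is a minimal sketch, assuming the requests package is installed and the webserver was restarted after editing airflow.cfg:

# Quick HTTPS sanity check against the Airflow webserver (assumes: pip3 install requests)
import requests

# Verify against the self-signed certificate created above; if verification fails
# (e.g. the certificate's common name does not match "localhost"), use verify=False
# for a quick local test only.
resp = requests.get("https://localhost:8080", verify="path/to/cacert.pem")
print(resp.status_code)  # expect 200 (the Airflow UI)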

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, AWS, Cloud SQL, Data Engineering, GCP

Airflow MySQL Integration – how to?

The default installation of Airflow comes with SQLite as the backend. This mode does not allow concurrency in your DAGs.

In this blog we will upgrade a vanilla Airflow installation to work with LocalExecutor and GCP Cloud SQL (MySQL).

  1. Stop Airflow and change the Airflow configuration file (airflow.cfg) so that the executor is "LocalExecutor". Note: SequentialExecutor is the default. 🙂
# Stop the server: get the PID of the Airflow processes
ps -eaf | grep airflow
# Kill the process
kill -9 {PID}

# Then, in airflow.cfg - the executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, KubernetesExecutor
executor = LocalExecutor

2. If you are using the cloud, set up AWS RDS Aurora (MySQL) or GCP Cloud SQL with minimal resources. Point SQLAlchemy to MySQL (if using MySQL):

sql_alchemy_conn = mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
  • Before you continue, confirm the security group / firewall rules allow incoming network access to your DB from the Airflow machine (and from your MySQL client).
  • Make sure a user/password is created to match your needs, and also set the password for the root user.
  • Notice the user/password are in plain text; this is merely a temporary workaround.

3. Set up MySQL (if using MySQL) by creating a new database called airflow and granting the above user permissions:

CREATE DATABASE airflow CHARACTER SET utf8 COLLATE utf8_unicode_ci;
CREATE USER 'airflow'@'34.68.35.27' IDENTIFIED BY 'airflow';
GRANT ALL ON airflow.* TO 'airflow'@'34.68.35.27' IDENTIFIED BY 'airflow';

If you are using GCP Cloud SQL, you can use the GUI to create the DB. You can also do it from the Airflow machine by installing a MySQL client (assuming Ubuntu 18.04). Note: I recommend using both ways to connect to the MySQL DB, as one is easy and the other tests network access from the Airflow cluster to Cloud SQL.

sudo apt install -y mysql-client-core-5.7

mysql -u MyUser -h 35.192.167.163 -pMyPassword

If you are using GCP, you can also connect via Cloud Shell with the following CLI command (remember to set the password for the root user):

gcloud sql connect airflow --user=root --quiet
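
Before moving on, you can optionally confirm that the sql_alchemy_conn string from step 2 works from the Airflow machine itself. A minimal sketch, assuming SQLAlchemy and a MySQL driver (e.g. mysqlclient) are installed and using the example host/credentials from this post:

# Optional connectivity check for the Airflow backend DB (run on the Airflow machine)
# Assumes: pip3 install sqlalchemy mysqlclient (or another MySQL driver)
from sqlalchemy import create_engine, text

# Same URI as sql_alchemy_conn in airflow.cfg (example values from this post)
engine = create_engine("mysql://airflow:airflow@35.192.167.163:3306/airflow")

with engine.connect() as conn:
    # If this prints a version string, both network access and credentials are fine
    print(conn.execute(text("SELECT VERSION()")).scalar())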

4. Initialize the Airflow tables via:

airflow initdb

# Confirm there are no errors. If you are getting an "explicit_defaults_for_timestamp"
# error, see the end of this post for a suggested solution.

5. Start Airflow. Under LocalExecutor you need both the webserver and the scheduler; run each as a service or in its own terminal/screen session:

airflow webserver -p 8080
airflow scheduler

  • Can't connect to MySQL / unable to connect to MySQL – check that network access is allowed and check the user/password. Test the user/password via Cloud Shell (this bypasses network problems), then test the network via the MySQL CLI once the user/password are confirmed.
  • The global variable explicit_defaults_for_timestamp needs to be ON (1) for MySQL. To set it in Cloud SQL (a verification sketch follows after this list):
  1. In the Google Cloud Platform Console, create a new GCP Console project, or open an existing project by selecting the project name. …see naming guidelines
  2. Open the instance and click Edit.
  3. Scroll down to the Flags section.
  4. To set a flag that has not been set on the instance before, click Add item, choose the flag “explicit_defaults_for_timestamp” from the drop-down menu, and set its value.
  5. Click Save to save your changes.
  6. Confirm your changes under Flags on the Overview page.
  • Notice: with the new backend DB, all DAGs will be turned off (paused) and will need to be switched on again.
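
Once the explicit_defaults_for_timestamp flag is set, you can verify it took effect before re-running airflow initdb, reusing the SQLAlchemy connection from the sanity check in step 3 (same assumptions and example values):

# Verify the Cloud SQL flag before re-running `airflow initdb` (example connection values)
from sqlalchemy import create_engine, text

engine = create_engine("mysql://airflow:airflow@35.192.167.163:3306/airflow")
with engine.connect() as conn:
    row = conn.execute(
        text("SHOW GLOBAL VARIABLES LIKE 'explicit_defaults_for_timestamp'")
    ).fetchone()
    print(row)  # expect ('explicit_defaults_for_timestamp', 'ON')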

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

airflow, AWS, GCP

Airflow and Slack Integration

In this blog I will show you by example how to send notifications to Slack on Airflow job failures.

  1. Get an API token from Slack. Create it via this link: https://api.slack.com/custom-integrations/legacy-tokens. Notice this method is deprecated 😦
  2. Install the Slack packages on your Airflow server:

pip3 install 'apache-airflow[slack]'

3. Simple code examples are committed in our GitHub (one for a successful event and one for a failure):

Slack Airflow integration code example for successful job

Slack Airflow integration code example for failed job
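
The GitHub examples above hold the full DAGs. As a rough idea of what the failure case looks like, here is a minimal sketch (not the exact code from the repository), assuming Airflow 1.10.x, the legacy token from step 1, and a hypothetical #airflow-alerts channel:

# Minimal Slack failure-notification sketch for Airflow 1.10.x (placeholder token/channel)
from airflow.operators.slack_operator import SlackAPIPostOperator

SLACK_TOKEN = "xoxp-your-legacy-token"   # legacy token from step 1 (placeholder)
SLACK_CHANNEL = "#airflow-alerts"        # hypothetical channel name

def slack_failure_callback(context):
    # Posts the failed task's details to Slack; attach via on_failure_callback
    ti = context["task_instance"]
    message = "Airflow task failed: DAG={}, task={}, execution_date={}".format(
        ti.dag_id, ti.task_id, context["ds"]
    )
    return SlackAPIPostOperator(
        task_id="slack_failure_alert",
        token=SLACK_TOKEN,
        channel=SLACK_CHANNEL,
        username="airflow",
        text=message,
    ).execute(context=context)

# Usage: pass it in your DAG's default_args, e.g.
# default_args = {"on_failure_callback": slack_failure_callback, ...}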

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

architecture, AWS, AWS athena, AWS EMR, Cloud, Data Engineering, Spark

Big Data in 200KM/h | Big Data Demystified

What we’re about

A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great to master by one person. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…

Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.

How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC?
Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
How to handle streaming?
How to manage costs?
Performance tips?
Security tips?
Cloud best practices tips?

Some of our online materials:

Website:

https://big-data-demystified.ninja/

Youtube channels:

https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber

Meetup:

https://www.meetup.com/AWS-Big-Data-Demystified/

https://www.meetup.com/Big-Data-Demystified

Facebook Group:

https://www.facebook.com/groups/amazon.aws.big.data.demystified/

Facebook page:

https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/

Audience:

Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D


——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/