Airflow guidelines and data pipeline best practices
Author: Omid Vahdaty 30.8.2020
In this blog I am going to share my guidelines for writing good DAGs in Airflow.
I am assuming you are using BigQuery or any other SQL-based data pipeline.
A recurring job is a good job
Each DAG should be written in such a way that if you trigger it twice, the resulting data in the table will be the same.
Meaning, no data will be added twice to the same table, i.e. no duplicate rows will be created in the data. In other words, the DAG should be idempotent.
For example, if you are adding data to BigQuery using the “WRITE_APPEND” disposition,
you are much safer if you first delete the date range you are about to add.
This way, the end result will always be the same, even if the job runs twice by mistake instead of just once.
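Here is a minimal sketch of that delete-before-append pattern. The project, dataset, table and date column names are placeholders, and the import path assumes Airflow 1.10.x with the GCP contrib operators installed, so adjust it to your setup.

```python
from datetime import datetime

from airflow import DAG
# On Airflow 2 / the google provider package this import path is different.
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {"owner": "airflow", "start_date": datetime(2020, 8, 1)}

with DAG("idempotent_daily_load", default_args=default_args,
         schedule_interval="@daily", catchup=False) as dag:

    # Remove any rows for the execution date before re-inserting them,
    # so running the DAG twice leaves the table in the same state.
    delete_partition = BigQueryOperator(
        task_id="delete_partition",
        sql="DELETE FROM `my_project.my_dataset.my_table` "
            "WHERE event_date = '{{ ds }}'",
        use_legacy_sql=False,
    )

    # Append the freshly computed rows for the same date range.
    append_partition = BigQueryOperator(
        task_id="append_partition",
        sql="SELECT * FROM `my_project.my_dataset.staging_table` "
            "WHERE event_date = '{{ ds }}'",
        destination_dataset_table="my_project.my_dataset.my_table",
        write_disposition="WRITE_APPEND",
        use_legacy_sql=False,
    )

    delete_partition >> append_partition
```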
Debuggable job
Each data pipeline should take debugging into account. As opposed to standard development practices with IDEs and debuggers,
data pipelines usually have no debugger, which makes a failed data pipeline much harder to investigate.
Therefore, consider the following:
- Have the temporary files you create during the DAG saved until the next run of the same DAG.
By starting the DAG with a cleanup process, you reserve the option to debug a failure in the data
pipeline using the temporary files (see the sketch after this list).
- Have the queries / commands properly printed in the logs, so you will be able to retrace the steps
in the data pipeline easily from the Airflow GUI and logs.
- As for the temporary files, be sure to keep consistent filenames to avoid
duplications in case of recurring jobs.
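A minimal sketch of these ideas, with a hypothetical bucket, path and query: the DAG starts with a cleanup task, so the previous run's temporary files survive until now, and the rendered query is printed to the logs before anything runs.

```python
import logging
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# A single, consistent temp location (no timestamps in the path), so reruns
# overwrite instead of duplicating files.
TMP_PATH = "gs://my-bucket/tmp/daily_job/"

QUERY = """
SELECT event_date, COUNT(*) AS events
FROM `my_project.my_dataset.my_table`
WHERE event_date = '{{ ds }}'
GROUP BY event_date
"""


def log_query(**context):
    # Print the rendered query so it can be copy-pasted straight from the logs.
    logging.info("Running query:\n%s", context["templates_dict"]["query"])


default_args = {"owner": "airflow", "start_date": datetime(2020, 8, 1)}

with DAG("debuggable_daily_job", default_args=default_args,
         schedule_interval="@daily", catchup=False) as dag:

    # Delete last run's temporary files only at the *start* of this run,
    # keeping them available for debugging between runs.
    cleanup = BashOperator(
        task_id="cleanup_previous_tmp_files",
        bash_command="gsutil -m rm -r {} || true".format(TMP_PATH),
    )

    print_query = PythonOperator(
        task_id="log_query",
        python_callable=log_query,
        provide_context=True,
        templates_dict={"query": QUERY},
    )

    cleanup >> print_query
```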
Write after Delete
When you delete data from a table, you must insert the new data immediately after. Don't put an Airflow DummyOperator (or any other task) between the delete and the insert (write). The reason: your data pipeline may be queued due to lack of resources in your Airflow cluster, leaving the write operator in “Queued” status while it waits for resources to be freed. In this scenario you may end up with missing days in your data for a couple of hours or more.
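Continuing the hypothetical task names from the first sketch, the dependency should be a direct edge between the two:

```python
# Avoid: an extra task between the delete and the write can leave the table
# missing data while the write sits in "Queued".
# delete_partition >> DummyOperator(task_id="wait") >> append_partition

# Preferred: chain the delete directly to the write.
delete_partition >> append_partition
```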
Assume the best, prepare for the worst
Assume your DAG will run daily for a long time, but prepare for the worst by thinking about the impact of a failed operator in your pipeline. By using Airflow trigger rules and the BranchPythonOperator, you can prepare the data pipeline logic for the standard success path of the data transfer, plus a fallback plan in case of failure.
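Here is a hedged sketch of such a fallback, using BranchPythonOperator plus a trigger rule; the decision function and the task names are placeholders for your own logic.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.utils.trigger_rule import TriggerRule


def choose_path(**context):
    # Replace with a real check (file landed, row count > 0, ...).
    data_arrived = True
    return "load_data" if data_arrived else "fallback_alert"


default_args = {"owner": "airflow", "start_date": datetime(2020, 8, 1)}

with DAG("branching_pipeline", default_args=default_args,
         schedule_interval="@daily", catchup=False) as dag:

    branch = BranchPythonOperator(
        task_id="branch",
        python_callable=choose_path,
        provide_context=True,
    )

    # Stand-ins for the real load / fallback tasks.
    load_data = DummyOperator(task_id="load_data")
    fallback_alert = DummyOperator(task_id="fallback_alert")

    # Runs after whichever branch was taken, as long as nothing upstream failed.
    join = DummyOperator(task_id="join", trigger_rule=TriggerRule.NONE_FAILED)

    branch >> [load_data, fallback_alert] >> join
```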
"Create if needed"
This is a BigQuery create disposition flag that allows the Airflow BigQueryOperator to create a table in BigQuery if it does not exist. The problem is that it will not create a partitioned table. Only use this option if you have no other choice. The preferred way is to create the table via Airflow yourself; this way you can customize the partitioning and clustering. Consider using “CREATE_NEVER” instead.
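A minimal sketch of that preferred approach, assuming the Airflow 1.10.x contrib operators and placeholder names: create the partitioned table explicitly in a one-off setup DAG, then have every load use “CREATE_NEVER” so a misconfigured query fails loudly instead of silently creating an unpartitioned table.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import (
    BigQueryCreateEmptyTableOperator,
    BigQueryOperator,
)

default_args = {"owner": "airflow", "start_date": datetime(2020, 8, 1)}

# A manually triggered setup DAG: run it once to create the table.
with DAG("partitioned_table_setup", default_args=default_args,
         schedule_interval=None, catchup=False) as dag:

    # Define the schema and the partitioning yourself.
    # Newer operator versions also accept cluster fields here.
    create_table = BigQueryCreateEmptyTableOperator(
        task_id="create_partitioned_table",
        project_id="my_project",
        dataset_id="my_dataset",
        table_id="my_table",
        schema_fields=[
            {"name": "event_date", "type": "DATE", "mode": "REQUIRED"},
            {"name": "events", "type": "INTEGER", "mode": "NULLABLE"},
        ],
        time_partitioning={"type": "DAY", "field": "event_date"},
    )

    # The recurring load never creates the table itself.
    load = BigQueryOperator(
        task_id="load_table",
        sql="SELECT event_date, COUNT(*) AS events "
            "FROM `my_project.my_dataset.staging_table` GROUP BY event_date",
        destination_dataset_table="my_project.my_dataset.my_table",
        write_disposition="WRITE_APPEND",
        create_disposition="CREATE_NEVER",
        use_legacy_sql=False,
    )

    create_table >> load
```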
"Go back in time"
The ability to “go back in time” is super important. Keep the RAW data protected with “NO DELETE” policies. Consequently, you will be able to re-model the data as you require in the future.
Data Replay
Sometimes you need to re-run your data pipeline all over again against a specific period in the data lake. Airflow suggests its own approach for this, which I personally avoid. Here is what I usually do:
- Have the dates in the “where” clause of the SQL query taken from a parameter. This way you can put the dates parameter in Airflow Variables and simply change it from the GUI.
- If you made your job debuggable, you can easily copy-paste the queries from the log, change the “where” clause to what you need and run them on BigQuery manually.
- Where applicable (e.g. small tables), use BigQuery views instead of “real” tables with tedious data transfers. Alternatively, have the DAG written with a replay variable: if the replay variable is set to true, run a full history rebuild query, else make the query incremental (see the sketch below).
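A hedged sketch of that replay-variable pattern: the replay flag lives in an Airflow Variable, and so does the rebuild start date, so both can be changed from the GUI. The Variable names, tables and query are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.models import Variable

# Read the replay flag at DAG-parse time; flip it in Admin -> Variables.
REPLAY = Variable.get("my_pipeline_replay", default_var="false").lower() == "true"

if REPLAY:
    # Full history rebuild, from a start date that is also kept in a Variable.
    where_clause = "WHERE event_date >= '{{ var.value.my_pipeline_start_date }}'"
    write_disposition = "WRITE_TRUNCATE"
else:
    # Normal incremental run for the execution date.
    where_clause = "WHERE event_date = '{{ ds }}'"
    write_disposition = "WRITE_APPEND"

default_args = {"owner": "airflow", "start_date": datetime(2020, 8, 1)}

with DAG("replayable_pipeline", default_args=default_args,
         schedule_interval="@daily", catchup=False) as dag:

    load = BigQueryOperator(
        task_id="load",
        sql="SELECT * FROM `my_project.my_dataset.staging_table` " + where_clause,
        destination_dataset_table="my_project.my_dataset.my_table",
        write_disposition=write_disposition,
        use_legacy_sql=False,
    )
```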
——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,
feel free to contact me via LinkedIn: