Why the author of “We’re All Using Airflow Wrong and How to Fix It” needs to attend my meetup!
Author: Omid Vahdaty 11.2.2020
In the article “We’re All Using Airflow Wrong and How to Fix It”, the author claims there are 3 cons/reasons why we are all using Airflow wrong:
- First, because each step of this DAG is a different functional task, each step is created using a different Airflow Operator. Developers must spend time researching, understanding, using, and debugging the Operator they want to use.
- This also means that each time a developer wants to perform a new type of task, they must repeat all of these steps with a new Operator, which sometimes may end up hitting an unexpected permissions error.
- The third problem is that Operators are executed on the Airflow workers themselves. The Airflow Scheduler, which runs on Kubernetes Pod A, will indicate to a Worker, which runs on Kubernetes Pod B, that an Operator is ready to be executed. At that point, the Worker will pick up the Operator and execute the work directly on Pod B. This means that all Python package dependencies from each workflow will need to be installed on each Airflow Worker for Operators to be executed successfully.
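To make those three points concrete, here is a minimal sketch (my own, not taken from the article) of the kind of DAG being described, where each step is a different Operator class. The DAG id, callable, and commands are hypothetical, and the imports are the Airflow 2.x paths (in 1.10.x they live under airflow.operators.bash_operator and airflow.operators.python_operator):

```python
# Minimal sketch (not from the article): each step of the DAG uses a
# different Operator class. DAG id and commands are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # placeholder for a Python transformation step
    print("transforming data")


with DAG(
    dag_id="multi_operator_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull source files'",
    )
    load = PythonOperator(
        task_id="transform_and_load",
        python_callable=transform,
    )
    extract >> load
```

In a real pipeline of the kind the article describes, these steps would more likely be cloud-provider operators (BigQuery, S3, and so on), each with its own package to install and its own permissions to get right, which is exactly where the first two complaints come from.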
Why I think the author is wrong, or at the very least inaccurate:
- The suggested cons and solution are correct for the use case the author mentions, but not all use cases are the same. It seems the author is describing a use case where a cluster running thousands of nightly operators is required.
- Regarding Reason #1, using a short list of Operators from the documentation, together with usage examples, should resolve this. Downloading custom-made operators is not a good idea. Managing packages per node is a non-issue if you are using a single node with the LocalExecutor.
- Regarding Reason #2, yes, using Airflow operators correctly requires a learning curve, but in most use cases you learn an operator once and then use it a lot in your DAGs. A new operator might take you a day to learn, but it will take 10 minutes to implement the next time (see the sketch after this list). By the way, this principle is called a learning curve, and it applies to any new technology or ETL tool.
- There are many ways to install Airflow, one of them being a single node with the LocalExecutor. Some of the POCs and blogs I have published on Airflow clearly show that you can achieve A LOT in terms of performance with one instance, rendering Reason #3 invalid.
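To illustrate the “learn once, use a lot” point, here is a hypothetical sketch: one operator pattern, learned once, stamped out for several similar nightly loads. The table names, the load script path, and the DAG id are all invented for illustration, and the import is again the Airflow 2.x path:

```python
# Hypothetical sketch of reusing one operator pattern across many nightly tasks.
# Table names, script path and DAG id are invented for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_loads",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    previous = None
    for table in ["orders", "customers", "events"]:
        load = BashOperator(
            task_id=f"load_{table}",
            bash_command=f"python /opt/etl/load.py --table {table}",
        )
        if previous is not None:
            previous >> load  # chain the loads so they run one after another
        previous = load
```

With the LocalExecutor, tasks run as subprocesses on the same machine as the scheduler, so a single install of the Python dependencies on that one node covers every DAG, which is why the packaging complaint in Reason #3 does not bite in this setup.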
Don’t get me wrong…
The author has suggested a nice solution, but has failed to address the issue of matching that solution to other use cases, such as small nightly usages of Airflow. Matching the use case to the technical solution is a leading principle in my meetup.
You are more than welcome to review some of my other Airflow use-case and meetup blogs:
——————————————————————————————————————————
I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn:
2 thoughts on “Why the author of “We’re All Using Airflow Wrong and How to Fix It” needs to attend my meetup!”
Hello Omid,
I also thought that the complaint about the learning curve was unjustified. If I understood correctly, the solution proposed in the article does not eliminate it at all. In fact, it enlarges it: since I am no longer using prefabricated Operators, it is up to me to study the API of each subsystem I want to trigger. While it gives complete flexibility, it complicates the simple things.
However, from your answer I do understand that Airflow does not have an elegant solution for isolated environments in terms of packaging. I can totally see how this can become an issue in large deployments.
But, as you said, it is a matter of the use case.
I am working on a POC for an isolated environment; when I nail it, I will publish something.