Airflow Performance tuning in 5 min
Author: Omid Vahdaty 23.3.2019
When you are using dynamic operators, the default settings will not work. I suggest the settings below.
Notice this configuration is provided AS IS.
The config file is usually located at ~/airflow/airflow.cfg.
This was tested in a DEV environment only. Be sure you understand what you are doing.
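For context, "dynamic operators" here means DAG files that generate their tasks in a loop at parse time, which makes parsing much slower than a static DAG. A minimal sketch (the DAG id, table list, and task names below are made up for illustration):

# dynamic_dag_example.py -- a hypothetical dynamically generated DAG.
# In a real pipeline the loop often iterates over hundreds of items,
# which is what makes parsing slow and dagbag_import_timeout relevant.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="dynamic_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Hypothetical list; in practice this is often built from a DB query or a file.
tables = ["users", "orders", "payments"]

for table in tables:
    BashOperator(
        task_id="load_{}".format(table),
        bash_command="echo loading {}".format(table),
        dag=dag,
    )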
# The SqlAlchemy pool size is the maximum number of database connections
# in the pool. 0 indicates no limit. Default is 5.
sql_alchemy_pool_size = 0

# max_overflow can be set to -1 to indicate no overflow limit;
# no limit will be placed on the total number of concurrent connections.
sql_alchemy_max_overflow = -1

# The max number of task instances that should run simultaneously
# on this Airflow installation.
parallelism = 64

# The number of task instances allowed to run concurrently by the scheduler.
# I suggest doubling the defaults after installation.
dag_concurrency = 32

# The maximum number of active DAG runs per DAG.
max_active_runs_per_dag = 32

# How long before timing out a python file import.
# Default is 30; I suggest 3000 for dynamic DAG operators.
dagbag_import_timeout = 3000

# How long before timing out a DagFileProcessor, which processes a DAG file.
# Default is 50; I suggest 5000.
dag_file_processor_timeout = 5000

# The scheduler can run multiple threads in parallel to schedule DAGs.
# This defines how many threads will run. My default was 2; I suggest
# 4 times the default.
max_threads = 8

# Consider changing the scheduler health check: if the last scheduler
# heartbeat happened more than scheduler_health_check_threshold ago
# (in seconds), the scheduler is considered unhealthy.
# This is used by the health check in the "/health" endpoint.
scheduler_health_check_threshold = 60

# scheduler_zombie_task_threshold should be higher than
# scheduler_health_check_threshold.
scheduler_zombie_task_threshold = 120
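If you are unsure whether you need the higher import timeouts, you can time how long Airflow takes to parse your DAG folder before touching the config. A quick check (the dag_folder path is an assumption; point it at your own DAGs directory):

# Time a full parse of the DAG folder to see how close you are to
# dagbag_import_timeout / dag_file_processor_timeout.
import time

from airflow.models import DagBag

start = time.time()
dagbag = DagBag(dag_folder="/home/user/airflow/dags", include_examples=False)
print("Parsed {} DAGs in {:.1f}s".format(len(dagbag.dags), time.time() - start))
print("Import errors:", dagbag.import_errors)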
# Task instances listen for an external kill signal (when you clear tasks
# from the CLI or the UI); this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 4

# The scheduler constantly tries to trigger new tasks (look at the
# scheduler section in the docs for more information). This defines
# how often the scheduler should run (in seconds).
scheduler_heartbeat_sec = 4
# Solve Airflow DAGs missing or disappearing from the UI by restarting
# the scheduler every hour.
run_duration = 3600
# Also increase the timeout for fetching logs.
log_fetch_timeout_sec = 30
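After editing airflow.cfg, restart the scheduler and webserver, then verify the values actually took effect; Airflow merges the file with any AIRFLOW__SECTION__KEY environment overrides, so the effective value may differ from what you wrote. A quick sanity check (section names below match Airflow 1.10; run it under the same AIRFLOW_HOME as your services):

# Print the effective values of a few of the tuned settings.
from airflow.configuration import conf

print("parallelism           =", conf.getint("core", "parallelism"))
print("dag_concurrency       =", conf.getint("core", "dag_concurrency"))
print("dagbag_import_timeout =", conf.getint("core", "dagbag_import_timeout"))
print("max_threads           =", conf.getint("scheduler", "max_threads"))
print("job_heartbeat_sec     =", conf.getint("scheduler", "job_heartbeat_sec"))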
More reading:
https://stackoverflow.com/questions/48567906/how-to-increase-tasks-queued-per-second
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
https://www.distributedpython.com/2018/10/26/celery-execution-pool/
https://www.astronomer.io/
——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,
feel free to contact me via LinkedIn: