Airflow Performance tuning in 5 min

Author: Omid Vahdaty, 23.3.2019

When you use dynamic operators, the default settings will not work. I suggest the settings below.

Note that this configuration is provided AS IS.

Usually the config file is located in ~/airflow/airflow.cfg
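
Before editing, it can help to confirm which file is actually in use and what the current values are. The following is a minimal sketch, not part of Airflow itself: the helper names `resolve_config_path` and `read_settings` are hypothetical, and it assumes the `AIRFLOW_CONFIG` and `AIRFLOW_HOME` environment variables override the default location when they are set. It searches every section for each key, so it does not rely on knowing which section a setting lives in.

```python
import configparser
import os
from pathlib import Path


def resolve_config_path(env=None):
    """Guess the airflow.cfg location (hypothetical helper).

    Assumed order: $AIRFLOW_CONFIG, then $AIRFLOW_HOME/airflow.cfg,
    then the usual default ~/airflow/airflow.cfg.
    """
    env = os.environ if env is None else env
    if env.get("AIRFLOW_CONFIG"):
        return Path(env["AIRFLOW_CONFIG"]).expanduser()
    home = env.get("AIRFLOW_HOME", "~/airflow")
    return Path(home).expanduser() / "airflow.cfg"


def read_settings(cfg_path, keys):
    """Return {key: value-or-None}, searching every section of the file."""
    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)  # a missing file is silently ignored
    found = {key: None for key in keys}
    for section in parser.sections():
        for key in keys:
            if parser.has_option(section, key):
                found[key] = parser.get(section, key)
    return found


if __name__ == "__main__":
    path = resolve_config_path()
    print("config file:", path)
    for key, value in read_settings(path, ["parallelism", "dag_concurrency"]).items():
        print(key, "=", value)
```

Values come back as strings, and a key that is not present in the file is reported as None, which usually means the built-in default applies.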

This was tested in a DEV environment only. Be sure you understand what you are doing.

# The SqlAlchemy pool size is the maximum number of database connections
# in the pool. 0 indicates no limit. default is 5
sql_alchemy_pool_size = 0
# max_overflow can be set to -1 to indicate no overflow limit;
# no limit will be placed on the total number of concurrent connections.  	
sql_alchemy_max_overflow = -1
 
# the max number of task instances that should run simultaneously
# on this airflow installation	
parallelism = 64
 
# The number of task instances allowed to run concurrently by the scheduler
# I suggest doubling the defaults after installation.
dag_concurrency = 32
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 32
# How long before timing out a python file import
# default is 30; I suggest 3000 for dynamic DAG operators.
dagbag_import_timeout = 3000
# How long before timing out a DagFileProcessor, which processes a dag file
# default is 50; I suggest 5000
dag_file_processor_timeout = 5000
# The scheduler can run multiple threads in parallel to schedule dags.
# This defines how many threads will run. My default was 2; I suggest 4 times the default.
max_threads = 8
# Consider changing the scheduler's heartbeat settings.
# If the last scheduler heartbeat happened more than scheduler_health_check_threshold
# ago (in seconds), scheduler is considered unhealthy.
# This is used by the health check in the "/health" endpoint
scheduler_health_check_threshold = 60
# scheduler_zombie_task_threshold should be higher than
# scheduler_health_check_threshold
scheduler_zombie_task_threshold = 120 
# Task instances listen for external kill signal (when you clear tasks
# from the CLI or the UI), this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 4

# The scheduler constantly tries to trigger new tasks (look at the
# scheduler section in the docs for more information). This defines
# how often the scheduler should run (in seconds).
scheduler_heartbeat_sec = 4
# solve Airflow missing or disappearing by restarting the scheduler every 1 hour
run_duration = 3600
# also increase the timeout for fetching logs
log_fetch_timeout_sec = 30 
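
The scheduler_health_check_threshold comment above mentions the "/health" endpoint. After changing the heartbeat settings, one way to see whether the scheduler is still reporting in is to read that endpoint and compare the heartbeat age with the threshold. This is a rough sketch under assumptions: the function names are hypothetical, the web server address is whatever you pass in, and the JSON is assumed to contain a "scheduler" object with a "latest_scheduler_heartbeat" ISO timestamp. Check the payload of your own Airflow version before relying on it.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_health(base_url, timeout=5):
    """GET <base_url>/health and return the decoded JSON payload."""
    with urlopen(base_url.rstrip("/") + "/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def heartbeat_age_seconds(payload, now=None):
    """Seconds since the last scheduler heartbeat, or None if none is reported.

    Assumes payload["scheduler"]["latest_scheduler_heartbeat"] is an ISO 8601 string.
    """
    stamp = payload.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if not stamp:
        return None
    beat = datetime.fromisoformat(stamp)
    if beat.tzinfo is None:
        # Treat naive timestamps as UTC (an assumption).
        beat = beat.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - beat).total_seconds()


def scheduler_is_stale(payload, threshold=60, now=None):
    """True when the heartbeat is missing or older than the threshold (seconds)."""
    age = heartbeat_age_seconds(payload, now)
    return age is None or age > threshold
```

For example, `scheduler_is_stale(fetch_health("http://localhost:8080"), threshold=60)` would flag a scheduler that has been silent for more than a minute, assuming the web server listens on that address.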

More reading:

https://stackoverflow.com/questions/48567906/how-to-increase-tasks-queued-per-second
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
https://www.distributedpython.com/2018/10/26/celery-execution-pool/
https://www.astronomer.io/blog/7-common-errors-to-check-when-debugging-airflow-dag
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#log-fetch-timeout-sec

 

——————————————————————————————————————————
I put a lot of thought into these blog posts so that I can share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn:
