Airflow is a powerful tool for managing and scheduling data pipelines, but to get the most out of it, it’s important to follow best practices when using the platform. In this article, we’ll discuss some of the best practices for using Airflow to ensure that your data pipelines are efficient, reliable, and easy to maintain.
- First and foremost, it’s important to minimize the use of the PythonOperator. While it can be useful in some cases, most data pipeline tasks can be accomplished with the built-in operators that ship with Airflow. This will help keep your codebase clean and easy to understand.
- Another important best practice is to avoid using pandas for data transformations. Instead, push the transformation down to BigQuery or a similar engine, which is specifically designed for this purpose and can handle large datasets far more efficiently (see the first sketch after this list).
- When running heavy calculations that you can’t do with the built-in operators, don’t use the PythonOperator or BashOperator. Instead, use the KubernetesPodOperator, which allows you to run calculations in a containerised environment (sketched after this list). This will ensure that your calculations are performed in a consistent and isolated environment, which can improve the performance and scalability of your pipeline.
- To make your code more reusable, wrap common processes in functions. This will allow you to easily reuse these processes in future DAGs, which can save you time and make your pipeline more maintainable (see the sketch after this list).
- Consistency is key when naming your DAGs. Use the same name for the DAG and for the file that contains the DAG definition. This will make it easier to find and understand your DAGs. Hint: you can write a small helper that derives the name of the DAG from __file__ (sketched after this list).
- Don’t define SQL queries in the DAG file. Instead, keep queries in separate files so that they can be easily reused and updated. This will make it easier to test and maintain your queries, which can improve the reliability of your pipeline.
- When working with date and time, don’t use the SQL NOW() function or Python’s now() function. Instead, use Airflow’s execution_date variable, the logical date of the DAG run, which stays the same across retries and backfills and keeps your pipeline idempotent.
- Try to parametrize your queries as much as possible. This will make it easier to update and test your queries without having to modify the DAG file, which can improve the maintainability of your pipeline. The points about separate query files, execution_date and parameters are combined in one sketch after this list.
- Treat your DAG file as a configuration definition. This will make it easier to understand what your pipeline is doing and how it’s configured, which can improve its maintainability.
- For better monitoring, set a Service Level Agreement (SLA) for your DAGs. A good starting point is roughly 1.5 × the average duration; then increase or decrease it as you observe real runs (sketched after this list). This will allow you to keep track of your pipeline’s performance and quickly identify any issues that arise.
- Avoid building your own Domain-Specific Language (DSL) in YAML or JSON on top of Airflow DAGs; the DAG file is already a declaration of the process. An additional layer of abstraction only makes it harder to support in the future.
- Try to avoid cross-dependencies between DAGs. Once the number of pipelines grows, such dependencies become very hard to maintain.
- Try to avoid the “Monster DAG”. This usually happens when you have similar pipelines that run in parallel inside one single DAG. For example, you transform several tables and simply loop over the list of tables, generating a pipeline for each table inside the same DAG. It’s better to group such pipelines logically and split them into several DAGs (sketched after this list). This will make your DAGs easier to tweak and monitor, which improves their maintainability.
- Add versioning to your DAGs. The synchronisation process with Airflow can take some time, so it’s useful to have a version marker that tells you whether your DAGs are in sync. I personally add a dummy DAG with the git short SHA as a suffix in its name, for example: version-fa45b25 (see the last sketch after this list).
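To make the tips above more concrete, here are a few short sketches. First, the points about built-in operators and avoiding pandas: a minimal sketch that pushes the transformation down to BigQuery with the built-in BigQueryInsertJobOperator instead of a PythonOperator running pandas. The project, dataset and table names are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="orders_aggregation",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The transformation runs inside BigQuery, not in the worker's memory with pandas.
    aggregate_orders = BigQueryInsertJobOperator(
        task_id="aggregate_orders",
        configuration={
            "query": {
                "query": (
                    "SELECT customer_id, SUM(amount) AS total "
                    "FROM `my-project.shop.orders` "  # placeholder table
                    "GROUP BY customer_id"
                ),
                "useLegacySql": False,
            }
        },
    )
```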
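For the heavy-calculation case, a minimal sketch of the KubernetesPodOperator; the image, namespace and arguments are assumptions for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="heavy_calculations",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The calculation runs in its own container, isolated from the Airflow workers.
    crunch_numbers = KubernetesPodOperator(
        task_id="crunch_numbers",
        name="crunch-numbers",
        namespace="airflow",                        # assumed namespace
        image="my-registry/crunch-numbers:latest",  # hypothetical image
        arguments=["--date", "{{ ds }}"],
        get_logs=True,
    )
```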
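For reusable processes, a sketch of a small factory function that can be called from any DAG; the stored procedure it invokes is hypothetical.

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

def load_table(table: str) -> BigQueryInsertJobOperator:
    """Reusable building block: the same load step for any table, in any DAG."""
    return BigQueryInsertJobOperator(
        task_id=f"load_{table}",
        configuration={
            "query": {
                "query": f"CALL `my-project.etl.load_{table}`()",  # hypothetical procedure
                "useLegacySql": False,
            }
        },
    )

# Inside any DAG definition:
#     load_table("orders") >> load_table("customers")
```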
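For consistent DAG naming, the helper hinted at above can be as small as this:

```python
from pathlib import Path

def dag_name_from_file(file: str) -> str:
    """Derive the DAG id from the file name, e.g. orders_daily.py -> orders_daily."""
    return Path(file).stem

# In the DAG file, the id then always matches the file name:
#     with DAG(dag_id=dag_name_from_file(__file__), ...) as dag:
```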
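The points about separate query files, execution_date and parametrisation fit naturally into one sketch. The file layout, project and table names are assumptions; the templated values are rendered by Airflow because the operator’s configuration field is templated.

```python
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# sql/daily_revenue.sql lives next to the DAG file (hypothetical content):
#     SELECT order_date, SUM(amount) AS revenue
#     FROM `my-project.shop.{{ params.table }}`
#     WHERE order_date = '{{ ds }}'   -- the logical date, not NOW()
#     GROUP BY order_date
QUERY = (Path(__file__).parent / "sql" / "daily_revenue.sql").read_text()

with DAG(
    dag_id="daily_revenue",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    daily_revenue = BigQueryInsertJobOperator(
        task_id="daily_revenue",
        configuration={"query": {"query": QUERY, "useLegacySql": False}},
        params={"table": "orders"},  # change the table without touching the SQL
    )
```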
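Setting an SLA is a one-liner in default_args. In this sketch the assumed average run time is about 20 minutes, so the SLA starts at 30 minutes (1.5 × avg).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="orders_with_sla",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Assumed average run of ~20 minutes: start with 1.5 x avg = 30 minutes.
    default_args={"sla": timedelta(minutes=30)},
) as dag:
    load_orders = EmptyOperator(task_id="load_orders")  # placeholder task
```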
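Instead of one Monster DAG looping over every table, here is a sketch that generates one DAG per logical group; the groups and tables are illustrative, and the tasks are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# One DAG per logical group instead of a single DAG covering every table.
TABLE_GROUPS = {
    "sales": ["orders", "payments"],
    "crm": ["customers", "contacts"],
}

for group, tables in TABLE_GROUPS.items():
    with DAG(
        dag_id=f"transform_{group}",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        for table in tables:
            EmptyOperator(task_id=f"transform_{table}")  # placeholder transform
    # Expose each generated DAG at module level so the scheduler picks it up.
    globals()[dag.dag_id] = dag
```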
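Finally, the version marker: a dummy DAG whose name carries the git short SHA. How the SHA gets injected (CI substitution, an environment variable) depends on your deployment; here it is hard-coded for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

GIT_SHORT_SHA = "fa45b25"  # assumed to be injected at deploy time, e.g. by CI

with DAG(
    dag_id=f"version-{GIT_SHORT_SHA}",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # never scheduled; it only marks the deployed version
    catchup=False,
) as dag:
    EmptyOperator(task_id="noop")
```

When this DAG shows up in the UI, you know the deployed code is the build with that SHA.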
In conclusion, this is a small list of techniques that can make your Airflow usage much simpler.
If you have any questions, feel free to reach out. I’d be happy to help you solve your problems.