SQLMesh provides first-class support for Airflow with the following capabilities:
- A Directed Acyclic Graph (DAG) generated dynamically for each model version. Each DAG accounts for all its upstream dependencies defined within SQLMesh, and only runs after upstream DAGs succeed for the time period being processed.
- Each plan application results in a dynamically generated DAG dedicated specifically to that plan.
- The Airflow Database Backend is used for persistence of the SQLMesh state, meaning no external storage or additional configuration is required for SQLMesh to work.
- The janitor DAG runs periodically and automatically to clean up DAGs and other SQLMesh artifacts that are no longer needed.
- Support for any SQL engine can be added by providing a custom Airflow Operator.
Airflow cluster configuration
To enable SQLMesh support on a target Airflow cluster, the SQLMesh package must first be installed on that cluster. Make sure to install it with the extras required by your engine, such as sqlmesh[databricks] for Databricks; check setup.py for the full list of extras.
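On a cluster where models run on Databricks, for instance, the installation might look like this:

```bash
pip install "sqlmesh[databricks]"
```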
Note: The Airflow Webserver instance(s) must be restarted after installation and every time the SQLMesh package is upgraded.
Once the package is installed, a Python module with the following contents must be created in the dags/ folder of the target DAG repository:
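As a sketch, assuming the module is saved as dags/sqlmesh.py and models run on Spark (substitute your engine's operator name):

```python
from sqlmesh.schedulers.airflow.integration import SQLMeshAirflow

# Create the integration entry point for the target engine ("spark" in this sketch).
sqlmesh_airflow = SQLMeshAirflow("spark")

# Expose every dynamically generated SQLMesh DAG to the Airflow scheduler.
for dag in sqlmesh_airflow.dags:
    globals()[dag.dag_id] = dag
```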
Note: The name of the engine operator is the only mandatory parameter needed for
sqlmesh.schedulers.airflow.integration.SQLMeshAirflow. Currently supported engines are listed in the Engine support section.
By default, SQLMesh uses Airflow's database connection to read and write its state.
To configure a different storage backend for the SQLMesh state, you need to create a new Airflow Connection with ID sqlmesh_state_db and type Generic. The configuration should be provided in the connection's extra field in JSON format.
Refer to the Connection Configuration for supported fields.
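For illustration, a hypothetical Postgres-backed state connection could use an extra field similar to the following (the available fields depend on the connection type):

```json
{
    "type": "postgres",
    "host": "10.0.0.1",
    "port": 5432,
    "database": "sqlmesh_state",
    "user": "sqlmesh",
    "password": "<password>"
}
```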
SQLMesh client configuration
In your SQLMesh repository, create the following configuration within config.yaml:
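A minimal sketch, assuming the Airflow webserver is reachable at http://localhost:8080/ with default credentials (adjust the URL and credentials for your deployment):

```yaml
default_scheduler:
  type: airflow
  airflow_url: http://localhost:8080/
  username: airflow
  password: airflow
```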
External signals
Sometimes there is a need to postpone a model's evaluation until certain external conditions are met. For example, a model might refer to an external table and should only be evaluated once the data actually lands upstream. This can be achieved using external signals.
Signals are defined as part of the model's definition using arbitrary key-value pairs. Additionally, @end_* macros can be used within these values and are resolved at evaluation time.
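As an illustrative sketch (the key names below are arbitrary, and the exact MODEL syntax may vary between SQLMesh versions):

```sql
-- Illustrative only: table_name, ds, and hour are user-defined keys.
MODEL (
  name example.downstream_model,
  signals [
    (
      table_name = 'upstream_table_a',
      ds = @end_ds,
    ),
    (
      table_name = 'upstream_table_b',
      ds = @end_ds,
      hour = 23,
    ),
  ],
);
```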
Note that keys such as hour in the example above are arbitrary and defined by the user.
Now, as part of the SQLMesh integration module, a function needs to be passed into the SQLMeshAirflow constructor. This function should accept a signal payload and return an Airflow Sensor instance representing that signal.
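A minimal sketch, assuming the signal keys from the example above; UpstreamTableSensor and its marker-file check are hypothetical placeholders, and the external_table_sensor_factory parameter name should be verified against the SQLMeshAirflow API:

```python
import typing as t
from pathlib import Path

from airflow.sensors.base import BaseSensorOperator

from sqlmesh.schedulers.airflow.integration import SQLMeshAirflow


class UpstreamTableSensor(BaseSensorOperator):
    """Hypothetical sensor that waits for upstream data to land."""

    def __init__(self, table_name: str, ds: str, **kwargs: t.Any) -> None:
        super().__init__(**kwargs)
        self.table_name = table_name
        self.ds = ds

    def poke(self, context: t.Any) -> bool:
        # Placeholder readiness check; swap in a partition or metastore lookup.
        return Path(f"/data/landing/{self.table_name}/{self.ds}/_SUCCESS").exists()


def create_external_sensor(signal: t.Dict[str, t.Any]) -> BaseSensorOperator:
    # The keys available here match the key-value pairs declared in the model's signals.
    return UpstreamTableSensor(
        task_id=f"wait_for_{signal['table_name']}",
        table_name=signal["table_name"],
        ds=signal["ds"],
    )


sqlmesh_airflow = SQLMeshAirflow(
    "spark",
    external_table_sensor_factory=create_external_sensor,  # parameter name assumed
)
```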
The create_external_sensor function in the example above takes the signal dictionary as an argument and returns an instance of BaseSensorOperator. The keys in the signal dictionary match the keys provided in the model definition.
Engine support
SQLMesh supports a variety of engines in Airflow. Support for each engine is provided by a custom Airflow operator implementation. Below is a list of links to the operators supported out of the box, along with information on how to configure them.
Managed Airflow instances
Multiple companies offer managed Airflow instances that integrate with their products. This section describes SQLMesh support for some of the options.
Google Cloud Composer
SQLMesh fully supports Airflow hosted on Google Cloud Composer.
Astronomer
Astronomer provides managed Airflow instances running on AWS, GCP, and Azure. SQLMesh fully supports Airflow hosted by Astronomer.
Additional dependencies need to be installed:
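If the managed instance installs Python packages from a requirements.txt file, for example, adding SQLMesh with your engine's extras (Databricks here, purely as an assumption) might look like this:

```text
sqlmesh[databricks]
```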
Additionally, the scheduler needs to be configured accordingly:
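As a sketch, assuming the standard airflow scheduler type and a hypothetical webserver URL (the exact scheduler type and fields can differ per managed platform):

```yaml
default_scheduler:
  type: airflow
  airflow_url: https://your-managed-airflow-host/
  username: <username>
  password: <password>
```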