A plan is a set of changes that summarizes the difference between the local state of a project and the state of a target environment. In order for any model changes to take effect in a target environment, a plan needs to be created and applied.
During plan creation:
- The local state of the SQLMesh project is compared to the state of a target environment. The difference between the two and the actions needed to synchronize the environment with the local state are what constitutes a plan.
- Users may be prompted to categorize changes) to existing models so SQLMesh can determine what actions to take for indirectly affected models (the downstream models that depend on the updated models). By default, SQLMesh attempts to categorize changes automatically, but this behavior can be changed through configuration.
- Each plan requires a date range to which it will be applied. If not specified, the date range is derived automatically based on model definitions and the target environment.
The benefit of plans is that all changes can be reviewed and verified before they are applied to the data warehouse and any computations are performed. A typical plan contains a combination of the following:
- A list of added models
- A list of removed models
- A list of directly modified models and a text diff of changes that have been made
- A list of indirectly modified models
- Missing data intervals for affected models
- A date range that will be affected by the plan application
To create a new plan, run the following command:
If no environment name is specified, the plan is generated for the
Categories only need to be provided for models that have been modified directly. The categorization of indirectly modified downstream models is inferred based on the types of changes to the directly modified models.
If more than one upstream dependency of an indirectly modified model has been modified and they have conflicting categories, the most conservative category (breaking) is assigned to this model.
If a directly modified model change is categorized as breaking, it and its downstream dependencies will be backfilled.
In general, this is the safest option because it guarantees all downstream dependencies will reflect the change. However, it is a more expensive option because it involves additional data reprocessing, which has a runtime cost associated with it (refer to backfilling).
Choose this option when a change has been made to a model's logic that has a functional impact on its downstream dependencies. For example, adding or modifying a model's
WHERE clause is a breaking change because downstream models contain rows that would now be filtered out.
A directly-modified model that is classified as non-breaking will be backfilled, but its downstream dependencies will not.
This is a common choice in scenarios such as an addition of a new column, an action which doesn't affect downstream models, as new columns can't be used by downstream models without modifying them directly to select the column. If any downstream models contain a
select * from the model, SQLMesh attempts to infer breaking status on a best-effort basis. We recommend explicitly specifying a query's columns to avoid unnecessary recomputation.
A modified (either directly or indirectly) model that is categorized as forward-only will continue to use the existing physical table once the change is deployed to production (the
prod environment). This means that no backfill will take place.
While iterating on forward-only changes in the development environment, the model's output will be stored in either a temporary table or a shallow clone of the production table if supported by the engine. In either case the data produced this way in the development environment will not be reused once the change is deployed to production. See Forward-only Plans for more details.
This category is assigned by SQLMesh automatically either when a user opts into using a forward-only plan or when a model is explicitly configured to be forward-only.
Once a plan has been created and reviewed, it is then applied to the target environment in order for its changes to take effect.
Every time a model is changed as part of a plan, a new variant of this model gets created behind the scenes (a snapshot with a unique fingerprint is assigned to it). In turn, each model variant's data is stored in a separate physical table. Data between different variants of the same model is never shared, except for forward-only plans.
When a plan is applied to an environment, the environment gets associated with the set of model variants that are part of that plan. In other words, each environment is a collection of references to model variants and the physical tables associated with them.
Each model variant gets its own physical table while environments only contain references to these tables.
This unique approach to understanding and applying changes is what enables SQLMesh's Virtual Environments. This technology allows SQLMesh to ensure complete isolation between environments while allowing it to share physical data assets between environments when appropriate and safe to do so. Additionally, since each model change is captured in a separate physical table, reverting to a previous version becomes a simple and quick operation (refer to Virtual Update) as long as its physical table hasn't been garbage collected by the janitor process. SQLMesh makes it easy to be correct and really hard to accidentally and irreversibly break things.
Despite all the benefits, the approach described above is not without trade-offs. When a new model version is just created, a physical table assigned to it is empty. Therefore, SQLMesh needs to re-apply the logic of the new model version to the entire date range of this model in order to populate the new version's physical table. This process is called backfilling.
At the moment, we are using the term backfilling broadly to describe any situation in which a model is updated. That includes these operations:
- When a VIEW model is created
- When a FULL model is built
- When an INCREMENTAL model is built for the first time
- When an INCREMENTAL model has recent data appended to it
- When an INCREMENTAL model has older data inserted (i.e., resolving a data gap or prepending historical data)
We will be iterating on terminology to better capture the nuances of each type in future versions.
Note for incremental models: despite the fact that backfilling can happen incrementally (see
batch_size parameter on models), there is an extra cost associated with this operation due to additional runtime involved. If the runtime cost is a concern, a forward-only plan can be used instead.
Another benefit of the SQLMesh approach is that data for a new model version can be fully pre-built while still in a development environment. That way all changes and their downstream dependencies can be fully previewed before they are promoted to the production environment.
With this approach, the process of promoting a change to production is reduced to reference swapping. If during plan creation no data gaps have been detected and only references to new model versions need to be updated, then the update is referred to as a Virtual Update. Virtual Updates impose no additional runtime overhead or cost.
Sometimes the runtime cost associated with rebuilding an entire physical table is too high and outweighs the benefits a separate table provides. This is when a forward-only plan comes in handy.
When a forward-only plan is applied to the
prod environment, none of the plan's changed models will have new physical tables created for them. Instead, physical tables from previous model versions are reused. The benefit of this is that no backfilling is required, so there is no runtime overhead or cost. The drawback is that reverting to a previous version is no longer simple and requires a combination of additional forward-only changes and restatements.
Note that once a forward-only change is applied to
prod, all development environments that referred to the previous versions of the updated models will be impacted.
A core component of the development process is to execute code and verify its behavior. To enable this while preserving isolation between environments,
sqlmesh plan [environment name] evaluates code in non-
prod environments while targeting shallow (a.k.a. "zero-copy") clones of production tables for engines that support them or newly created temporary physical tables for engines that don't. This means that only a limited preview of changes is available in the development environment before the change is promoted to
prod. The date range of the preview is provided as part of plan creation command.
Engines for which table cloning is supported include:
Note that all changes made as part of a forward-only plan automatically get a forward-only category assigned to them. These types of changes can't be mixed together with breaking and non-breaking changes within the same plan.
To create a forward-only plan, add the
--forward-only option to the
--forward-only flag is not required when applying changes to models that have been explicitly configured as forward-only. Use it only if you need to provide a time range for the preview window or the effective date.
Changes that are part of the forward-only plan can also be applied retroactively to the production environment by specifying the effective date:
There are cases when models need to be re-evaluated for a given time range, even though changes may not have been made to those model definitions. This could be due to an upstream issue with a dataset defined outside the SQLMesh platform, or when a forward-only plan change needs to be applied retroactively to a bounded interval of historical data.
For this reason, the
plan command supports the
--restate-model option, which allows users to specify one or more names of a model to be reprocessed. Each name can also refer to an external table defined outside SQLMesh.
Application of a plan will trigger a cascading backfill for all specified models (other than external tables), as well as all models downstream from them. The plan's date range determines the data intervals that will be affected.
For example, this command creates a plan that restates the model
db.model_a and all its downstream dependencies, as well as all models that refer to the
external.table_a table and their downstream dependencies: