# Configuration guide
SQLMesh's behavior is determined by three things: a project's files (e.g., models), user actions (e.g., creating a plan), and how SQLMesh is configured.
This page describes how SQLMesh configuration works and discusses the aspects of SQLMesh behavior that can be modified via configuration.
The configuration reference page contains concise lists of all configuration parameters and their default values.
## Configuration files
NOTE: SQLMesh project configurations have the following two requirements:

- A `config.yaml` or `config.py` file must be present in the project's folder.
- That configuration file must contain a default SQL dialect for the project's models in the `model_defaults` `dialect` key.
SQLMesh configuration parameters can be set as environment variables, in a configuration file in the `~/.sqlmesh` folder, and in the configuration file within a project folder.
The sources have the following order of precedence:

1. Environment variable (e.g., `SQLMESH__MODEL_DEFAULTS__DIALECT`) [HIGHEST PRECEDENCE]
2. `config.yaml` or `config.py` in the `~/.sqlmesh` folder
3. `config.yaml` or `config.py` in a project folder [LOWEST PRECEDENCE]
### File type
You can specify a SQLMesh configuration in either YAML or Python.
YAML configuration is simpler, and we recommend it for most projects. Python configuration is more complex, but it enables functionality that YAML does not support.
Because Python configuration files are evaluated by Python when SQLMesh reads them, they support dynamic parameters based on the computational environment in which SQLMesh is running.
For example, Python configuration files enable use of third-party secrets managers for storing passwords and other sensitive information. They also support user-specific parameters such as automatically setting project defaults based on which user account is running SQLMesh.
#### YAML
YAML configuration files consist of configuration keys and values. Strings are not quoted, and some keys are "dictionaries" that contain one or more sub-keys.
For example, the `default_gateway` key specifies the default gateway SQLMesh should use when executing commands. It takes a single, unquoted gateway name as its value:
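(A minimal sketch; `my_gateway` is an illustrative gateway name.)

```yaml
default_gateway: my_gateway
```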
In contrast, the `gateways` key takes dictionaries as values, and each gateway dictionary contains one or more connection dictionaries. This example specifies the `my_gateway` gateway with a Snowflake `connection`:
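(A sketch; the bracketed values are placeholders you would replace with your own credentials.)

```yaml
gateways:
  my_gateway:
    connection:
      type: snowflake
      user: <username>
      password: <password>
      account: <account>
```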
Gateway dictionaries can contain multiple connection dictionaries if different SQLMesh components should use different connections (e.g., SQLMesh `test`s should run in a different database than SQLMesh `plan`s). See the gateways section for more information on gateway configuration.
#### Python
Python configuration files consist of statements that import SQLMesh configuration classes and a configuration specification using those classes.
At minimum, a Python configuration file must:

- Create an object of the SQLMesh `Config` class named `config`
- Specify that object's `model_defaults` argument with a `ModelDefaultsConfig()` object specifying the default SQL dialect for the project's models
For example, this minimal configuration specifies a default SQL dialect of `duckdb` and uses the default values for all other configuration parameters:
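(A minimal sketch of such a file, based on the requirements listed above.)

```python
from sqlmesh.core.config import Config, ModelDefaultsConfig

# The object MUST be named `config` for SQLMesh to find it.
config = Config(
    model_defaults=ModelDefaultsConfig(dialect="duckdb"),
)
```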
Python configuration files may optionally define additional configuration objects and switch between the configurations when issuing `sqlmesh` commands. For example, if a configuration file contained a second configuration object `my_second_config`, you could create a plan using that config with `sqlmesh --config my_second_config plan`.
Different `Config` arguments accept different object types. Some, such as `model_defaults`, take SQLMesh configuration objects. Others, such as `default_gateway`, take strings or other Python object types like dictionaries.
SQLMesh's Python configuration components are documented in the `sqlmesh.core.config` module's API documentation.

The `config` sub-module API documentation describes the individual classes used for the relevant `Config` arguments:
- Model defaults configuration: `ModelDefaultsConfig()`
- Gateway configuration: `GatewayConfig()`
- Connection configuration (separate classes for each supported database/engine)
- Scheduler configuration (separate classes for each supported scheduler)
- Plan change categorization configuration: `CategorizerConfig()`
- User configuration: `User()`
- Notification configuration (separate classes for each notification target)
See the notifications guide for more information about user and notification specification.
## Environment variables
All software runs within a system environment that stores information as "environment variables."
SQLMesh can access environment variables during configuration, which enables approaches like storing passwords/secrets outside the configuration file and changing configuration parameters dynamically based on which user is running SQLMesh.
You can use environment variables in two ways: specifying them in the configuration file or creating properly named variables to override configuration file values.
### Configuration file
This section demonstrates using environment variables in YAML and Python configuration files.
The examples specify a Snowflake connection whose password is stored in an environment variable `SNOWFLAKE_PW`.
Specify environment variables in a YAML configuration with the syntax `{{ env_var('<ENVIRONMENT VARIABLE NAME>') }}`. Note that the environment variable name is contained in single quotes.
Access the `SNOWFLAKE_PW` environment variable in a Snowflake connection configuration like this:
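(A sketch; all values other than the `env_var` call are placeholders.)

```yaml
gateways:
  my_gateway:
    connection:
      type: snowflake
      user: <username>
      password: {{ env_var('SNOWFLAKE_PW') }}
      account: <account>
```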
Python accesses environment variables via the `os` library's `environ` dictionary.

Access the `SNOWFLAKE_PW` environment variable in a Snowflake connection configuration like this:
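(A sketch; the user and account values are placeholders, and the class names are the standard `sqlmesh.core.config` exports.)

```python
import os

from sqlmesh.core.config import (
    Config,
    GatewayConfig,
    ModelDefaultsConfig,
    SnowflakeConnectionConfig,
)

config = Config(
    model_defaults=ModelDefaultsConfig(dialect="snowflake"),
    gateways={
        "my_gateway": GatewayConfig(
            connection=SnowflakeConnectionConfig(
                user="<username>",
                # Read the password from the environment instead of hardcoding it.
                password=os.environ["SNOWFLAKE_PW"],
                account="<account>",
            ),
        ),
    },
)
```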
### Overrides
Environment variables have the highest precedence among configuration methods, as noted above. They will automatically override configuration file specifications if they follow a specific naming structure.
The structure is based on the names of the configuration fields, with double underscores `__` between the field names. The environment variable name must begin with `SQLMESH__`, followed by the YAML field names starting at the root and moving downward in the hierarchy.
For example, we can override the password specified in a Snowflake connection. This is the YAML specification contained in our configuration file, which specifies a password `dummy_pw`:
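(A sketch; fields other than the password are placeholders.)

```yaml
gateways:
  my_gateway:
    connection:
      type: snowflake
      user: <username>
      password: dummy_pw
      account: <account>
```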
We can override the `dummy_pw` value with the true password `real_pw` by creating an environment variable. This example demonstrates creating the variable with the bash `export` command:
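(The variable name follows the `SQLMESH__` naming structure explained next.)

```shell
# Override gateways.my_gateway.connection.password from the environment:
export SQLMESH__GATEWAYS__MY_GATEWAY__CONNECTION__PASSWORD="real_pw"
```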
After the initial string `SQLMESH__`, the environment variable name components move down the key hierarchy in the YAML specification: `GATEWAYS` --> `MY_GATEWAY` --> `CONNECTION` --> `PASSWORD`.
## Configuration types
A SQLMesh project configuration is hierarchical and consists of root level parameters within which other parameters are defined.
Conceptually, we can group the root level parameters into the following types. Each type links to its table of parameters in the SQLMesh configuration reference page:
- Project - configuration options for SQLMesh project directories.
- Environment - configuration options for SQLMesh environment creation/promotion, physical table schemas, and view schemas.
- Gateways - configuration options for how SQLMesh should connect to the data warehouse, state backend, and scheduler.
- Gateway/connection defaults - configuration options for what should happen when gateways or connections are not all explicitly specified.
- Model defaults - configuration options for what should happen when model-specific configurations are not explicitly specified in a model's file.
- Debug mode - configuration option for SQLMesh to print and log actions and full backtraces.
## Configuration details
The rest of this page provides additional detail for some of the configuration options and provides brief examples. Comprehensive lists of configuration options are at the configuration reference page.
### Table/view storage locations
SQLMesh creates schemas, physical tables, and views in the data warehouse/engine. Learn more about why and how SQLMesh creates schema in the "Why does SQLMesh create schemas?" FAQ.
The default SQLMesh behavior described in the FAQ is appropriate for most deployments, but you can override where SQLMesh creates physical tables and views with the `physical_schema_mapping`, `environment_suffix_target`, and `environment_catalog_mapping` configuration options. These options are in the environments section of the configuration reference page.
#### Physical table schemas
By default, SQLMesh creates physical tables for a model in a schema with a naming convention of `sqlmesh__[model schema]`.

This can be overridden on a per-schema basis using the `physical_schema_mapping` option, which removes the `sqlmesh__` prefix and uses the regex patterns you provide to map the schemas defined in your models to their corresponding physical schemas.
This example configuration overrides the default physical schemas for the `my_schema` model schema and any model schemas starting with `dev`:
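(Reconstructed from the mapping table that follows; the mapped schema names are illustrative.)

```yaml
physical_schema_mapping:
  '^my_schema$': my_new_schema
  '^dev.*': development
```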
This config causes the following mapping behaviour:

| Model name | Default physical location | Resolved physical location |
|---|---|---|
| `my_schema.my_table` | `sqlmesh__my_schema.table_<fingerprint>` | `my_new_schema.table_<fingerprint>` |
| `dev_schema.my_table` | `sqlmesh__dev_schema.table_<fingerprint>` | `development.table_<fingerprint>` |
| `other.my_table` | `sqlmesh__other.table_<fingerprint>` | `sqlmesh__other.table_<fingerprint>` |
This only applies to the physical tables that SQLMesh creates - the views are still created in `my_schema` (prod) or `my_schema__<env>`.
#### Disable environment-specific schemas
SQLMesh stores `prod` environment views in the schema given in a model's name - for example, the `prod` views for a model `my_schema.users` will be located in `my_schema`.
By default, for non-prod environments SQLMesh creates a new schema that appends the environment name to the model name's schema. For example, by default the view for a model `my_schema.users` in a SQLMesh environment named `dev` will be located in the schema `my_schema__dev`.
This behavior can be changed to append a suffix at the end of a table/view name instead. Appending the suffix to a table/view name means that non-prod environment views will be created in the same schema as the `prod` environment. The prod and non-prod views are differentiated by non-prod view names ending with `__<env>`.
For example, if you created a `dev` environment for a project containing a model named `my_schema.users`, the model view would be created as `my_schema.users__dev` instead of the default behavior of `my_schema__dev.users`.
Config example:
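(A sketch; the default value is `schema`.)

```yaml
environment_suffix_target: table
```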
The Python `environment_suffix_target` argument takes an `EnvironmentSuffixTarget` enumeration with a value of `EnvironmentSuffixTarget.TABLE` or `EnvironmentSuffixTarget.SCHEMA` (default).
The default behavior of appending the suffix to schemas is recommended because it leaves production with a single clean interface for accessing the views. However, if you are deploying SQLMesh in an environment with tight restrictions on schema creation then this can be a useful way of reducing the number of schemas SQLMesh uses.
#### Environment view catalogs
By default, SQLMesh creates an environment view in the same catalog as the physical table the view points to. The physical table's catalog is determined by either the catalog specified in the model name or the default catalog defined in the connection.
Some companies fully segregate `prod` and non-prod environment objects by catalog. For example, they might have a "prod" catalog that contains all `prod` environment physical tables and views and a separate "dev" catalog that contains all `dev` environment physical tables and views.
Separate prod and non-prod catalogs can also be useful if you have a CI/CD pipeline that creates environments, like the SQLMesh Github Actions CI/CD Bot. You might want to store the CI/CD environment objects in a dedicated catalog since there can be many of them.
To configure separate catalogs, provide a mapping from regex patterns to catalog names. SQLMesh will compare the name of an environment to the regex patterns; when it finds a match it will store the environment's objects in the corresponding catalog.
SQLMesh evaluates the regex patterns in the order defined in the configuration; it uses the catalog for the first matching pattern. If no match is found, the catalog defined in the model or the default catalog defined on the connection will be used.
Config example:
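(Reconstructed from the bullet points that follow.)

```yaml
environment_catalog_mapping:
  '^prod$': prod
  '^dev.*': dev
  '^analytics_repo.*': cicd
```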
With the example configuration above, SQLMesh would evaluate environment names as follows:
- If the environment name is `prod`, the catalog will be `prod`.
- If the environment name starts with `dev`, the catalog will be `dev`.
- If the environment name starts with `analytics_repo`, the catalog will be `cicd`.
Note: This feature is only available for engines that support querying across catalogs. At the time of writing, the following engines are NOT supported:
##### Regex Tips
- If you are less familiar with regex, you can use a tool like regex101 to help you build your regex patterns.
- LLMs, like ChatGPT, can help with generating regex patterns. Make sure to validate the suggestion in regex101.
- If you want an exact word match, surround it with `^` and `$` as in the example above.
- If you want a catch-all at the end of your mapping, to avoid ever using the model catalog or default catalog, use `.*` as the pattern. This will match any environment name that hasn't already been matched.
### Auto-categorize model changes
SQLMesh compares the current state of project files to an environment when `sqlmesh plan` is run. It detects changes to models, which can be classified as breaking or non-breaking.
SQLMesh can attempt to automatically categorize the changes it detects. The `plan.auto_categorize_changes` option determines whether SQLMesh should attempt automatic change categorization. This option is in the environments section of the configuration reference page.
Supported values:

- `full`: Never prompt the user for input; instead fall back to the most conservative category (breaking) if the category can't be determined automatically.
- `semi`: Prompt the user for input only if the change category can't be determined automatically.
- `off`: Always prompt the user for input; automatic categorization will not be attempted.
Example showing default values:
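(A sketch; the values shown are believed to be the defaults at the time of writing — confirm them against the configuration reference page.)

```yaml
plan:
  auto_categorize_changes:
    external: full
    python: off
    sql: full
    seed: semi
```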
The Python `auto_categorize_changes` argument takes a `CategorizerConfig` object. That object's arguments take an `AutoCategorizationMode` enumeration with values of `AutoCategorizationMode.FULL`, `AutoCategorizationMode.SEMI`, or `AutoCategorizationMode.OFF`.
### Gateways
The `gateways` configuration defines how SQLMesh should connect to the data warehouse, state backend, and scheduler. These options are in the gateway section of the configuration reference page.

Each gateway key represents a unique gateway name and configures its connections. For example, this configures the `my_gateway` gateway:
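(A sketch; the ellipses stand in for the component-specific settings described in the following sections.)

```yaml
gateways:
  my_gateway:
    connection:
      ...
    state_connection:
      ...
    test_connection:
      ...
    scheduler:
      ...
```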
The Python `gateways` argument takes a dictionary of gateway names and `GatewayConfig` objects. A `GatewayConfig`'s connection-related arguments take an engine-specific connection config object, and the `scheduler` argument takes a scheduler config object.
Gateways do not need to specify all four components in the example above. The gateway defaults options control what happens if they are not all specified - find more information on gateway defaults below.
#### Connections
The `connection` configuration controls the data warehouse connection. These options are in the connection section of the configuration reference page.
The allowed keys include:

- The optional `concurrent_tasks` key specifies the maximum number of concurrent tasks SQLMesh will run. The default value is 4 for engines that support concurrent tasks.
- Most keys are specific to the connection engine `type` - see below. The default data warehouse connection type is an in-memory DuckDB database.
Example Snowflake connection configuration:
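(A sketch; the bracketed values are placeholders.)

```yaml
gateways:
  my_gateway:
    connection:
      type: snowflake
      user: <username>
      password: <password>
      account: <account>
```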
A Snowflake connection is specified with a `SnowflakeConnectionConfig` object.
##### Engine connection configuration
These pages describe the connection configuration options for each execution engine.
- Athena
- BigQuery
- Databricks
- DuckDB
- MotherDuck
- MySQL
- MSSQL
- Postgres
- GCP Postgres
- Redshift
- Snowflake
- Spark
- Trino
#### State connection
Configuration for the state backend connection if different from the data warehouse connection.
The data warehouse connection is used to store SQLMesh state if the `state_connection` key is not specified, unless the configuration uses an Airflow or Google Cloud Composer scheduler. If using one of those schedulers, the scheduler's database is used (not the data warehouse) unless an Airflow Connection has been configured.
Unlike data transformations, storing state information requires database transactions. Data warehouses aren’t optimized for executing transactions, and storing state information in them can slow down your project or produce corrupted data due to simultaneous writes to the same table. Therefore, production SQLMesh deployments should use a dedicated state connection.
**Note:** Using the same connection for the data warehouse and state is not recommended for production deployments of SQLMesh.
The easiest and most reliable way to manage your state connection is for Tobiko Cloud to do it for you. If you'd rather handle it yourself, we list recommended and unsupported state engines below.
Recommended state engines for production deployments:
Other state engines with fast and reliable database transactions (less tested than the recommended engines):
- DuckDB
    - Does not support concurrency and may error if the primary connection executes with concurrent tasks (its connection configuration's `concurrent_tasks` is greater than 1)
- MySQL
- MSSQL
Unsupported state engines, even for development:
This example gateway configuration uses Snowflake for the data warehouse connection and Postgres for the state backend connection:
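(A sketch; all bracketed values are placeholders.)

```yaml
gateways:
  my_gateway:
    connection:
      type: snowflake
      user: <username>
      password: <password>
      account: <account>
    state_connection:
      type: postgres
      host: <host>
      port: 5432
      user: <username>
      password: <password>
      database: <database>
```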
A Postgres connection is specified with a `PostgresConnectionConfig` object.
#### State schema name
By default, the schema name used to store state tables is `sqlmesh`. This can be changed by providing the `state_schema` config key in the gateway configuration.
Example configuration to store state information in a postgres database's `custom_name` schema:
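(A sketch; connection values are placeholders, and `state_schema` sits at the gateway level alongside `state_connection`.)

```yaml
gateways:
  my_gateway:
    state_connection:
      type: postgres
      host: <host>
      port: 5432
      user: <username>
      password: <password>
      database: <database>
    state_schema: custom_name
```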
This would create all state tables in the schema `custom_name`.
#### Test connection
Configuration for a connection used to run unit tests. An in-memory DuckDB database is used if the `test_connection` key is not specified.
A DuckDB connection is specified with a `DuckDBConnectionConfig` object. A `DuckDBConnectionConfig` with no arguments specified uses an in-memory DuckDB database.
#### Scheduler
Identifies which scheduler backend to use. The scheduler backend is used both for storing metadata and for executing plans. By default, the scheduler type is set to `builtin`, which uses the existing SQL engine to store metadata. Use the `airflow` type to integrate with Airflow.
These options are in the scheduler section of the configuration reference page.
##### Builtin
Example configuration:
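(A sketch; `my_gateway` is an illustrative gateway name.)

```yaml
gateways:
  my_gateway:
    scheduler:
      type: builtin
```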
A built-in scheduler is specified with a `BuiltInSchedulerConfig` object.
No additional configuration options are supported by this scheduler type.
##### Airflow
Example configuration:
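(A sketch; the URL and credentials are placeholders — see the configuration reference for the full set of Airflow scheduler parameters.)

```yaml
gateways:
  my_gateway:
    scheduler:
      type: airflow
      airflow_url: http://localhost:8080/
      username: <username>
      password: <password>
```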
An Airflow scheduler is specified with an `AirflowSchedulerConfig` object.
See the Airflow Integration Guide for information about how to integrate Airflow with SQLMesh. See the configuration reference page for a list of all parameters.
##### Cloud Composer
The Google Cloud Composer scheduler type shares the same configuration options as the `airflow` type, except for `username` and `password`. Cloud Composer relies on `gcloud` authentication, so the `username` and `password` options are not required.
Example configuration:
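(A sketch; the URL is a placeholder for your Composer environment's Airflow web server URL.)

```yaml
gateways:
  my_gateway:
    scheduler:
      type: cloud_composer
      airflow_url: https://<composer-airflow-web-server-url>/
```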
A Google Cloud Composer scheduler is specified with a `CloudComposerSchedulerConfig` object.
### Gateway/connection defaults
The default gateway and connection keys specify what should happen when gateways or connections are not explicitly specified. These options are in the gateway/connection defaults section of the configuration reference page.
The gateway specified in `default_gateway` is used when a `sqlmesh` command does not explicitly specify a gateway. All SQLMesh CLI commands accept a gateway option after `sqlmesh` and before the command name; for example, `sqlmesh --gateway my_gateway plan`. If the option is not specified in a command call, the `default_gateway` is used.
The three default connection types are used when some gateways in the `gateways` configuration dictionaries do not specify every connection type.
#### Default gateway
If a configuration contains multiple gateways, SQLMesh will use the first one in the `gateways` dictionary by default. The `default_gateway` key is used to specify a different gateway name as the SQLMesh default.
Example configuration:
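(A sketch; the gateway names are illustrative and the ellipses stand in for gateway-specific settings.)

```yaml
gateways:
  my_gateway:
    ...
  my_other_gateway:
    ...

default_gateway: my_other_gateway
```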
#### Default connections/scheduler
The `default_connection`, `default_test_connection`, and `default_scheduler` keys are used to specify shared defaults across multiple gateways.

For example, you might have a specific connection where your tests should run regardless of which gateway is being used. Instead of duplicating the test connection information in each gateway specification, specify it once in the `default_test_connection` key.
Example configuration specifying a Postgres default connection, in-memory DuckDB default test connection, and builtin default scheduler:
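(A sketch; the bracketed Postgres values are placeholders.)

```yaml
default_connection:
  type: postgres
  host: <host>
  port: 5432
  user: <username>
  password: <password>
  database: <database>

default_test_connection:
  type: duckdb

default_scheduler:
  type: builtin
```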
### Models
#### Model defaults
The `model_defaults` key is required and must contain a value for the `dialect` key. All SQL dialects supported by the SQLGlot library are allowed. Other values are set automatically unless explicitly overridden in the model definition.
All supported `model_defaults` keys are listed in the models configuration reference page.
Example configuration:
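(A sketch; `owner` and `start` are illustrative optional defaults alongside the required `dialect`.)

```yaml
model_defaults:
  dialect: snowflake
  owner: jen
  start: 2022-01-01
```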
The default model kind is `VIEW` unless overridden with the `kind` key. For more information on model kinds, refer to the model concepts page.
#### Identifier resolution
When a SQL engine receives a query such as `SELECT id FROM "some_table"`, it eventually needs to understand what database objects the identifiers `id` and `"some_table"` correspond to. This process is usually referred to as identifier (or name) resolution.
Different SQL dialects implement different rules when resolving identifiers in queries. For example, certain identifiers may be treated as case-sensitive (e.g. if they're quoted), and a case-insensitive identifier is usually either lowercased or uppercased, before the engine actually looks up what object it corresponds to.
SQLMesh analyzes model queries so that it can extract useful information from them, such as computing Column-Level Lineage. To facilitate this analysis, it normalizes and quotes all identifiers in those queries, respecting each dialect's resolution rules.
The "normalization strategy", i.e. whether case-insensitive identifiers are lowercased or uppercased, is configurable per dialect. For example, to treat all identifiers as case-sensitive in a BigQuery project, one can do:
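(A sketch of one way to express this, appending the strategy to the dialect string; confirm the exact syntax against the normalization strategy documentation.)

```yaml
model_defaults:
  dialect: "bigquery,normalization_strategy=case_sensitive"
```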
This may be useful when identifier casing needs to be preserved, since SQLMesh will then not normalize it.
See here to learn more about the supported normalization strategies.
#### Model Kinds
Model kinds are required in each model file's `MODEL` DDL statement. A kind may optionally also be specified as a default in the model defaults configuration key.
All model kind specification keys are listed in the models configuration reference page.
The `VIEW`, `FULL`, and `EMBEDDED` model kinds are specified by name only, while other model kinds require additional parameters and are provided with an array of parameters:
A `FULL` model kind only requires a name:
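(A sketch of setting it as the project default.)

```yaml
model_defaults:
  kind: FULL
```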
`INCREMENTAL_BY_TIME_RANGE` requires an array specifying the model's `time_column` (which should be in the UTC time zone):
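(A sketch; `ds` is an illustrative time column name.)

```yaml
model_defaults:
  kind:
    name: INCREMENTAL_BY_TIME_RANGE
    time_column: ds
```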
Python model kinds are specified with model kind objects. Python model kind objects have the same arguments as their SQL counterparts, listed in the models configuration reference page.
This example demonstrates how to specify an incremental by time range model kind in Python:
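(A sketch under stated assumptions: the model name, columns, and returned data are illustrative, and the kind is passed as a dict keyed by `ModelKindName`.)

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model
from sqlmesh.core.model.kind import ModelKindName

@model(
    "my_schema.my_model",  # illustrative model name
    columns={"ds": "date", "value": "int"},
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
        time_column="ds",
    ),
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Return only the rows that fall within the [start, end] interval.
    return pd.DataFrame({"ds": [], "value": []})
```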
Learn more about specifying Python models at the Python models concepts page.
#### Model Naming
The `model_naming` configuration controls whether model names are inferred based on the project's directory structure. If `model_naming` is not defined or `infer_names` is set to false, the model names must be provided explicitly.
With `infer_names` set to true, model names are inferred based on their path. For example, a model located at `models/catalog/schema/model.sql` would be named `catalog.schema.model`. However, if a name is provided in the model definition, it will take precedence over the inferred name.
Example enabling name inference:
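(Reconstructed from the description above.)

```yaml
model_naming:
  infer_names: true
```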
### Debug mode
To enable debug mode, set the `SQLMESH_DEBUG` environment variable to one of the following values: "1", "true", "t", "yes" or "y".
Enabling this mode ensures that full backtraces are printed when using the CLI. The default log level is set to `DEBUG` when this mode is enabled.
Example enabling debug mode for the CLI command `sqlmesh plan`:
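(A sketch; the `sqlmesh plan` invocation is shown as a comment.)

```shell
# Any of "1", "true", "t", "yes", or "y" enables debug mode:
export SQLMESH_DEBUG=1

# Subsequent CLI commands now print full backtraces and log at DEBUG level:
# sqlmesh plan
```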