Overview
Models are comprised of metadata and queries that create tables and views, which can be used by other models or even outside of SQLMesh. They are defined in the models/
directory of your SQLMesh project and live in .sql
files.
SQLMesh will automatically determine the relationships among and lineage of your models by parsing SQL, so you don't have to worry about manually configuring dependencies.
Example
The following is an example of a model defined in SQL. Note the following aspects:
- Models can include descriptive information as comments, such as the first line.
- The first non-comment statement of a
model.sql
file is theMODEL
DDL. - The last non-comment statement should be a
SELECT
statement that defines the logic needed to create the table
Conventions
SQLMesh attempts to infer as much as possible about your pipelines through SQL alone to reduce the cognitive overhead of switching to another format such as YAML.
The SELECT
expression of a model must follow certain conventions for SQLMesh to detect the necessary metadata to operate.
Unique column names
The final SELECT
of a model's query must contain unique column names.
Explicit types
SQLMesh encourages explicit type casting in the final SELECT
of a model's query. It is considered a best practice to prevent unexpected types in the schema of a model's table.
SQLMesh uses the postgres x::int
syntax for casting; the casts are automatically transpiled to the appropriate format for the execution engine.
Inferrable names
The final SELECT
of a model's query must have inferrable names or aliases.
Explicit aliases are recommended, but not required. The SQLMesh formatter will automatically add aliases to columns without them when the model SQL is rendered.
This example demonstrates non-inferrable, inferrable, and explicit aliases:
Model properties
The MODEL
DDL statement takes various properties, which are used for both metadata and controlling behavior.
name
name
specifies the name of the model. This name represents the production view name that the model outputs, so it generally takes the form of"schema"."view_name"
. The name of a model must be unique in a SQLMesh project.
When models are used in non-production environments, SQLMesh automatically prefixes the names. For example, consider a model named"sushi"."customers"
. In production its view is named"sushi"."customers"
, and in dev its view is named"sushi__dev"."customers"
.
Name is required and must be unique.
kind
- Kind specifies what kind a model is. A model's kind determines how it is computed and stored. The default kind is
VIEW
, which means a view is created and your query is run each time that view is accessed. See below for properties that apply to incremental model kinds.
dialect
- Dialect defines the SQL dialect of the model. By default, this uses the dialect in the configuration file
model_defaults
dialect
key. All SQL dialects supported by the SQLGlot library are allowed.
owner
- Owner specifies who the main point of contact is for the model. It is an important field for organizations that have many data collaborators.
stamp
- An optional arbitrary string sequence used to create new model versions without making changes to any of the functional components of the definition.
start
- Start is used to determine the earliest time needed to process the model. It can be an absolute date/time (
2022-01-01
), or a relative one (1 year ago
).
cron
- Cron is used to schedule your model to process or refresh at a certain interval. It uses croniter under the hood, so expressions such as
@daily
can be used.
interval_unit
- Interval unit determines the granularity of data intervals for this model. By default the interval unit is automatically derived from the
cron
expression. Supported values are:year
,month
,day
,hour
,half_hour
,quarter_hour
, andfive_minute
.
tags
- Tags are one or more labels used to organize your models.
grain
- A model's grain is the column or combination of columns that uniquely identify a row in the results returned by the model's query. If the grain is set, SQLMesh tools like
table_diff
are simpler to run because they automatically use the model grain for parameters that would otherwise need to be specified manually.
grains
- A model can define multiple grains if it has more than one unique key or combination of keys.
references
- References are non-unique columns or combinations of columns that identify a join relationship to an entity. For example, a model could define a reference
account_id
, which would indicate that it can now automatically join to any model with anaccount_id
grain. It cannot safely join to a table with anaccount_id
reference because references are not unique and doing so would constitute a many-to-many join. Sometimes columns are named differently, in that case you can alias column names to a common entity name. For exampleguest_id AS account_id
would allow a model with the column guest_id to join to a model with the grain account_id.
storage_format
- Storage format is a property for engines such as Spark or Hive that support storage formats such as
parquet
andorc
.
partitioned_by
- Partitioned by is an optional property for engines such as Spark or BigQuery that support partitioning. Use this to specify a multi-column partition key or to modify a date column for partitioning. For example, in BigQuery you could partition by day by extracting the day component of a timestamp column
event_ts
withpartitioned_by TIMESTAMP_TRUNC(event_ts, DAY)
.
clustered_by
- Clustered by is an optional property for engines such as Bigquery that support clustering.
table_properties
- A key-value of arbitrary table properties specific to the target engine. For example:
allow_partials
- Indicates that this model can be executed for partial (incomplete) data intervals. By default, each model processes only complete intervals to prevent common mistakes caused by partial data. The size of the interval is determined by the model's interval_unit. Setting
allow_partials
totrue
overrides this behavior, indicating that the model may process a segment of input data that is missing some of the data points. Please note that setting this attribute totrue
results in the disregard of the cron attribute.
Incremental Model Properties
For models that are incremental, the following parameters can be specified in the kind
's definition.
time_column
- Time column is a required property for incremental models. It is used to determine which records to overwrite when doing an incremental insert. Time column can have an optional format string specified in the SQL dialect of the model.
- Engines that support partitioning, such as Spark and BigQuery, use the time column as the model's partition key. Multi-column partitions or modifications to columns can be specified with the
partitioned_by
property.
lookback
- Lookback is used for incremental models to capture late arriving data. This must be a positive integer and refers to the number of units that late arriving data is expected.
batch_size
- Batch size is used to optimize backfilling incremental data. It determines the maximum number of intervals to run in a single job. For example, if a model specifies a cron of
@hourly
and a batch_size of12
, when backfilling 3 days of data, the scheduler will spawn 6 jobs. (3 days * 24 hours/day = 72 hour intervals to fill. 72 intervals / 12 intervals per job = 6 jobs.)
forward_only
- Set this to true to indicate that all changes to this model should be forward-only.
disable_restatement
- Set this to true to indicate that data restatement is disabled for this model.
Macros
Macros can be used for passing in parameterized arguments such as dates, as well as for making SQL less repetitive. By default, SQLMesh provides several predefined macro variables that can be used. Macros are used by prefixing with the @
symbol. For more information, refer to macros.
Statements
Models can have additional statements that run before and/or after the main query. They can be useful for loading things such as UDFs or cleaning up after a model query has run.
In general, pre-statements statements should only be used for preparing the main query. They should not be used for creating or altering tables, as this could lead to unpredictable behavior if multiple models are running simultaneously.
Additional statements can also be provided after the main query, in which case they will run after each evaluation of the SELECT query. Note that the model query must end with a semi-colon prior to the post-statements.
Time column
Models that are loaded incrementally require a time column to partition data.
A time column is a column in a model with an optional format string in the dialect of the model; for example, '%Y-%m-%d'
for DuckDB or 'yyyy-mm-dd'
for Snowflake. For more information, refer to time column.
Advanced usage
The column used as your model's time column is not limited to a text or date type. In the following example, the time column, di
, is an integer: