Table Diff Guide
SQLMesh's table diff tool allows you to compare the schema and data of two data objects. It supports comparison of a SQLMesh model across two environments or direct comparison of database tables or views.
It provides a method of validating models that can be used along with evaluating a model and testing a model with unit tests.
Note: Table diff requires the two objects to already exist in your project's underlying database or engine. If comparing models, this means you should have already planned and applied your changes to an environment.
Table diff comparisons
Table diff executes two types of comparison on the source and target objects: a schema diff and a row diff.
The schema diff identifies whether fields have been added, removed, or changed data types in the target object relative to the source object.
The row diff identifies changes in data values across columns with the same name and data type in both tables. It does this by performing an OUTER JOIN
of the two tables then, for each column with the same name and data type, comparing data values from one table to those from the other.
The table diff tool can be called in two ways: comparison of a SQLMesh model across two project environments or direct comparison of tables/views. It executes the comparison using the database or engine specified in the SQLMesh project configuration.
Diffing models across environments
Compare a SQLMesh model across environments with the SQLMesh CLI interface by using the command sqlmesh table_diff [source environment]:[target environment] [model name]
.
For example, we could make two modifications to the SQLMesh quickstart model sqlmesh_example.incremental_model
:
- Change the row whose
item_id
is3
to4
with aCASE WHEN
statement - Remove row whose
item_id
is1
by adding aWHERE
clause
After running sqlmesh plan dev
and applying the plan, the updated model will be present in the dev
environment but not in prod
.
Compare the two versions of the model with the table diff tool by running sqlmesh table_diff prod:dev sqlmesh_example.incremental_model
.
The first argument prod:dev
specifies that prod
is the source environment to which we will compare the target environment dev
. The second argument sqlmesh_example.incremental_model
is the name of the model to compare across the prod
and dev
environments.
Because the grain
is set to [id, ds]
in the MODEL
statement, SQLMesh knows how to perform the join between the two models. If grain
were not set, the command would need to include the -o id -o ds
option to specify that the tables should be joined on column id
and ds
. Specify -o
once for each join column.
Table diff returns this output:
The "Schema Diff" section shows that the PROD
and DEV
schemas match because no columns have been added, removed, or change data type.
The "Row Counts" section shows that 6 rows were successfully joined and the 1 row we removed is only present in the PROD
model.
The COMMON ROWS column comparison stats
section shows that the item_id
column values had an 83.3% match for the six joined rows (5 of the 6 row values were unchanged by our CASE WHEN
statement). All non-join columns with the same data type in both tables are included in the comparison stats.
If we include the --show-sample
option in the command, the output also includes rows from the different join components.
The COMMON ROWS sample data differences
section displays the row whose item_id
value changed. The PROD__item_id
column shows that item_id
is 3 in the PROD
table, and the DEV__item_id
column shows that item_id
is 4.0 in the DEV
table.
The PROD ONLY sample rows
section shows the one row that is present in PROD
but not in DEV
.
If we add the --skip-grain-check
option, the grain is not validated. By default without this flag, a warning is displayed to the user if rows contain null or duplicate grains.
Under the hood, SQLMesh stores temporary data in the database to perform the comparison.
The default schema for these temporary tables is sqlmesh_temp
but can be changed with the --temp-schema
option.
The schema can be specified as a CATALOG.SCHEMA
or SCHEMA
.
Diffing multiple models across environments
SQLMesh allows you to compare multiple models across environments at once using model selection expressions. This is useful when you want to validate changes across a set of related models or the entire project.
To diff multiple models, use the --select-model
(or -m
for short) option with the table diff command:
When diffing multiple models, SQLMesh will:
- Show the models returned by the selector that exist in both environments and have differences
- Compare these models and display the data diff of each model
Note: Models will only be data diffed if there's a breaking change that impacts them.
The --select-model
option supports a powerful selection syntax that lets you choose models using patterns, tags, dependencies and git status. For complete details, see the model selection guide.
Note: Surround your selection pattern in single or double quotes. Ex:
'*'
,"sqlmesh_example.*"
Here are some common examples:
# Select all models in a schema
sqlmesh table_diff prod:dev -m "sqlmesh_example.*"
# Select a model and its dependencies
sqlmesh table_diff prod:dev -m "+model_name" # include upstream deps
sqlmesh table_diff prod:dev -m "model_name+" # include downstream deps
# Select models by tag
sqlmesh table_diff prod:dev -m "tag:finance"
# Select models with git changes
sqlmesh table_diff prod:dev -m "git:feature"
# Use logical operators for complex selections
sqlmesh table_diff prod:dev -m "(metrics.* & ^tag:deprecated)" # models in the metrics schema that aren't deprecated
# Combine multiple selectors
sqlmesh table_diff prod:dev -m "tag:finance" -m "metrics.*_daily"
When multiple selectors are provided, they are combined with OR logic, meaning a model matching any of the selectors will be included.
Note
All models being compared must have their grain
defined that is unique and not null, as this is used to perform the join between the tables in the two environments.
If the --warn-grain-check
option is used, this requirement is not enforced. Instead of raising an error, a warning is displayed for the models without a defined grain and diffs are computed for the remaining models.
Diffing tables or views
Compare specific tables or views with the SQLMesh CLI interface by using the command sqlmesh table_diff [source table]:[target table]
.
The source and target tables should be fully qualified with catalog or schema names such that a SQL query of the form SELECT ... FROM [source table]
would execute correctly.
Recall that SQLMesh models are accessible via views in the database. In the prod
environment, the view has the same name as the model. For example, in the quickstart example project the prod
incremental model is represented by the view sqlmesh_example.incremental_model
. In the dev
environment, __dev
is appended to the schema name so the incremental model is represented by the view sqlmesh_example__dev.incremental_model
.
We can replicate the comparison in the previous section by comparing the model views directly. Because we are passing the view names directly, the command needs to manually specify that the join should be on the id
and ds
columns with the -o id -o ds
flags.
The output matches, with the exception of the column labels in the COMMON ROWS sample data differences
. The underlying table for each column is indicated by s__
for "source" table (first table in the command's colon operator :
) and t__
for "target" table (second table in the command's colon operator :
).
Diffing tables or views across gateways
Tobiko Cloud Feature
Cross-database table diffing is available in Tobiko Cloud.
SQLMesh executes a project's models with a single database system, specified as a gateway in the project configuration.
The within-database table diff tool described above compares tables or environments within such a system. Sometimes, however, you might want to compare tables that reside in two different data systems.
For example, you might migrate your data transformations from an on-premises SQL engine to a cloud SQL engine while setting up your SQLMesh project. To demonstrate equivalence between the systems you could run the transformations in both and compare the new tables to the old tables.
The within-database table diff tool cannot make those comparisons, for two reasons:
- It must join the two tables being diffed, but with two systems no single database engine can access both tables.
- It assumes that data values can be compared across tables without modification. If the systems use different SQL engines, however, the diff must account for differences in the engines' data types (e.g., whether timestamps should include time zone information).
SQLMesh's cross-database table diff tool is built for just this scenario. Its comparison algorithm efficiently diffs tables without moving them from one system to the other and automatically addresses differences in data types.
Configuration and syntax
To diff tables across systems, first configure Gateways for each database system in your SQLMesh configuration file.
This example configures bigquery
and snowflake
gateways:
Then, specify each table's gateway in the table_diff
command with this syntax: [source_gateway]|[source table]:[target_gateway]|[target table]
.
For example, we could diff the landing.table
table across bigquery
and snowflake
gateways like this:
This syntax tells SQLMesh to use the cross-database diffing algorithm instead of the normal within-database diffing algorithm.
After adding gateways to the table names, use table_diff
as described above - the same options apply for specifying the join keys, decimal precision, etc. See tcloud sqlmesh table_diff --help
for a full list of options.
Warning
Cross-database diff works for data objects (tables / views).
Diffing models is not supported because we do not assume that both the source and target databases are managed by SQLMesh.
Example output
A cross-database diff is broken up into two stages.
The first stage is a schema diff. This example shows that differences in column name case across the two tables are identified as schema differences:
$ tcloud sqlmesh table_diff 'bigquery|sqlmesh_example.full_model:snowflake|sqlmesh_example.full_model' --on item_id --show-sample
Schema Diff Between 'BIGQUERY|SQLMESH_EXAMPLE.FULL_MODEL' and 'SNOWFLAKE|SQLMESH_EXAMPLE.FULL_MODEL':
├── Added Columns:
│ ├── ITEM_ID (DECIMAL(38, 0))
│ └── NUM_ORDERS (DECIMAL(38, 0))
└── Removed Columns:
├── item_id (BIGINT)
└── num_orders (BIGINT)
Schema has differences; continue comparing rows? [y/n]:
SQLMesh prompts you before comparing data values across table rows. The prompt provides an opportunity to discontinue the comparison if the schemas are vastly different (potentially indicating a mistake) or you need to exclude columns from the diff because you know they won't match.
The second stage of the diff is comparing data values across tables. Within each system, SQLMesh divides the data into chunks, evaluates each chunk, and compares the outputs across systems. If a difference is found, it performs a row-level diff on that chunk by reading a sample of mismatched rows from each system.
This example shows that 2 rows were present in each system but had different values, one row was in Bigquery only, and one row was in Snowflake only:
Dividing source dataset into 10 chunks (based on 10947709 total records)
Checking chunks against target dataset
Chunk 1 hash mismatch!
Starting row-level comparison for the range (1 -> 3)
Identifying individual record hashes that don't match
Comparing
Row Counts:
├── PARTIAL MATCH: 2 rows (66.67%)
├── BIGQUERY ONLY: 1 rows (16.67%)
└── SNOWFLAKE ONLY: 1 rows (16.67%)
COMMON ROWS column comparison stats:
pct_match
num_orders 0.0
COMMON ROWS sample data differences:
Column: num_orders
┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ item_id ┃ BIGQUERY ┃ SNOWFLAKE ┃
┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ 1 │ 5 │ 7 │
│ 2 │ 1 │ 2 │
└─────────┴──────────┴───────────┘
BIGQUERY ONLY sample rows:
item_id num_orders
7 4
SNOWFLAKE ONLY sample rows:
item_id num_orders
4 6
If there are no differences found between chunks, the source and target datasets can be considered equal:
Chunk 1 (1094771 rows) matches!
Chunk 2 (1094771 rows) matches!
...
Chunk 10 (1094770 rows) matches!
All 10947709 records match between 'bigquery|sqlmesh_example.full_model' and 'snowflake|TEST.SQLMESH_EXAMPLE.FULL_MODEL'
Info
Don't forget to specify the --show-sample
option if you'd like to see a sample of the actual mismatched data!
Otherwise, only high level statistics for the mismatched rows will be printed.
Supported engines
Cross-database diffing is supported on all execution engines that SQLMesh supports.