Clickhouse
This page describes SQLMesh support for the Clickhouse engine, including configuration options specific to Clickhouse.
Note
Clickhouse may not be used for the SQLMesh state connection.
Background
Clickhouse is a distributed, column-oriented SQL engine designed to rapidly execute analytical workloads.
It provides users fine-grained control of its behavior, but that control comes at the cost of complex configuration.
This section provides background information about Clickhouse, providing context for how to use SQLMesh with the Clickhouse engine.
Object naming
Most SQL engines use a three-level hierarchical naming scheme: tables/views are nested within schemas, and schemas are nested within catalogs. For example, the full name of a table might be my_catalog.my_schema.my_table
.
Clickhouse instead uses a two-level hierarchical naming scheme that has no counterpart to catalog. In addition, it calls the second level in the hierarchy "databases." SQLMesh and its documentation refer to this second level as "schemas."
SQLMesh fully supports Clickhouse's two-level naming scheme without user action.
Table engines
Every Clickhouse table is created with a "table engine" that controls how the table's data is stored and queried. Clickhouse's (and SQLMesh's) default table engine is MergeTree
.
The MergeTree
engine family requires that every table be created with an ORDER BY
clause.
SQLMesh will automatically inject an empty ORDER BY
clause into every MergeTree
family table's CREATE
statement, or you can specify the columns/expressions by which the table should be ordered.
Clickhouse modes of operation
Conceptually, it may be helpful to view Clickhouse as having three modes of operation: single server, cluster, and Clickhouse Cloud. SQLMesh supports all three modes.
Single server mode
Single server mode is similar to other SQL engines: aside from choosing each table's engine, you do not need to worry about how computations are executed. You issue standard SQL commands/queries, and Clickhouse executes them.
Cluster mode
Cluster mode allows you to scale your Clickhouse engine to any number of networked servers. This enables massive workloads, but requires that you specify how computations are executed by the networked servers.
Clickhouse coordinates the computations on the networked servers with Clickhouse Keeper (it also supports Apache ZooKeeper).
You specify named virtual clusters of servers in the Keeper configuration, and those clusters provide namespaces for data objects and computations. For example, you might include all networked servers in the cluster you name MyCluster
.
In general, you must be connected to a Clickhouse server to execute commands. By default, each command you execute runs in single-server mode on the server you are connected to.
To associate an object with a cluster, DDL commands that create or modify it must include the text ON CLUSTER [your cluster name]
.
If you provide a cluster name in your SQLMesh connection configuration, SQLMesh will automatically inject the ON CLUSTER
statement into the DDL commands for all objects created while executing the project. We provide more information about clusters in SQLMesh below.
Clickhouse Cloud mode
Clickhouse Cloud is a managed Clickhouse platform. It allows you to scale Clickhouse without administering a cluster yourself or modifying your SQL commands to run on the cluster.
Clickhouse Cloud automates Clickhouse's cluster controls, which sometimes constrains Clickhouse's flexibility or how you execute SQL commands. For example, creating a table with a SELECT
command must occur in two steps on Clickhouse Cloud. SQLMesh handles this limitation for you.
Aside from those constraints, Clickhouse Cloud mode is similar to single server mode - you run standard SQL commands/queries, and Clickhouse Cloud executes them.
Cluster specification
A Clickhouse cluster allows multiple networked Clickhouse servers to operate on the same data object. Every cluster must be named in the Clickhouse configuration files, and that name is passed to a table's DDL statements in the ON CLUSTER
clause.
For example, we could create a table my_schema.my_table
on cluster TheCluster
like this: CREATE TABLE my_schema.my_table ON CLUSTER TheCluster (col1 Int8)
.
To create SQLMesh objects on a cluster, provide the cluster name to the cluster
key in the SQLMesh connection definition (see all connection parameters below).
SQLMesh will automatically inject the ON CLUSTER
clause and cluster name you provide into all project DDL statements.
Model definition
This section describes how you control a table's engine and other Clickhouse-specific functionality in SQLMesh models.
Table engine
SQLMesh uses the MergeTree
table engine with an empty ORDER BY
clause by default.
Specify a different table engine by passing the table engine definition to the model DDL's storage_format
parameter. For example, you could specify the Log
table engine like this:
You may also specify more complex table engine definitions. For example:
ORDER BY
MergeTree
family engines require that a table's CREATE
statement include the ORDER BY
clause.
SQLMesh will automatically inject an empty ORDER BY ()
when creating a table with an engine in the MergeTree
family. This creates the table without any ordering.
You may specify columns/expressions to ORDER BY
by passing them to the model physical_properties
dictionary's order_by
key.
For example, you could order by columns col1
and col2
like this:
Note that there is an =
between the order_by
key name and value (col1, col2)
.
Complex ORDER BY
expressions may need to be passed in single quotes, with interior single quotes escaped by the \
character.
PRIMARY KEY
Table engines may also accept a PRIMARY KEY
specification. Similar to ORDER BY
, specify a primary key in the model DDL's physical_properties
dictionary. For example:
Note that there is an =
between the primary_key
key name and value col1
.
Partitioning
Some Clickhouse table engines support partitioning. Specify the partitioning columns/expressions in the model DDL's partitioned_by
key.
For example, you could partition by columns col1
and col2
like this:
Settings
Clickhouse supports an immense number of settings, many of which can be altered in multiple places: Clickhouse configuration files, Python client connection arguments, DDL statements, SQL queries, and others.
This section discusses how to control Clickhouse settings in SQLMesh.
Connection settings
SQLMesh connects to Python with the clickhouse-connect
library. Its connection method accepts a dictionary of arbitrary settings that are passed to Clickhouse.
Specify these settings in the connection_settings
key. This example demonstrates how to set the distributed_ddl_task_timeout
setting to 300
:
DDL settings
Clickhouse settings may also be specified in DDL commands like CREATE
.
Specify these settings in a model DDL's physical_properties
key (where the order_by
and primary_key
values are specified, if present).
This example demonstrates how to set the index_granularity
setting to 128
:
Note that there is an =
between the index_granularity
key name and value 128
.
Query settings
Clickhouse settings may be specified directly in a model's query with the SETTINGS
keyword.
This example demonstrates setting the join_use_nulls
setting to 1
:
Multiple settings may be specified in a query with repeated use of the SETTINGS
keyword: SELECT * FROM other_table SETTINGS first_setting = 1 SETTINGS second_setting = 2;
.
Usage by SQLMesh
The Clickhouse setting join_use_nulls
affects the behavior of SQLMesh SCD models and table diffs. This section describes how SQLMesh uses query settings to control that behavior.
Background
In general, table JOIN
s can return empty cells for rows not present in both tables.
For example, consider LEFT JOIN
ing two tables left
and right
, where the column right_column
is only present in the right
table. Any rows only present in the left
table will have no value for right_column
in the joined table.
In other SQL engines, those empty cells are filled with NULL
s.
In contrast, Clickhouse fills the empty cells with data type-specific default values (e.g., 0 for integer column types). It will fill the cells with NULL
s instead if you set join_use_nulls
to 1
.
SQLMesh
SQLMesh automatically generates SQL queries for both SCD Type 2 models and table diff comparisons. These queries include table JOIN
s and calculations based on the presence of NULL
values.
Because those queries expect NULL
values in empty cells, SQLMesh automatically adds SETTINGS join_use_nulls = 1
to the generated SCD and table diff SQL code.
The SCD model definition query is embedded as a CTE in the full SQLMesh-generated query. If run alone, the model definition query would use the Clickhouse server's current join_use_nulls
value.
If that value is not 1
, the SQLMesh setting on the outer query would override the server value and produce incorrect results.
Therefore, SQLMesh uses the following procedure to ensure the model definition query runs with the correct join_use_nulls
value:
- If the model query sets
join_use_nulls
itself, do nothing - If the model query does not set
join_use_nulls
and the current serverjoin_use_nulls
value is1
, do nothing - If the model query does not set
join_use_nulls
and the current serverjoin_use_nulls
value is0
, addSETTINGS join_use_nulls = 0
to the CTE model query- All other CTEs and the outer query will still execute with a
join_use_nulls
value of1
- All other CTEs and the outer query will still execute with a
Local/Built-in Scheduler
Engine Adapter Type: clickhouse
Connection options
Option | Description | Type | Required |
---|---|---|---|
type |
Engine type name - must be clickhouse |
string | Y |
user |
The username to log in to your server. | string | Y |
host |
The hostname of your server. Do not include the http:// or https:// prefix. |
string | Y |
port |
The port to connect to your server. Default: 8123 for non-encrypted connections, 8443 for encrypted connections and Clickhouse Cloud. | int | Y |
cluster |
Name of the Clickhouse cluster on which SQLMesh should create objects. Should not be specified for standalone Clickhouse servers or Clickhouse Cloud. | string | N |
use_compression |
Enable compression for ClickHouse HTTP inserts and query results. Default: True | bool | N |
compression_method |
Use a specific compression method for inserts and query results - allowed values lz4 , zstd , br , or gzip . |
str | N |
query_limit |
Maximum number of rows to return for any query response. Default: 0 (unlimited rows) | int | N |
connect_timeout |
HTTP connection timeout in seconds. Default: 10 | int | N |
send_receive_timeout |
Send/receive timeout for the HTTP connection in seconds. Default: 300 | int | N |
verify |
Validate the ClickHouse server TLS/SSL certificate (hostname, expiration, etc.) if using HTTPS/TLS. Default: True | bool | N |
connection_settings |
Arbitrary Clickhouse settings passed to the clickhouse-connect client. |
dict[str, any] | N |