StarRocks
Overview
StarRocks is a next-generation sub-second MPP OLAP database designed for real-time analytics. It provides high concurrency, low latency, and supports both batch and stream processing.
SQLMesh supports StarRocks through its MySQL-compatible protocol, providing StarRocks-specific optimizations for table models, indexing, partitioning, and more. The adapter leverages StarRocks's strengths for analytical workloads with sensible defaults and advanced configuration support.
Prerequisites
- Install SQLMesh with the StarRocks extra:
- Initialize a SQLMesh project (if you haven't already):
- Configure a separate state backend:
- StarRocks is currently not supported as a SQLMesh
state_connection. - Use DuckDB (recommended) or another engine for SQLMesh state.
Connection Configuration Example
StarRocks setup note (optional)
If you're running a shared-nothing cluster with a single backend, you may need to adjust the default replication number:
Quickstart
1) A minimal table (DUPLICATE KEY default)
MODEL (
name user_events,
kind FULL,
physical_properties (
distributed_by = RANDOM
)
);
SELECT
user_id,
event_time,
event_type
FROM source.user_events;
A DUPLICATE KEY table can usually be used as a FULL kind model.
2) An incremental table (PRIMARY KEY required)
MODEL (
name user_events_inc,
kind INCREMENTAL_BY_TIME_RANGE(
time_column event_date
),
physical_properties (
primary_key = (user_id, event_date),
partition_by = (date_trunc('day', event_date)),
distributed_by = (kind=HASH, expressions=user_id, buckets=16)
)
);
SELECT
user_id,
event_date,
COUNT(*) AS cnt
FROM source.user_events
WHERE event_date BETWEEN @start_ds AND @end_ds
GROUP BY user_id, event_date;
Table Types
StarRocks supports four table types: DUPLICATE KEY, PRIMARY KEY, UNIQUE KEY, and AGGREGATE KEY.
SQLMesh configures StarRocks table types via physical_properties (engine-specific table properties).
Note: StarRocks
AGGREGATE KEYrequires per-value-column aggregation functions, which SQLMesh model syntax DOES NOT currently support. UsePRIMARY KEYorDUPLICATE KEYinstead.
DUPLICATE KEY Type (Default)
If you do not set a key type, StarRocks creates a DUPLICATE KEY table by default.
Example:
PRIMARY KEY Type
For incremental models, a PRIMARY KEY table is mandatory. StarRocks only supports the full DELETE ... WHERE ... and MERGE semantics that incremental kinds rely on (such as INCREMENTAL_BY_TIME_RANGE, INCREMENTAL_BY_UNIQUE_KEY, INCREMENTAL_BY_PARTITION, and SCD_TYPE_2) on PRIMARY KEY tables. On DUPLICATE KEY, UNIQUE KEY, and AGGREGATE KEY tables these operations are not supported well enough.
SQLMesh enforces this: an incremental model on StarRocks without a primary key fails fast with a clear error. Set physical_properties.primary_key, for example physical_properties (primary_key = (user_id, event_date)). As a convenience, an INCREMENTAL_BY_UNIQUE_KEY model's unique_key is automatically promoted to a PRIMARY KEY table.
SQLMesh engine also applies conservative WHERE transformations for compatibility (for example, converting BETWEEN to >= AND <=, removing boolean literals, and converting DELETE ... WHERE TRUE to TRUNCATE TABLE).
SQLMesh currently does not support specifying
primary_keyas a model parameter.
Example (INCREMENTAL_BY_TIME_RANGE):
MODEL (
name user_events,
kind INCREMENTAL_BY_TIME_RANGE(
time_column event_date
),
physical_properties (
primary_key = (user_id, event_date),
distributed_by = (kind=HASH, expressions=user_id, buckets=16)
)
);
SELECT
user_id,
event_date,
COUNT(*) AS cnt
FROM source.user_events
WHERE event_date BETWEEN @start_ds AND @end_ds
GROUP BY user_id, event_date;
UNIQUE KEY Type
You can create a UNIQUE KEY table by setting physical_properties.unique_key. Note that a UNIQUE KEY table is not sufficient for incremental models — incremental kinds require a PRIMARY KEY table (see PRIMARY KEY Type).
Example:
MODEL (
name user_events_unique,
kind FULL,
physical_properties (
unique_key = (user_id, event_date),
distributed_by = (kind=HASH, expressions=user_id, buckets=16)
)
);
Table Properties
This section documents StarRocks engine-specific table properties via physical_properties (...) (table properties). Most properties support:
- Structured form (recommended): easier validation and clearer intent
- String fallback: for convenience or when you want to paste native StarRocks syntax quickly
Most of the time, the value syntax is the same or similar as a corresponding clause in StarRocks, espacially for a string type value.
When specifying string values, prefer single quotes.
Configuration Matrix
| Property | Where | Recommended form | String fallback | Notes |
|---|---|---|---|---|
primary_key |
physical_properties |
primary_key = (col1, col2) |
primary_key = 'col1, col2' |
Required for PRIMARY KEY tables (recommended for incremental). |
duplicate_key |
physical_properties |
duplicate_key = (col1, col2) |
duplicate_key = 'col1, col2' |
Explicitly sets DUPLICATE KEY table type. |
unique_key |
physical_properties |
unique_key = (col1, col2) |
unique_key = 'col1, col2' |
Sets UNIQUE KEY table type. |
partitioned_by / partition_by |
MODEL / physical_properties |
partitioned_by (dt) (model param) / partition_by = RANGE(dt, region) (table property) |
partition_by = 'RANGE(dt, region)' |
Its' recommended to use partition_by in physical_properties for RANGE/LIST partitioning together with partitions. |
partitions |
physical_properties |
partitions = ('PARTITION ...', 'PARTITION ...') |
partitions = 'PARTITION ...' |
Initial partitions; easiest to express as strings. When using RANGE or LIST partitioning, you need to specify initial partitions. |
distributed_by |
physical_properties |
distributed_by = (kind=HASH, expressions=(c1, c2), buckets=10) |
distributed_by = 'HASH(c1, c2) BUCKETS 10' / distributed_by = 'RANDOM' |
|
clustered_by / order_by |
MODEL / physical_properties |
clustered_by (col1, col2) / order_by = (col1, col2) |
order_by = 'col1, col2' |
Ordering/clustering columns for query performance if it's not the same as the table key. |
| Other properties | physical_properties |
Use strings (recommended) | Use strings | StarRocks PROPERTIES are string key/value pairs. |
Notes:
- You can use enum-like values without quotes (for example
HASH,RANDOM,IMMEDIATE), but strings are also accepted (prefer single quotes). - Aliases exist for convenience: use
partition_by(table property) as an alias ofpartitioned_by(model parameter), andorder_by↔clustered_by. - Only several properties can be set as model
parameters:
partitioned_by,clustered_by. But, for simplity, you're recommended to use table properties only.
Table Key Properties
Table key properties accept multiple forms:
- Structured:
color(col1, col2, ...) - String:
'col'or'col1, col2'
Syntax:
- Structured:
primary_key = col,primary_key = (col1, col2),duplicate_key = (col2) - String:
primary_key = 'col1, col2',unique_key = '(col2, col3)'.
PRIMARY KEY
MODEL (
name my_pk_table,
kind FULL,
physical_properties (
primary_key = (id, ds),
distributed_by = (kind=HASH, expressions=id, buckets=10)
)
);
DUPLICATE KEY
MODEL (
name my_dup_table,
kind FULL,
physical_properties (
duplicate_key = (id, ds),
distributed_by = RANDOM
)
);
UNIQUE KEY
MODEL (
name my_unique_table,
kind FULL,
physical_properties (
unique_key = (id, ds),
distributed_by = (kind=HASH, expressions=id, buckets=10)
)
);
Partitioning
StarRocks supports RANGE partitioning, LIST partitioning, and expression partitioning.
You can specify partitioning either:
- As a model parameter:
partitioned_by (...)(good for simple expressions) - As a table property:
physical_properties(partition_by=...)(recommended when you need RANGE/LIST, or complex expressions)
For RANGE and LIST partitioning, you generally need to provide initial partitions (pre-created partitions). For expression partitioning, partitions is usually not needed.
partitioned_by / partition_by
NOTE:
partitioned_by (...)can only be used as a model parameter (SQLMesh enforces this constraint).partition_bycan be provided inphysical_propertiesas table properties (for advanced partitioning).
Syntax:
- Expression list:
partitioned_by (col)/partitioned_by (expr1, expr2) - for complex example:
partition_by = (date_trunc('day', col2), col3) - RANGE/LIST:
partition_by = RANGE(col1, col2)/partition_by = LIST(col1, col2) - String fallback:
partition_by = 'RANGE(col1, col2)'
partitions
Syntax:
- Tuple of strings:
partitions = ('PARTITION ...', 'PARTITION ...') - Single string:
partitions = 'PARTITION ...'
Expression partitioning
MODEL (
name my_partitioned_model,
kind INCREMENTAL_BY_TIME_RANGE(time_column event_date),
partitioned_by (date_trunc('day', event_time), region),
physical_properties (
primary_key = (user_id, event_date, region),
distributed_by = (kind=HASH, expressions=user_id, buckets=10)
)
);
RANGE partitioning
MODEL (
name my_partitioned_model_advanced,
kind FULL,
physical_properties (
partition_by = RANGE(event_time),
partitions = (
'PARTITION p20240101 VALUES [("2024-01-01"), ("2024-01-02"))',
'PARTITION p20240102 VALUES [("2024-01-02"), ("2024-01-03"))'
),
distributed_by = (kind=HASH, expressions=region, buckets=10)
)
);
It's similar for LIST partitioning as RANGE partitioning.
Distribution
StarRocks supports both HASH and RANDOM distribution. You can use a structured value or a string.
- Structured type syntax:
(kind=<kind> [, expressions=<columns>] [, buckets=<num>])
- kind:
HASHORRANDOM. - expressions: a single column or a tuple of columns, such as
col1or(col1, col2). (optional) - buckets: bucket number. (optional)
- String type is similar as:
'HASH(id) BUCKETS 10', which is the same as the distribution clause in StarRocks'sCREATE TABLE. - Or even a single enum-like value:
distributed_by = RANDOM.
HASH distribution
Structured type (recommended):
MODEL (
name my_table,
kind FULL,
physical_properties (
distributed_by = (kind=HASH, expressions=(user_id), buckets=10)
)
);
RANDOM distribution
Simple enumerate type:
String fallback
A single string, which is the same as the clause in StarRocks's CREATE TABLE.
MODEL (
name my_table_string_dist,
kind FULL,
physical_properties (
distributed_by = 'HASH(user_id) BUCKETS 10'
)
);
Ordering
You can use clustered_by or order_by to specify the column ordering to optimize query performance if it's not the same the table key.
You can specify clustered_by both as a model parameter and a table property, but you can only specify order_by as a table property.
Syntax:
- Structured:
order_by = col/order_by = (col1, col2) - String fallback:
order_by = 'col1, col2'
MODEL (
name my_ordered_table,
kind FULL,
physical_properties (
order_by = (ds, id),
distributed_by = (kind=HASH, expressions=id, buckets=10)
)
);
Generic PROPERTIES
Any additional properties in physical_properties are passed through as StarRocks PROPERTIES. Since StarRocks PROPERTIES values are typically strings, using strings is recommended.
MODEL (
name advanced_table,
kind FULL,
physical_properties (
primary_key = (id),
distributed_by = (kind=HASH, expressions=id, buckets=8),
replication_num = '1',
storage_medium = 'SSD',
enable_persistent_index = 'true',
compression = 'LZ4'
)
);
Views and Materialized Views
Views
StarRocks supports view SECURITY via virtual_properties.security.
Syntax:
security = INVOKERorsecurity = NONE. (optional)
MODEL (
name user_summary_view,
kind VIEW,
virtual_properties (
security = INVOKER
)
);
SELECT
user_id,
COUNT(*) AS event_count,
MAX(event_time) AS last_event_time
FROM user_events
GROUP BY user_id;
Materialized Views (MV)
SQLMesh uses kind VIEW (materialized true) to create materialized views.
For ASYNC MVs, StarRocks requires a REFRESH clause, so you must specify at least one of refresh_moment or refresh_scheme.
MV properties (including refresh_moment / refresh_scheme and other table-like properties such as partitioning, distribution, ordering, and generic properties) must be specified in physical_properties.
Refresh properties:
refresh_moment:IMMEDIATEorDEFERRED(optional)refresh_scheme:MANUALorASYNC ...(optional)- If you specify it with the
START/EVERY, you must specify it as a whole string, quoted by a pair of quotes. - Examples:
ASYNC,MANUAL,ASYNC START ("2024-01-01 00:00:00") EVERY (INTERVAL 5 MINUTE) - The syntax of
ASYNC ...clause is the same as the clause in StarRocks.
MODEL (
name user_summary_mv,
kind VIEW (
materialized true
),
physical_properties (
refresh_moment = DEFERRED,
refresh_scheme = 'ASYNC START ("2024-01-01 00:00:00") EVERY (INTERVAL 5 MINUTE)'
)
);
SELECT
user_id,
COUNT(*) AS event_count,
MAX(event_time) AS last_event_time
FROM user_events
GROUP BY user_id;
Audits on materialized views:
Audits require data to exist in the materialized view when they run. Because StarRocks refreshes async MVs as background jobs, the data is not guaranteed to be present immediately after the MV is created. To make audits deterministic, when a materialized view has audits SQLMesh issues a synchronous REFRESH MATERIALIZED VIEW <name> WITH SYNC MODE right after creating the MV, which blocks until the data is materialized.
For this to work safely, a materialized view with audits must set refresh_moment = 'DEFERRED'. This prevents StarRocks' automatic (IMMEDIATE) refresh from racing with the synchronous refresh that SQLMesh issues. If the MV has audits and refresh_moment is IMMEDIATE (or unset, which defaults to IMMEDIATE in StarRocks), SQLMesh raises an error before creating the MV.
MODEL (
name user_summary_mv,
kind VIEW (
materialized true
),
audits (
not_null(columns := (user_id))
),
physical_properties (
-- required when the MV has audits
refresh_moment = DEFERRED,
refresh_scheme = 'ASYNC'
)
);
SELECT user_id, COUNT(*) AS event_count FROM user_events GROUP BY user_id;
Excluding tables from refresh:
excluded_trigger_tables and excluded_refresh_tables let you control which base tables participate in an async MV's refresh cycle:
excluded_trigger_tables: base tables whose data changes should not automatically trigger a refresh of this MV.excluded_refresh_tables: base tables that should not be scanned when the MV refreshes.
Both properties accept a single table reference or a comma-separated list of table references.
StarRocks requires the physical base table name for these properties, not the logical view name that SQLMesh normally exposes. SQLMesh handles this automatically: when a reference matches a managed SQLMesh model, the logical name is resolved to its physical table name before the CREATE MATERIALIZED VIEW statement is issued. References that do not match any managed model are passed through unchanged.
MODEL (
name mydb.order_summary_mv,
kind VIEW (
materialized true
),
physical_properties (
refresh_scheme = 'ASYNC',
-- SQLMesh resolves mydb.orders and mydb.order_items to their physical table names
excluded_trigger_tables = 'mydb.orders,mydb.order_items',
excluded_refresh_tables = mydb.orders
)
);
SELECT order_id, SUM(amount) AS total FROM mydb.orders GROUP BY order_id;
A single reference can be written as a bare identifier (mydb.orders) or as a quoted string. Multiple references must be provided as a quoted, comma-separated string ('mydb.orders,mydb.order_items').
Other properties:
You can specify partitioning, distribution, order by and properties the same as normal table properties. But notice that only supported MV properties are useful, Refer to StarRocks' doc for MV creation.
Notes:
-
SQLMesh does not recreate materialized views on every
sqlmesh run. Once an MV exists, SQLMesh leaves it in place and lets StarRocks keep it current. This is intentional:- StarRocks async MVs revalidate themselves automatically, even when the underlying data is dropped, so a periodic drop-and-recreate is unnecessary.
- StarRocks async MVs either refresh automatically (per their
refresh_scheme) or can be refreshed explicitly withREFRESH MATERIALIZED VIEW, which also enables partition-level (incremental) refresh. A SQLMesh-driven recreate would instead force a full rebuild.
The MV is (re)built only when it does not yet exist — for example when you first deploy it, or when a model change produces a new version. To change a materialized view's definition, update the model and run
sqlmesh plan. * There are some restriction forpartitioning, you need to refer StarRocks' doc for MV partitioning specification. * StarRocks MV schema supports a column list but does not support explicit data types in that list. Column data types come from theAS SELECT ...query. * If you create MVs from a dataframe via the Python API, providetarget_columns_to_types(aDict[str, exp.DataType]). If you don't care about exact types, you can set all columns toVARCHARas a fallback:
from sqlglot import exp
target_columns_to_types = {
"col1": exp.DataType.build("VARCHAR"),
"col2": exp.DataType.build("VARCHAR"),
}
Limitations
- No SYNC MV support: synchronous materialized views are not supported yet.
-
FULLmodels are not replaced atomically: StarRocks does not supportCREATE OR REPLACE TABLEand has no multi-statement transactions (in version 3.5 and lower), so SQLMesh refreshes aFULLmodel by emptying the existing table (aTRUNCATE, or aDELETEwhen a filter applies) and then inserting the new result set as separate, auto-committed statements. There is a brief window between the truncate/delete and the completion of the insert during which the table is empty or partially populated, so readers querying it during that window may see missing or incomplete data. Incremental kinds (e.g.INCREMENTAL_BY_TIME_RANGE,INCREMENTAL_BY_PARTITION) do not fully eliminate this — StarRocks applies them as the same non-atomic delete-then-insert — but they narrow the affected rows to the partition/time range being processed rather than emptying the whole table, so unaffected partitions remain readable throughout. SQLMesh has no way to make these replacements atomic on StarRocks 3.5 and lower.Future work: this PR targeted StarRocks 3.5, but StarRocks has since expanded its capabilities considerably (the integration now runs against 4.1). Later work should investigate using
INSERT OVERWRITEtogether with the transactional/atomic-swap guarantees available in newer StarRocks versions to close this gap (see theINSERT_OVERWRITE_STRATEGYandSUPPORTS_TRANSACTIONSflags in the StarRocks engine adapter). * No tuple IN: StarRocks does not support(c1, c2) IN ((v1, v2), ...). * NoSELECT ... FOR UPDATE: StarRocks is an OLAP database and does not support row locks; SQLMesh removesFOR UPDATEwhen executing SQLGlot expressions. * RENAME caveat:ALTER TABLE db.old RENAME db.newis not supported; theRENAMEtarget cannot be qualified with a database name.
Dependencies
To use StarRocks with SQLMesh, install the required MySQL driver: