Skip to content

Databricks

Local/Built-in Scheduler

Engine Adapter Type: databricks

Installation

pip install "sqlmesh[databricks]"

Connection info

If you are always running SQLMesh commands directly on a Databricks Cluster (like in a Databricks Notebook) then the only relevant configuration is catalog and it is optional. The SparkSession provided by Databricks will be used to execute all SQLMesh commands.

Otherwise SQLMesh's Databricks implementation uses the Databricks SQL Connector to connect to Databricks by default. If your project contains PySpark DataFrames in Python models then it will use Databricks Connect to connect to Databricks. SQLMesh's Databricks Connect implementation supports Databricks Runtime 13.0 or higher. If SQLMesh detects you have Databricks Connect installed then it will use it for all Python models (so both Pandas and PySpark DataFrames).

Databricks connect execution can be routed to a different cluster than the SQL Connector by setting the databricks_connect_* properties. For example this allows SQLMesh to be configured to run SQL on a Databricks SQL Warehouse while still routing DataFrame operations to a normal Databricks Cluster.

Note: If using Databricks Connect please note the requirements and limitations

Connection options

Option Description Type Required
type Engine type name - must be databricks string Y
server_hostname Databricks instance host name string N
http_path HTTP path, either to a DBSQL endpoint (such as /sql/1.0/endpoints/1234567890abcdef) or to a DBR interactive cluster (such as /sql/protocolv1/o/1234567890123456/1234-123456-slid123) string N
access_token HTTP Bearer access token, such as Databricks Personal Access Token string N
catalog Spark 3.4+ Only if not using SQL Connector. The name of the catalog to use for the connection. Defaults to use Databricks cluster default (most likely hive_metastore). string N
http_headers SQL Connector Only: An optional dictionary of HTTP headers that will be set on every request dict N
session_configuration SQL Connector Only: An optional dictionary of Spark session parameters. Execute the SQL command SET -v to get a full list of available commands. dict N
databricks_connect_server_hostname Databricks Connect Only: Databricks Connect server hostname. Uses server_hostname if not set. string N
databricks_connect_access_token Databricks Connect Only: Databricks Connect access token. Uses access_token if not set. string N
databricks_connect_cluster_id Databricks Connect Only: Databricks Connect cluster ID. Uses http_path if not set. Cannot be a Databricks SQL Warehouse. string N
force_databricks_connect When running locally, force the use of Databricks Connect for all model operations (so don't use SQL Connector for SQL models) bool N
disable_databricks_connect When running locally, disable the use of Databricks Connect for all model operations (so use SQL Connector for all models) bool N

Airflow Scheduler

Engine Name: databricks / databricks-submit / databricks-sql.

Databricks has multiple operators to help differentiate running a SQL query vs. running a Python script.

When evaluating models, the SQLMesh Databricks integration implements the DatabricksSubmitRunOperator. This is needed to be able to run either SQL or Python scripts on the Databricks cluster.

When performing environment management operations, the SQLMesh Databricks integration is similar to the DatabricksSqlOperator, and relies on the same DatabricksSqlHook implementation. All environment management operations are SQL-based, and the overhead of submitting jobs can be avoided.

Engine: databricks-submit

Whether evaluating models or performing environment management operations, the SQLMesh Databricks integration implements the DatabricksSubmitRunOperator.

Engine: databricks-sql

Forces the SQLMesh Databricks integration to use the operator based on the DatabricksSqlOperator for all operations. If your project is pure SQL operations, then this is an option.

To enable support for this operator, the Airflow Databricks provider package should be installed on the target Airflow cluster along with the SQLMesh package with databricks extra as follows:

pip install apache-airflow-providers-databricks
sqlmesh[databricks]

The operator requires an Airflow connection to determine the target Databricks cluster. Refer to Databricks connection for more details. SQLMesh requires that http_path be defined in the connection since it uses this to determine the cluster for both SQL and submit operators.

Example format: databricks://<hostname>?token=<token>&http_path=<http_path>

By default, the connection ID is set to databricks_default, but it can be overridden using both the engine_operator_args and the ddl_engine_operator_args parameters to the SQLMeshAirflow instance. In addition, one special configuration that the SQLMesh Airflow evaluation operator requires is a dbfs path to store an application to load a given SQLMesh model. Also, a payload is stored that contains the information required for SQLMesh to do the loading. This must be defined in the evaluate_engine_operator_args parameter. Example of defining both:

from sqlmesh.schedulers.airflow.integration import SQLMeshAirflow

sqlmesh_airflow = SQLMeshAirflow(
    "databricks",
    default_catalog="<catalog name>",
    engine_operator_args={
        "databricks_conn_id": "<Connection ID>",
        "dbfs_location": "dbfs:/FileStore/sqlmesh",
    },
    ddl_engine_operator_args={
        "databricks_conn_id": "<Connection ID>",
    }
)

for dag in sqlmesh_airflow.dags:
    globals()[dag.dag_id] = dag

Note: If your Databricks connection is configured to run on serverless DBSQL, then you need to define existing_cluster_id or new_cluster in your engine_operator_args. Example:

1
2
3
4
5
6
7
8
sqlmesh_airflow = SQLMeshAirflow(
    "databricks",
    default_catalog="<catalog name>",
    engine_operator_args={
        "dbfs_location": "dbfs:/FileStore/sqlmesh",
        "existing_cluster_id": "1234-123456-slid123",
    }
)