# Spark

## Local/Built-in Scheduler

**Engine Adapter Type**: `spark`

**Note:** Spark may not be used for the SQLMesh state connection.
### Connection options

| Option       | Description                                                                      | Type   | Required |
|--------------|----------------------------------------------------------------------------------|--------|----------|
| `type`       | Engine type name - must be `spark`                                               | string | Y        |
| `config_dir` | Value to set for `SPARK_CONFIG_DIR`                                              | string | N        |
| `catalog`    | The catalog to use when issuing commands. See Catalog Support below for details  | string | N        |
| `config`     | Key/value pairs to set for the Spark configuration                               | dict   | N        |
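
For reference, a minimal sketch of a gateway configuration using these options, assuming the standard SQLMesh YAML layout (the gateway name, paths, and Spark settings below are placeholders):

```yaml
gateways:
  spark_gateway:
    connection:
      type: spark
      config_dir: /opt/spark/conf   # optional: value for SPARK_CONFIG_DIR
      catalog: spark_catalog        # optional: see Catalog Support below
      config:                       # optional: arbitrary Spark configuration key/value pairs
        spark.executor.memory: 2g
        spark.sql.shuffle.partitions: "200"

default_gateway: spark_gateway
```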
## Airflow Scheduler

**Engine Name:** `spark`

The SQLMesh Spark operator is very similar to the Airflow `SparkSubmitOperator` and relies on the same `SparkSubmitHook` implementation.
To enable support for this operator, the Airflow Spark provider package should be installed on the target Airflow cluster as follows:
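```bash
# Standard Airflow Spark provider package; install it on the target Airflow cluster
pip install apache-airflow-providers-apache-spark
```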
The operator requires an Airflow connection to determine the target cluster, queue, and deploy mode in which the Spark Job should be submitted. Refer to Apache Spark connection for more details.
By default, the connection ID is set to `spark_default`, but it can be overridden using the `engine_operator_args` parameter to the `SQLMeshAirflow` instance as in the example below:
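A minimal sketch, assuming the `SQLMeshAirflow` entry point from `sqlmesh.schedulers.airflow.integration`; the connection ID and default catalog shown are placeholders:

```python
from sqlmesh.schedulers.airflow.integration import SQLMeshAirflow

# Point SQLMesh's Spark operator at a custom Airflow connection.
sqlmesh_airflow = SQLMeshAirflow(
    "spark",
    default_catalog="spark_catalog",  # placeholder default catalog
    engine_operator_args={
        "connection_id": "my_spark_connection",  # placeholder connection ID
    },
)

# Expose the generated DAGs to the Airflow scheduler.
for dag in sqlmesh_airflow.dags:
    globals()[dag.dag_id] = dag
```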
The same `engine_operator_args` parameter can be used to override other job submission parameters, such as the number of allocated cores, executors, and so forth. The full list of parameters that can be overridden can be found in `sqlmesh.schedulers.airflow.operators.spark_submit.SQLMeshSparkSubmitOperator`.
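For illustration only, such overrides might look like the following; the parameter names mirror those of Airflow's `SparkSubmitHook`, and `SQLMeshSparkSubmitOperator` remains the authoritative reference for which arguments are accepted:

```python
engine_operator_args = {
    "connection_id": "my_spark_connection",  # placeholder
    "total_executor_cores": 4,               # names mirror SparkSubmitHook arguments
    "executor_memory": "4g",
    "num_executors": 2,
}
```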
### Cluster mode

Each Spark job submitted by SQLMesh is a PySpark application that depends on the SQLMesh library in its Driver process (but not in Executors). This means that if the Airflow connection is configured to submit jobs in `cluster` mode as opposed to `client` mode, the user must ensure that the SQLMesh Python library is installed on each node of the cluster to which Spark jobs are submitted, because there is no way to know in advance which node the Driver process will be scheduled on. No additional configuration is required if the deploy mode is set to `client`.
## Catalog Support

SQLMesh's Spark integration is designed and tested with only a single catalog in mind. Therefore, all SQLMesh models must be defined within a single catalog.

If `catalog` is not set, the behavior depends on the Spark version:

- If >= 3.4, the default catalog is determined at runtime
- If < 3.4, the default catalog is `spark_catalog`