# Spark

## Local/Built-in Scheduler

**Engine Adapter Type**: `spark`

### Connection options
| Option       | Description                                                 | Type   | Required |
|--------------|-------------------------------------------------------------|--------|:--------:|
| `type`       | Engine type name - must be `spark`                          | string | Y        |
| `config_dir` | Value to set for `SPARK_CONFIG_DIR`                         | string | N        |
| `catalog`    | Spark 3.4+ only. The catalog to use when issuing commands   | string | N        |
| `config`     | Key/value pairs to set for the Spark configuration          | dict   | N        |
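
For reference, a gateway definition in `config.yaml` using these options might look like the following sketch (the gateway name, paths, catalog, and Spark settings are illustrative, not required values):

```yaml
gateways:
  spark:
    connection:
      type: spark
      config_dir: /opt/spark/conf   # value for SPARK_CONFIG_DIR (path is illustrative)
      catalog: my_catalog           # Spark 3.4+ only
      config:
        spark.executor.memory: 2g
        spark.sql.shuffle.partitions: 200

default_gateway: spark
```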
## Airflow Scheduler

**Engine Name:** `spark`
The SQLMesh Spark operator is very similar to the Airflow `SparkSubmitOperator`, and relies on the same `SparkSubmitHook` implementation.
To enable support for this operator, the Airflow Spark provider package (`apache-airflow-providers-apache-spark`) should be installed on the target Airflow cluster, for example:
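
```bash
# Install the Apache Spark provider; pin a version compatible with your Airflow release.
pip install apache-airflow-providers-apache-spark
```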
The operator requires an Airflow connection to determine the target cluster, queue, and deploy mode in which the Spark Job should be submitted. Refer to Apache Spark connection for more details.
By default, the connection ID is set to `spark_default`, but it can be overridden using the `engine_operator_args` parameter to the `SQLMeshAirflow` instance as in the example below:
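
(A minimal sketch; the `connection_id` key, `default_catalog` value, and connection name shown here are assumptions and may vary between SQLMesh versions.)

```python
from sqlmesh.schedulers.airflow import SQLMeshAirflow

sqlmesh_airflow = SQLMeshAirflow(
    "spark",
    default_catalog="spark_catalog",  # adjust to your project's default catalog
    engine_operator_args={
        "connection_id": "spark_warehouse",  # overrides the default "spark_default"
    },
)

# Make the generated DAGs visible to the Airflow scheduler.
for dag in sqlmesh_airflow.dags:
    globals()[dag.dag_id] = dag
```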
The `engine_operator_args` parameter can also be used to override other job submission parameters, such as the number of allocated cores, executors, and so forth. The full list of parameters that can be overridden can be found in `sqlmesh.schedulers.airflow.operators.spark_submit.SQLMeshSparkSubmitOperator`.
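
For instance, executor resources could be tuned the same way; the argument names below are assumed to be forwarded to the underlying `SparkSubmitHook`, so verify them against `SQLMeshSparkSubmitOperator` for your version:

```python
# Hypothetical overrides; check SQLMeshSparkSubmitOperator for the exact supported names.
engine_operator_args = {
    "connection_id": "spark_warehouse",
    "total_executor_cores": 4,
    "executor_memory": "4g",
    "num_executors": 2,
}
```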
### Cluster mode
Each Spark job submitted by SQLMesh is a PySpark application that depends on the SQLMesh library in its Driver process (but not in its Executors). This means that if the Airflow connection is configured to submit jobs in `cluster` mode as opposed to `client` mode, the user must ensure that the SQLMesh Python library is installed on every node of the cluster to which Spark jobs are submitted, because there is no way to know in advance on which specific node the Driver process will be scheduled. No additional configuration is required if the deploy mode is set to `client`.
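
The deploy mode itself is part of the Airflow connection definition. As a sketch, such a connection could be created via the Airflow CLI; the connection name, master URL, and extras below are placeholders, and the supported extra fields may differ between provider versions:

```bash
# Illustrative only: replace the connection name, master URL, and extras with your own values.
airflow connections add spark_warehouse \
    --conn-type spark \
    --conn-host 'spark://spark-master:7077' \
    --conn-extra '{"queue": "default", "deploy-mode": "client"}'
```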