Although SQL is a powerful tool, some use cases are better handled by Python. For example, Python may be a better option in pipelines that involve machine learning, interacting with external APIs, or complex business logic that cannot be expressed in SQL.
SQLMesh has first-class support for models defined in Python; there are no restrictions on what can be done in the Python model as long as it returns a Pandas or Spark DataFrame instance.
To create a Python model, add a new file with the *.py extension to the models/ directory. Inside the file, define a function named execute. For example:
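A minimal sketch of such a file follows. It assumes the decorator is importable as "from sqlmesh import model" and that the interval bounds arrive as start and end parameters; the model name docs_example.minimal is made up for illustration. The try/except fallback is not part of a real model file and is there only so the snippet runs where SQLMesh is not installed.

```python
import typing as t
from datetime import datetime

import pandas as pd

try:
    from sqlmesh import model
except ImportError:
    # Fallback so the sketch still runs where SQLMesh is not installed:
    # a no-op decorator with the same call shape as @model.
    def model(*args: t.Any, **kwargs: t.Any) -> t.Callable:
        return lambda func: func


@model(
    "docs_example.minimal",  # hypothetical model name
    columns={"id": "int", "name": "text"},  # output schema: column name -> type
)
def execute(context: t.Any, start: datetime, end: datetime, **kwargs: t.Any) -> pd.DataFrame:
    # context is the ExecutionContext; start and end bound the interval being processed.
    # Any further runtime arguments are absorbed by **kwargs.
    return pd.DataFrame([{"id": 1, "name": "example"}])
```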
Because SQLMesh creates tables before evaluating models, the schema of the output DataFrame is a required argument. The columns argument of the @model decorator is a dictionary mapping column names to types.
The function takes an ExecutionContext, which can run queries and retrieve the time interval currently being processed, along with arbitrary key-value arguments passed in at runtime. The function can return either a Pandas or a PySpark DataFrame instance.
If the function output is too large, it can also be returned in chunks using Python generators.
Python models can do anything you want, but it is strongly recommended for all models to be idempotent. Python models can fetch data from upstream models or even data outside of SQLMesh.
Given an ExecutionContext named context, you can fetch a DataFrame with the fetchdf method.
To fetch data from an upstream model, first get the table name using the context's table method. This returns the appropriate table name for the current runtime environment:
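The lookup and the query can be sketched together as shown below. StandInContext is a hypothetical stub that exists only so the snippet runs without a warehouse: the physical table name it returns, the canned rows, and the upstream model name docs_example.upstream_model are all made up. In a real model the context is supplied by SQLMesh.

```python
import pandas as pd


class StandInContext:
    """Hypothetical stub mimicking only the two ExecutionContext methods used here."""

    def table(self, model_name: str) -> str:
        # Made-up naming scheme; the real name depends on the runtime environment.
        return f"dev__{model_name.replace('.', '__')}"

    def fetchdf(self, query: str) -> pd.DataFrame:
        # A real context would run the query; the stub records it and returns canned rows.
        self.last_query = query
        return pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})


def read_upstream(context: StandInContext) -> pd.DataFrame:
    # Resolve the environment-specific table name first, then query it.
    table = context.table("docs_example.upstream_model")
    return context.fetchdf(f"SELECT id, name FROM {table}")


context = StandInContext()
df = read_upstream(context)
```

Resolving the name through the context, rather than hard-coding it, is what lets the same model code read from the right table in each environment.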
The table method automatically adds the referenced model to the Python model's dependencies.
The only other way to set a Python model's dependencies is to define them explicitly in the @model decorator using the depends_on keyword. The dependencies defined in the model decorator take precedence over any dynamic references inside the function.
In this example, only upstream_dependency will be captured, while another_dependency will be ignored:
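A sketch of what such a model could look like is given here. The model name and the id column are hypothetical, the decorator import path is an assumption, and the try/except fallback is included only so the snippet runs where SQLMesh is not installed.

```python
import typing as t

import pandas as pd

try:
    from sqlmesh import model
except ImportError:
    # Fallback so the sketch still runs where SQLMesh is not installed.
    def model(*args: t.Any, **kwargs: t.Any) -> t.Callable:
        return lambda func: func


@model(
    "docs_example.explicit_dependencies",  # hypothetical model name
    depends_on=["upstream_dependency"],  # explicit list: takes precedence
    columns={"id": "int"},
)
def execute(context: t.Any, start: t.Any, end: t.Any, **kwargs: t.Any) -> pd.DataFrame:
    # Dynamic reference: not recorded as a dependency because depends_on is set.
    table = context.table("another_dependency")
    return context.fetchdf(f"SELECT id FROM {table}")
```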
The following is an example of a Python model returning a static Pandas DataFrame.
Note: All of the metadata field names are the same as those in the SQL MODEL DDL.
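As a hedged illustration, the sketch below returns a static DataFrame and sets two metadata fields, owner and cron, in the decorator. The model name, owner value, and row contents are invented; the import path is an assumption, and the fallback decorator is there only so the snippet runs without SQLMesh installed.

```python
import typing as t

import pandas as pd

try:
    from sqlmesh import model
except ImportError:
    # Fallback so the sketch still runs where SQLMesh is not installed.
    def model(*args: t.Any, **kwargs: t.Any) -> t.Callable:
        return lambda func: func


@model(
    "docs_example.basic",  # hypothetical model name
    owner="janet",  # metadata field, illustrative value
    cron="@daily",  # metadata field, illustrative value
    columns={"id": "int", "name": "text"},
)
def execute(context: t.Any, start: t.Any, end: t.Any, **kwargs: t.Any) -> pd.DataFrame:
    # The output is static, so neither the context nor the interval is used.
    rows = [
        {"id": 1, "name": "Laura"},
        {"id": 2, "name": "John"},
        {"id": 3, "name": "Lucie"},
    ]
    return pd.DataFrame(rows)
```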
SQL Query and Pandas
The following is a more complex example that queries an upstream model and outputs a Pandas DataFrame:
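One way to sketch this pattern is shown below: the model resolves the upstream table, queries it, and then transforms the result in Pandas. SqliteStandInContext is a hypothetical stub backed by an in-memory SQLite database so the snippet can run end to end; its table-naming rule, the model names, and the sample rows are assumptions, as is the decorator import path.

```python
import sqlite3
import typing as t

import pandas as pd

try:
    from sqlmesh import model
except ImportError:
    # Fallback so the sketch still runs where SQLMesh is not installed.
    def model(*args: t.Any, **kwargs: t.Any) -> t.Callable:
        return lambda func: func


class SqliteStandInContext:
    """Hypothetical stub for ExecutionContext, backed by in-memory SQLite."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def table(self, model_name: str) -> str:
        # Made-up mapping from model name to physical table name.
        return model_name.replace(".", "__")

    def fetchdf(self, query: str) -> pd.DataFrame:
        return pd.read_sql_query(query, self.conn)


@model("docs_example.sql_pandas", columns={"id": "int", "name": "text"})
def execute(context: t.Any, start: t.Any, end: t.Any, **kwargs: t.Any) -> pd.DataFrame:
    # Query the upstream model, then post-process the result in Pandas.
    table = context.table("docs_example.upstream_model")
    df = context.fetchdf(f"SELECT id, name FROM {table} ORDER BY id")
    df["name"] = df["name"].str.upper()
    return df


# Demo setup: create and populate the stand-in upstream table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs_example__upstream_model (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO docs_example__upstream_model VALUES (?, ?)",
    [(1, "laura"), (2, "john")],
)
result = execute(SqliteStandInContext(conn), None, None)
```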
This example demonstrates using the PySpark DataFrame API. If you use Spark, the DataFrame API is preferred to Pandas since it allows you to compute in a distributed fashion.
If the output of a Python model is very large and you cannot use Spark, it may be helpful to split the output into multiple batches.
With Pandas or other single-machine DataFrame libraries, all data is held in memory. Instead of returning a single DataFrame instance, you can return multiple instances using the Python generator API. This reduces the memory footprint by limiting how much data is loaded into memory at any given time.
This example uses Python's generator yield to batch the model output:
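A small sketch of the batching pattern, under the same assumptions as before: the model name, the row count, and the batch size are arbitrary, the import path is assumed, and the fallback decorator is there only so the snippet runs without SQLMesh installed.

```python
import typing as t

import pandas as pd

try:
    from sqlmesh import model
except ImportError:
    # Fallback so the sketch still runs where SQLMesh is not installed.
    def model(*args: t.Any, **kwargs: t.Any) -> t.Callable:
        return lambda func: func


TOTAL_ROWS = 10  # illustrative output size
BATCH_SIZE = 4  # illustrative number of rows per yielded DataFrame


@model("docs_example.batching", columns={"id": "int"})
def execute(context: t.Any, start: t.Any, end: t.Any, **kwargs: t.Any) -> t.Iterator[pd.DataFrame]:
    # Yield one small DataFrame per batch instead of building the full output,
    # so only BATCH_SIZE rows are materialized at a time.
    for batch_start in range(0, TOTAL_ROWS, BATCH_SIZE):
        batch_end = min(batch_start + BATCH_SIZE, TOTAL_ROWS)
        yield pd.DataFrame({"id": range(batch_start, batch_end)})
```

Each yielded DataFrame should have the same columns as the declared schema; only the number of rows varies between batches.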
SQLMesh executes Python code locally where SQLMesh is running by using our custom serialization framework.