Python models
Although SQL is a powerful tool, some use cases are better handled by Python. For example, Python may be a better option in pipelines that involve machine learning, interacting with external APIs, or complex business logic that cannot be expressed in SQL.
SQLMesh has first-class support for models defined in Python; there are no restrictions on what can be done in the Python model as long as it returns a Pandas or Spark DataFrame instance.
Definition
To create a Python model, add a new file with the `*.py` extension to the `models/` directory. Inside the file, define a function named `execute`. For example:
The `execute` function is wrapped with the `@model` decorator, which is used to capture the model's metadata (similar to the `MODEL` DDL statement in SQL models).
Because SQLMesh creates tables before evaluating models, the schema of the output DataFrame is a required argument. The `@model` argument `columns` contains a dictionary of column names to types.
The function takes an `ExecutionContext` that is able to run queries and to retrieve the current time interval being processed, along with arbitrary key-value arguments passed in at runtime. The function can return either a Pandas or a PySpark DataFrame instance.
If the function output is too large, it can also be returned in chunks using Python generators.
`@model` specification
The arguments provided in the `@model` specification have the same names as those provided in a SQL model's `MODEL` DDL.
Most of the arguments are simply Python-formatted equivalents of the SQL versions, but Python model `kind`s are specified with model kind objects. All model kind arguments are listed in the models configuration reference page. A model's `kind` object must be imported at the beginning of the model definition file before it is used in the model specification.
Supported model kind objects include:
- ViewKind()
- FullKind()
- SeedKind()
- IncrementalByTimeRangeKind()
- IncrementalByUniqueKeyKind()
- SCDType2Kind()
- EmbeddedKind()
- ExternalKind()
This example demonstrates how to specify an incremental by time range model kind in Python:
Execution context
Python models can do anything you want, but it is strongly recommended for all models to be idempotent. Python models can fetch data from upstream models or even data outside of SQLMesh.
Given an `ExecutionContext` `context`, you can fetch a DataFrame with the `fetchdf` method:
Dependencies
In order to fetch data from an upstream model, you first get the table name using `context`'s `table` method. This returns the appropriate table name for the current runtime environment:
The `table` method will automatically add the referenced model to the Python model's dependencies.
The only other way to set a Python model's dependencies is to define them explicitly in the `@model` decorator using the keyword `depends_on`. The dependencies defined in the model decorator take precedence over any dynamic references inside the function.
In this example, only `upstream_dependency` will be captured, while `another_dependency` will be ignored:
Examples
Basic
The following is an example of a Python model returning a static Pandas DataFrame.
Note: All of the metadata field names are the same as those in the SQL `MODEL` DDL.
SQL Query and Pandas
The following is a more complex example that queries an upstream model and outputs a Pandas DataFrame:
PySpark
This example demonstrates using the PySpark DataFrame API. If you use Spark, the DataFrame API is preferred to Pandas since it allows you to compute in a distributed fashion.
Batching
If the output of a Python model is very large and you cannot use Spark, it may be helpful to split the output into multiple batches.
With Pandas or other single machine DataFrame libraries, all data is stored in memory. Instead of returning a single DataFrame instance, you can return multiple instances using the Python generator API. This minimizes the memory footprint by reducing the size of data loaded into memory at any given time.
This example uses the Python generator keyword `yield` to batch the model output:
Serialization
SQLMesh executes Python code locally, on the machine where SQLMesh itself is running, using its custom serialization framework.