Model Training and Inference¶
Note
The Snowflake Feature Store API is available in the Snowpark ML Python package (snowflake-ml-python
) v1.5.0 and later.
Generating tables for training¶
You can generate a training data set with the feature store’s generate_training_set
method, which enriches a Snowpark DataFrame that contains the source data with the derived feature values. To select a subset of features from a feature view, use fv.slice
.
For time-series features, provide the timestamp column name to automate the point-in-time feature value lookup.
training_set = fs.generate_training_set( spine_df=MySourceDataFrame, features=[registered_fv], save_as="data_20240101", # optional spine_timestamp_col="TS", # optional spine_label_cols=["LABEL1", "LABEL2"], # optional include_feature_view_timestamp_col=False, # optional )
Note
Here, the spine_df
(MySourceDataFrame
) is a DataFrame containing the entity IDs in source data, the time stamp, label columns, and additional columns containing training data. Requested features are retrieved for the list of entity IDs, with point-in-time correctness with respect to the provided time stamp.
Training sets are ephemeral by default; they exist only as Snowpark DataFrames and are not materialized. To materialize the training set to a Table, specify the argument save_as
with a valid, non-existing table name. The training set is written to the newly created table.
Materialized tables currently don’t guarantee immutability and have limited metadata support. If you require these features, consider using Snowflake Datasets instead.
Note
The generate_training_set
API is available in snowflake-ml-python
version 1.5.4
or later.
Generating Snowflake Datasets for training¶
You can generate a Snowflake Dataset using the feature store’s generate_dataset
method. The method signature is similar to generate_training_set
; the key differences are the required name
argument, optional version
argument, and additional metadata fields. generate_dataset
always materializes the result.
Snowflake Datasets provide an immutable, file-based snapshot of data, which helps to ensure model reproducibility and efficient data ingestion for large datasets and/or distributed training. Datasets also have expanded metadata support for easier discoverability and consumption.
The following code illustrates the generation of a dataset from a feature view:
dataset: Dataset = fs.generate_dataset( name="MY_DATASET", spine_df=MySourceDataFrame, features=[registered_fv], version="v1", # optional spine_timestamp_col="TS", # optional spine_label_cols=["LABEL1", "LABEL2"], # optional include_feature_view_timestamp_col=False, # optional desc="my new dataset", # optional )
Model training¶
After creating a training data set, you can pass it to your model when training as follows.
If you generated a Snowpark DataFrame, pass it directly to your model:
my_model = train_my_model(training_set)
If you generated a Snowflake Dataset, convert it to a Snowpark DataFrame and pass it to your model:
my_model = train_my_model(dataset.read.to_snowpark_dataframe())
Once trained, the model can be logged in the Snowflake Model Registry.
Retrieving features and making predictions¶
If you created a model in your Python session, you can retrieve the feature view from the feature store and pass it to your model for prediction, as shown here:
prediction_df: snowpark.DataFrame = fs.retrieve_feature_values( spine_df=prediction_source_dataframe, features=[registered_fv], spine_timestamp_col="TS", exclude_columns=[], ) # predict with your previously trained model my_model.predict(prediction_df)
You can exclude specified columns using the exclude_columns
argument, or include the timestamp column from the feature view by setting include_feature_view_timestamp_col
.