|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Deploying Python ML in PySpark\n", |
| 8 | + "\n", |
| 9 | + "----\n", |
| 10 | + "\n", |
| 11 | + "This notebook intends to introduce a PySpark `pandas_udf` function that can be used to **deploy both python ML models and sophisticated pipelines**. With this in mind please excuse the barbaric use of `RandomForestRegressor`. \n", |
| 12 | + "\n", |
| 13 | + "In this notebook we deploy sklearn's `RandomForestRegressor` in PySpark. The [Titanic](https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv) dataset is used simply for convenience. In the examples below a number of features are used to estimate ticket \"fare\", we will fit our models/pipelines in pandas and deploy in PySpark. **Please be aware** this notebook uses the `pyspark` package, therefore any user looking to run the cells below will need to install PySpark.\n", |
| 14 | + "\n", |
| 15 | + "In practice the model used below can be replaced by any other predictive python model, be that a `RandomForestClassifier`, `XGBoost`, `LightGBM` or any other package you care to use with an sklearn like API." |
| 16 | + ] |
| 17 | + }, |
| 18 | + { |
| 19 | + "cell_type": "code", |
| 20 | + "execution_count": null, |
| 21 | + "metadata": {}, |
| 22 | + "outputs": [], |
| 23 | + "source": [ |
| 24 | + "import shutil\n", |
| 25 | + "from datetime import datetime, timedelta\n", |
| 26 | + "\n", |
| 27 | + "import numpy as np\n", |
| 28 | + "import pandas as pd\n", |
| 29 | + "from sklearn.ensemble import RandomForestRegressor\n", |
| 30 | + "from sklearn.pipeline import Pipeline\n", |
| 31 | + "from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder\n", |
| 32 | + "from sklearn.compose import ColumnTransformer\n", |
| 33 | + "import pyspark.sql\n", |
| 34 | + "from pyspark.sql import SparkSession\n", |
| 35 | + "import pyspark.sql.functions as sf\n", |
| 36 | + "from pyspark.sql.types import DoubleType\n", |
| 37 | + "import pyarrow" |
| 38 | + ] |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "execution_count": null, |
| 43 | + "metadata": {}, |
| 44 | + "outputs": [], |
| 45 | + "source": [ |
| 46 | + "# Defining data path, target and features\n", |
| 47 | + "TITANIC_URL = \"https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv\"\n", |
| 48 | + "TARGET = \"fare\"\n", |
| 49 | + "NUMERICAL_FEATURES = [\n", |
| 50 | + " \"sibsp\",\n", |
| 51 | + " \"parch\",\n", |
| 52 | + " \"age\"\n", |
| 53 | + "]\n", |
| 54 | + "CATEGORICAL_FEATURES = [\n", |
| 55 | + " \"sex\",\n", |
| 56 | + " \"cabin\"\n", |
| 57 | + "]\n", |
| 58 | + "ALL_FEATURES = NUMERICAL_FEATURES + CATEGORICAL_FEATURES" |
| 59 | + ] |
| 60 | + }, |
| 61 | + { |
| 62 | + "cell_type": "code", |
| 63 | + "execution_count": null, |
| 64 | + "metadata": {}, |
| 65 | + "outputs": [], |
| 66 | + "source": [ |
| 67 | + "# It is necessary for us to set a SparkSession\n", |
| 68 | + "for dir in [\"metastore_db\", \"derby.log\", \".cache\"]:\n", |
| 69 | + " try:\n", |
| 70 | + " shutil.rmtree(dir)\n", |
| 71 | + " except OSError:\n", |
| 72 | + " pass\n", |
| 73 | + "\n", |
| 74 | + "spark = (SparkSession.builder\n", |
| 75 | + " .master(\"local[2]\")\n", |
| 76 | + " .appName(\"sklearn-deploy\")\n", |
| 77 | + " .config(\"spark.ui.enabled\", \"false\")\n", |
| 78 | + " .getOrCreate()\n", |
| 79 | + " )" |
| 80 | + ] |
| 81 | + }, |
| 82 | + { |
| 83 | + "cell_type": "code", |
| 84 | + "execution_count": null, |
| 85 | + "metadata": {}, |
| 86 | + "outputs": [], |
| 87 | + "source": [ |
| 88 | + "# Read the df, select relevant columns and drop any NaNs\n", |
| 89 | + "df = (\n", |
| 90 | + " pd.read_csv(TITANIC_URL)[NUMERICAL_FEATURES + CATEGORICAL_FEATURES + [TARGET]]\n", |
| 91 | + " .dropna()\n", |
| 92 | + ")\n", |
| 93 | + "\n", |
| 94 | + "for num_feat in NUMERICAL_FEATURES:\n", |
| 95 | + " df[num_feat] = df[num_feat].astype(float)" |
| 96 | + ] |
| 97 | + }, |
| 98 | + { |
| 99 | + "cell_type": "code", |
| 100 | + "execution_count": null, |
| 101 | + "metadata": {}, |
| 102 | + "outputs": [], |
| 103 | + "source": [ |
| 104 | + "# We have a number of numerical features: sibsp, parch and age.\n", |
| 105 | + "# And a two categorical features: cabin and sex\n", |
| 106 | + "# We will use those features to predict fare\n", |
| 107 | + "df.head()" |
| 108 | + ] |
| 109 | + }, |
| 110 | + { |
| 111 | + "cell_type": "code", |
| 112 | + "execution_count": null, |
| 113 | + "metadata": {}, |
| 114 | + "outputs": [], |
| 115 | + "source": [ |
| 116 | + "# In order to deploy our python model in PySpark we need a PySpark DataFrame\n", |
| 117 | + "ddf = spark.createDataFrame(df)\n", |
| 118 | + "ddf.show(5)" |
| 119 | + ] |
| 120 | + }, |
| 121 | + { |
| 122 | + "cell_type": "markdown", |
| 123 | + "metadata": {}, |
| 124 | + "source": [ |
| 125 | + "# (i) Deploying a simple Random Forest\n", |
| 126 | + "----\n", |
| 127 | + "\n", |
| 128 | + "To keep things simple we will start by using our predefined `NUMERICAL_FEATURES` to predict \"fare\". The cell below fits the model to all the data, in practice this is not advisable. Our goal is simply to create an object that is capable of making predictions, in the context of this quality of those predictions is of no interest.\n", |
| 129 | + "\n", |
| 130 | + "`spark_predict` is used to deploy our model in PySpark. The function is a wrapper around a `pandas_udf`, a wrapper is used to enable a python ml model to be passed to the `pandas_udf`. The function is based on the excellent blog post [\"Prediction at Scale with scikit-learn and PySpark Pandas UDFs\"](https://medium.com/civis-analytics/prediction-at-scale-with-scikit-learn-and-pyspark-pandas-udfs-51d5ebfb2cd8) written by **Michael Heilman**." |
| 131 | + ] |
| 132 | + }, |
| 133 | + { |
| 134 | + "cell_type": "code", |
| 135 | + "execution_count": null, |
| 136 | + "metadata": {}, |
| 137 | + "outputs": [], |
| 138 | + "source": [ |
| 139 | + "def spark_predict(model, *cols) -> pyspark.sql.column:\n", |
| 140 | + " \"\"\"This function deploys python ml in PySpark using the `predict` method of the `model` parameter.\n", |
| 141 | + " \n", |
| 142 | + " Args:\n", |
| 143 | + " model: python ml model with sklearn API\n", |
| 144 | + " *cols (list-like): Features used for predictions, required to be present as columns in the spark \n", |
| 145 | + " DataFrame used to make predictions.\n", |
| 146 | + " \"\"\"\n", |
| 147 | + " @sf.pandas_udf(returnType=DoubleType())\n", |
| 148 | + " def predict_pandas_udf(*cols):\n", |
| 149 | + " # cols will be a tuple of pandas.Series here.\n", |
| 150 | + " X = pd.concat(cols, axis=1)\n", |
| 151 | + " return pd.Series(model.predict(X))\n", |
| 152 | + " \n", |
| 153 | + " return predict_pandas_udf(*cols)" |
| 154 | + ] |
| 155 | + }, |
| 156 | + { |
| 157 | + "cell_type": "code", |
| 158 | + "execution_count": null, |
| 159 | + "metadata": {}, |
| 160 | + "outputs": [], |
| 161 | + "source": [ |
| 162 | + "rf = RandomForestRegressor()\n", |
| 163 | + "rf = rf.fit(df[NUMERICAL_FEATURES], df[TARGET])" |
| 164 | + ] |
| 165 | + }, |
| 166 | + { |
| 167 | + "cell_type": "code", |
| 168 | + "execution_count": null, |
| 169 | + "metadata": {}, |
| 170 | + "outputs": [], |
| 171 | + "source": [ |
| 172 | + "# Let's make some predictions for comparison against our PySpark predictions.\n", |
| 173 | + "rf.predict(df[NUMERICAL_FEATURES])[:5]" |
| 174 | + ] |
| 175 | + }, |
| 176 | + { |
| 177 | + "cell_type": "code", |
| 178 | + "execution_count": null, |
| 179 | + "metadata": {}, |
| 180 | + "outputs": [], |
| 181 | + "source": [ |
| 182 | + "# Here we deploy our model in PySpark using our previously defined `spark_predict`.\n", |
| 183 | + "# Upon looking at the DataFrame printed below we can see that the predictions in PySpark are same as made in python\n", |
| 184 | + "(\n", |
| 185 | + " ddf\n", |
| 186 | + " .select(NUMERICAL_FEATURES + [TARGET])\n", |
| 187 | + " .withColumn(\"prediction\", spark_predict(rf, *NUMERICAL_FEATURES).alias(\"prediction\"))\n", |
| 188 | + " .show(5)\n", |
| 189 | + ")" |
| 190 | + ] |
| 191 | + }, |
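+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A quick sanity check (optional): collect the PySpark predictions back to pandas\n",
+ "# and confirm they match the local sklearn predictions. Row order is preserved here\n",
+ "# because this query involves no shuffle; for shuffled or repartitioned data, joining\n",
+ "# on a key column would be a safer way to compare.\n",
+ "spark_preds = (\n",
+ " ddf\n",
+ " .select(NUMERICAL_FEATURES)\n",
+ " .withColumn(\"prediction\", spark_predict(rf, *NUMERICAL_FEATURES))\n",
+ " .toPandas()[\"prediction\"]\n",
+ " .values\n",
+ ")\n",
+ "np.allclose(rf.predict(df[NUMERICAL_FEATURES]), spark_preds)"
+ ]
+ },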
| 192 | + { |
| 193 | + "cell_type": "markdown", |
| 194 | + "metadata": {}, |
| 195 | + "source": [ |
| 196 | + "# (ii) Deploying a Pipeline with Feature Scaling\n", |
| 197 | + "----\n", |
| 198 | + "\n", |
| 199 | + "It is common practice to scale numerical features, so in the example below we make things a little more interesting by scaling our `NUMERICAL_FEATURES` before fitting our model and making predictions. Feature scaling is performed using sklearn's `Pipeline`." |
| 200 | + ] |
| 201 | + }, |
| 202 | + { |
| 203 | + "cell_type": "code", |
| 204 | + "execution_count": null, |
| 205 | + "metadata": {}, |
| 206 | + "outputs": [], |
| 207 | + "source": [ |
| 208 | + "# Construct and fit a `Pipeline` to our Titanic dataset\n", |
| 209 | + "pipe = Pipeline(steps=[(\"scaler\", MinMaxScaler()), (\"predictor\", RandomForestRegressor())])\n", |
| 210 | + "pipe = pipe.fit(df[NUMERICAL_FEATURES], df[TARGET])" |
| 211 | + ] |
| 212 | + }, |
| 213 | + { |
| 214 | + "cell_type": "code", |
| 215 | + "execution_count": null, |
| 216 | + "metadata": {}, |
| 217 | + "outputs": [], |
| 218 | + "source": [ |
| 219 | + "# Again let's make some predictions for comparison against our PySpark predictions.\n", |
| 220 | + "pipe.predict(df[NUMERICAL_FEATURES])[:5]" |
| 221 | + ] |
| 222 | + }, |
| 223 | + { |
| 224 | + "cell_type": "code", |
| 225 | + "execution_count": null, |
| 226 | + "metadata": {}, |
| 227 | + "outputs": [], |
| 228 | + "source": [ |
| 229 | + "# Model deployment in PySpark using our `spark_predict` function\n", |
| 230 | + "(\n", |
| 231 | + " ddf\n", |
| 232 | + " .select(NUMERICAL_FEATURES + [TARGET])\n", |
| 233 | + " .withColumn(\"pipe_predict\", spark_predict(pipe, *NUMERICAL_FEATURES).alias(\"prediction\")).show(5)\n", |
| 234 | + ")" |
| 235 | + ] |
| 236 | + }, |
| 237 | + { |
| 238 | + "cell_type": "markdown", |
| 239 | + "metadata": {}, |
| 240 | + "source": [ |
| 241 | + "# (iii) Deploying a Pipeline with Mixed Feature Types\n", |
| 242 | + "----\n", |
| 243 | + "\n", |
| 244 | + "It is not uncommon to use both categorical and numerical features in an ML model. In the next example I demonstrate how we can build an sklearn `Pipeline` capable of encoding categorical features and scaling numerical features. This pipeline is then deployed in PySpark." |
| 245 | + ] |
| 246 | + }, |
| 247 | + { |
| 248 | + "cell_type": "code", |
| 249 | + "execution_count": null, |
| 250 | + "metadata": {}, |
| 251 | + "outputs": [], |
| 252 | + "source": [ |
| 253 | + "# We create the preprocessing pipelines for both numeric and categorical data\n", |
| 254 | + "categorical_transformer = Pipeline(steps=[(\"encoder\", OrdinalEncoder())])\n", |
| 255 | + "numerical_transformer = Pipeline(steps=[(\"scaler\", MinMaxScaler())])\n", |
| 256 | + "\n", |
| 257 | + "preprocessor = ColumnTransformer(\n", |
| 258 | + " transformers=[\n", |
| 259 | + " (\"cat\", categorical_transformer, [3, 4]),\n", |
| 260 | + " (\"num\", numerical_transformer, [0, 1, 2])]\n", |
| 261 | + ")\n", |
| 262 | + "\n", |
| 263 | + "# Append random forest to preprocessing pipeline. We now have a full prediction pipeline.\n", |
| 264 | + "preprocessor_pipe = Pipeline(steps=[(\"preprocessor\", preprocessor), (\"predictor\", RandomForestRegressor())])\n", |
| 265 | + "preprocessor_pipe = preprocessor_pipe.fit(df[ALL_FEATURES], df[TARGET])" |
| 266 | + ] |
| 267 | + }, |
| 268 | + { |
| 269 | + "cell_type": "code", |
| 270 | + "execution_count": null, |
| 271 | + "metadata": {}, |
| 272 | + "outputs": [], |
| 273 | + "source": [ |
| 274 | + "# Again let's make some predictions to compare our PySpark deployment against\n", |
| 275 | + "preprocessor_pipe.predict(df[ALL_FEATURES])[:5]" |
| 276 | + ] |
| 277 | + }, |
| 278 | + { |
| 279 | + "cell_type": "code", |
| 280 | + "execution_count": null, |
| 281 | + "metadata": {}, |
| 282 | + "outputs": [], |
| 283 | + "source": [ |
| 284 | + "# Again let's deploy our pipeline in PySpark using our `spark_predict` function\n", |
| 285 | + "(\n", |
| 286 | + " ddf\n", |
| 287 | + " .select(ALL_FEATURES + [TARGET])\n", |
| 288 | + " .withColumn(\"pipe_predict\", spark_predict(preprocessor_pipe, *ALL_FEATURES).alias(\"prediction\"))\n", |
| 289 | + " .show(5)\n", |
| 290 | + ")" |
| 291 | + ] |
| 292 | + }, |
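+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# As noted in the introduction, any estimator exposing an sklearn-style `fit`/`predict`\n",
+ "# can be dropped into `spark_predict`. A minimal illustration using sklearn's\n",
+ "# GradientBoostingRegressor, chosen here purely as a stand-in for XGBoost, LightGBM, etc.\n",
+ "from sklearn.ensemble import GradientBoostingRegressor\n",
+ "\n",
+ "gbm = GradientBoostingRegressor()\n",
+ "gbm = gbm.fit(df[NUMERICAL_FEATURES], df[TARGET])\n",
+ "\n",
+ "(\n",
+ " ddf\n",
+ " .select(NUMERICAL_FEATURES + [TARGET])\n",
+ " .withColumn(\"gbm_predict\", spark_predict(gbm, *NUMERICAL_FEATURES))\n",
+ " .show(5)\n",
+ ")"
+ ]
+ },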
| 293 | + { |
| 294 | + "cell_type": "markdown", |
| 295 | + "metadata": {}, |
| 296 | + "source": [ |
| 297 | + "# Summary\n", |
| 298 | + "----\n", |
| 299 | + "The `spark_predict` function defined in this notebook is a versatile solution to python ml deployment in PySpark. \n", |
| 300 | + "We have demonstrated it's use in three **deployment** examples:\n", |
| 301 | + "- Simple `RandomForestRegressor`\n", |
| 302 | + "- Ml `Pipeline` that scales numerical features\n", |
| 303 | + "- ML `Pipeline` that is capable of processing mixed feature types" |
| 304 | + ] |
| 305 | + } |
| 306 | + ], |
| 307 | + "metadata": { |
| 308 | + "kernelspec": { |
| 309 | + "display_name": "Python 3", |
| 310 | + "language": "python", |
| 311 | + "name": "python3" |
| 312 | + }, |
| 313 | + "language_info": { |
| 314 | + "codemirror_mode": { |
| 315 | + "name": "ipython", |
| 316 | + "version": 3 |
| 317 | + }, |
| 318 | + "file_extension": ".py", |
| 319 | + "mimetype": "text/x-python", |
| 320 | + "name": "python", |
| 321 | + "nbconvert_exporter": "python", |
| 322 | + "pygments_lexer": "ipython3", |
| 323 | + "version": "3.7.0" |
| 324 | + } |
| 325 | + }, |
| 326 | + "nbformat": 4, |
| 327 | + "nbformat_minor": 2 |
| 328 | +} |