# Lesson 2-3: Inference Pipelines

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zenml-io/zenbytes/blob/main/2-3_Inference_Pipelines.ipynb)

***Key Concepts:*** *Inference Pipelines*

In the last lesson, we learned how to add model deployment as a step in our ML pipeline, which allows us to automatically deploy models into production after training them. We also saw how to interact with the served model manually.

In practice, querying the model is just one of many steps you would have to perform at inference time. Whenever you receive a request, you might need to preprocess the incoming data, and you might also have postprocessing code that runs after the model, such as converting outputs to a different format or sending alerts.
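
As a rough sketch of this idea, an inference request can be thought of as three chained functions. The names `preprocess`, `fake_model`, and `postprocess` below are hypothetical stand-ins (not part of ZenML or of the served model), and the pixel range of 0 to 16 is an assumption made only for this illustration:

```python
def preprocess(raw_pixels):
    """Scale raw pixel values (assumed range 0..16) down to 0..1."""
    return [value / 16.0 for value in raw_pixels]


def fake_model(features):
    """Hypothetical stand-in for the served model: one score per class."""
    total = sum(features)
    return [total, 1.0 - total]


def postprocess(scores):
    """Convert raw scores into a response that is easier to consume."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return {"label": best, "score": scores[best]}


def handle_request(raw_pixels):
    """Run preprocessing, the model call, and postprocessing in order."""
    return postprocess(fake_model(preprocess(raw_pixels)))


response = handle_request([0, 8, 16, 8])
```

Each function maps naturally onto one step of an inference pipeline, which is the structure we build in the rest of this lesson.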

That is why it makes sense to use ML pipelines not only for model training but also for inference. To prevent [training-serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew), we might want to reuse some of the steps from our training pipeline when defining the inference pipeline. This is particularly important for steps like data preprocessing, which we expect to behave similarly in both environments.
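
To make the point about reuse concrete, here is a minimal plain-Python sketch in which training and inference call the same `normalize` function. All names and the toy "model" are hypothetical and chosen only for illustration; they are not the steps used in this course:

```python
def normalize(rows, scale=16.0):
    """Shared preprocessing: used by both the training and the inference code."""
    return [[value / scale for value in row] for row in rows]


def train(rows, labels):
    """Toy 'training': store the mean feature value per label."""
    features = normalize(rows)
    sums, counts = {}, {}
    for row, label in zip(features, labels):
        sums[label] = sums.get(label, 0.0) + sum(row) / len(row)
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, rows):
    """Toy 'inference': pick the label whose stored mean is closest."""
    features = normalize(rows)  # same function as in train(), so no skew
    predictions = []
    for row in features:
        mean = sum(row) / len(row)
        predictions.append(min(model, key=lambda label: abs(model[label] - mean)))
    return predictions


model = train([[0, 0, 4, 4], [16, 16, 12, 12]], labels=["dark", "bright"])
predictions = predict(model, [[2, 2, 2, 2], [14, 14, 14, 14]])
```

If `predict` used its own copy of the scaling logic, a later change to one copy (say, a different `scale`) would silently shift the inputs the model sees at inference time.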

Note that the two pipelines are decoupled and can run independently from each other. In practice, the overall workflow looks like this: 
1. We run the training pipeline to train and deploy a model,
2. Whenever an inference request comes in, the inference pipeline sends data to the currently deployed model and receives the corresponding model prediction,
3. Whenever we rerun the training pipeline, a new model is trained and deployed, replacing the previously deployed model (or gradually phasing it out).
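
The three steps above can be mimicked with a toy sketch in plain Python. `DeploymentSlot`, `toy_training_run`, and `toy_inference_run` are hypothetical names for this illustration only; a real model server and real pipelines are of course more involved:

```python
class DeploymentSlot:
    """Hypothetical stand-in for a model server that holds one live model."""

    def __init__(self):
        self.model = None
        self.version = 0

    def deploy(self, model):
        """Replace whichever model is currently being served."""
        self.model = model
        self.version += 1


def toy_training_run(slot, offset):
    """Toy training run: 'trains' a model and deploys it into the slot."""
    slot.deploy(lambda x: x + offset)


def toy_inference_run(slot, x):
    """Toy inference run: queries whichever model is currently deployed."""
    return slot.model(x)


slot = DeploymentSlot()
toy_training_run(slot, offset=1)      # step 1: train and deploy
first = toy_inference_run(slot, 10)   # step 2: served by model version 1
toy_training_run(slot, offset=100)    # step 3: retraining replaces the model
second = toy_inference_run(slot, 10)  # now served by model version 2
```

Note that the inference function never changes: it only knows about the slot, so it automatically picks up whatever the latest training run deployed.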

![Training and Inference Pipelines GIF](_assets/2-3/training_inference_pipelines.gif)

In this notebook, we will build a rudimentary inference pipeline to interact with our served model. 
The pipeline will consist of the following three steps:
1. Load a data sample
2. Load the model (prediction service)
3. Run inference with the model on the data sample

First, let's install ZenML, set up an MLflow stack, and rerun the deployment pipeline from the last lesson:

In [None]:
%pip install "zenml[server]"
!zenml integration install sklearn mlflow -y
!rm -rf .zen
!zenml init
!zenml experiment-tracker register mlflow_tracker --flavor=mlflow
!zenml model-deployer register mlflow --flavor=mlflow
!zenml stack register mlflow_stack -a default -o default -d mlflow -e mlflow_tracker
!zenml stack set mlflow_stack

%pip install pyparsing==2.4.2 # required for Colab

import IPython

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

In [None]:
from zenml.environment import Environment

if Environment.in_google_colab():  # Colab-only setup
    # clone zenbytes repo to get source code of previous lessons
    !git clone https://github.com/zenml-io/zenbytes.git  # noqa
    !mv zenbytes/steps .
    !mv zenbytes/pipelines .

In [None]:
from zenml.integrations.mlflow.steps import (
    MLFlowDeployerParameters,
    mlflow_model_deployer_step,
)

from pipelines.training_pipeline import train_evaluate_deploy_pipeline
from steps.deployment_trigger import deployment_trigger
from steps.evaluator import evaluator
from steps.importer import importer
from steps.mlflow_trainer import svc_trainer_mlflow

train_evaluate_deploy_pipeline(
    importer=importer(),
    trainer=svc_trainer_mlflow(),
    evaluator=evaluator(),
    deployment_trigger=deployment_trigger(),
    model_deployer=mlflow_model_deployer_step(
        MLFlowDeployerParameters(timeout=20)
    ),  # deployment step added in the last lesson
).run(unlisted=True)

Now we are ready to build our inference pipeline:

In [None]:
from zenml.pipelines import pipeline


@pipeline
def inference_pipeline(
    inference_data_loader,
    prediction_service_loader,
    predictor,
):
    """Basic inference pipeline."""
    inference_data = inference_data_loader()
    model_deployment_service = prediction_service_loader()
    predictor(model_deployment_service, inference_data)

In practice, the inference data loader might receive a single sample from an API request or load a batch of data from a data lake or similar. For simplicity, we will mock this component for now and just load an 8x8 random noise image.

In [None]:
import numpy as np
from zenml.steps import step


@step
def inference_data_loader() -> np.ndarray:
    """Load some inference data."""
    return np.random.rand(1, 64)  # flattened 8x8 random noise image
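
The shape `(1, 64)` used above stands for one sample with 64 features, that is, an 8x8 image flattened into a single row. The following NumPy sketch, which is independent of ZenML, shows how the two shapes relate:

```python
import numpy as np

image = np.random.rand(8, 8)      # one 8x8 image with values in [0, 1)
sample = image.reshape(1, -1)     # one row holding all 64 pixel values

restored = sample.reshape(8, 8)   # undo the flattening
```

Generating the noise directly with `np.random.rand(1, 64)`, as the step does, skips the intermediate 8x8 array but yields data of the same shape.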

Next, let's define the `prediction_service_loader` step. We can use the exact same code here that we used for manually querying the model service in the last lesson, just wrapped in a ZenML step:

In [None]:
from zenml.services import BaseService
from zenml.client import Client
from zenml.steps import step, Output


@step(enable_cache=False)
def prediction_service_loader() -> BaseService:
    """Load the model service of our train_evaluate_deploy_pipeline."""
    client = Client()
    model_deployer = client.active_stack.model_deployer
    # look up running services that were deployed by the `model_deployer` step
    services = model_deployer.find_model_server(
        pipeline_name="train_evaluate_deploy_pipeline",
        pipeline_step_name="model_deployer",
        running=True,
    )
    if not services:
        raise RuntimeError(
            "No running model service found; run the training pipeline first."
        )
    return services[0]

Finally, let's write the `predictor` step that runs inference with our served model on the inference data sample. This step makes sure the service is started, calls its `predict()` endpoint to get logits, and then performs an `argmax` operation to retrieve the class with the highest predicted probability.

In [None]:
@step
def predictor(
    service: BaseService,
    data: np.ndarray,
) -> Output(predictions=list):
    """Run an inference request against a prediction service."""
    service.start(timeout=10)  # should be a no-op if already started
    prediction = service.predict(data)
    prediction = prediction.argmax(axis=-1)  # index of the highest score
    predictions = [prediction.tolist()]
    print(f"Prediction is: {predictions}")
    return predictions
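
To illustrate what the `argmax` call does, here is a small standalone NumPy sketch. The score values are made up for this example and do not come from the served model:

```python
import numpy as np

# Made-up scores for a batch of two samples and three classes.
scores = np.array([
    [0.1, 2.5, 0.3],
    [1.7, 0.2, 0.9],
])

predicted_classes = scores.argmax(axis=-1)  # index of the largest score per row
as_list = predicted_classes.tolist()        # plain Python list: [1, 0]
```

Using `axis=-1` takes the maximum over the last dimension (the classes), so the result contains one class index per sample.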

Let's put it all together to initialize and run our inference pipeline:

In [None]:
# Initialize an inference pipeline run
my_inference_pipeline = inference_pipeline(
 inference_data_loader=inference_data_loader(),
 prediction_service_loader=prediction_service_loader(),
 predictor=predictor(),
)

my_inference_pipeline.run()

And that completes our second ZenBytes chapter on training, deployment, and inference. Our training and inference pipelines are still relatively basic, but we will add more and more features over the coming lessons.

In the next chapter on data management, we will add additional steps for data validation and drift detection to our pipelines, which are essential steps to ensure that our models receive the kind of data we expect. See you in the [next lesson](3-1_Data_Skew.ipynb)!