Model Serving
This spike lists and compares open-source model serving tools based on ease of implementation, compatibility with PyTorch/scikit-learn, and Kubernetes integration, and recommends the most suitable tool for our use case.
We are specifically looking for real-time (online) inference rather than batch inference. Batch (offline) inference processes a large batch of inputs at once instead of handling each input individually in real time.
Model Serving vs Model Deployment
To work from a standard set of definitions, the team follows the terminology used in this blog.
Model Serving Runtime: Packaging a trained machine learning model into a container and setting up APIs so it can handle incoming requests. This allows the model to be used in a production environment, responding to data inputs with predictions (inference). BentoML, TorchServe, and TensorFlow Serving are examples.
Model Serving Platform: An environment designed to dynamically scale the number of model containers in response to incoming traffic. Tools like KServe, Bento Cloud, and Seldon Core are examples of serving platforms. They manage the infrastructure needed to deploy and scale models efficiently, responding to varying traffic without manual intervention.
Model Deployment: The process of integrating a packaged model into a serving platform and connecting it to the broader infrastructure, such as databases and downstream services. This ensures the model can access necessary data, perform its intended functions, and deliver inference results to consumers.
The terms ‘model serving’ and ‘model deployment’ are often loosely considered to have the same meaning, and some documents use them interchangeably.
Table of Contents
- MLFlow For Model Serving
- Model Serving using FastAPI
- BentoML for Model Serving
- Summary Table
- Resources
1. MLFlow For Model Serving
MLflow supports a variety of model deployment targets, including local infrastructure, AWS SageMaker, Azure ML, Databricks, and Kubernetes; here we focus on Kubernetes deployment.
MLFlow - Model Serving Runtime
- MLflow uses MLServer to create a serving runtime – MLServer — MLServer Documentation
- By default MLflow serves models with Flask, but recommends MLServer for production – Deploy MLflow Model as a Local Inference Server — MLflow 2.15.1 documentation
- Supports almost all machine learning frameworks, with no vendor lock-in, unlike TorchServe or TensorFlow Serving
- With a few simple steps, the http://<host>:<port>/invocations endpoint can serve predictions (see the example request below)
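To illustrate the last point, here is a minimal, untested sketch of querying the /invocations endpoint of a locally served model from Python. The host, port, feature names, and the dataframe_split payload format are assumptions based on the MLflow scoring server defaults and should be adapted to the model's signature.

"""Query a served MLflow model over HTTP (placeholder host, port and features)."""
import requests

SCORING_URL = "http://localhost:1234/invocations"  # i.e. http://<host>:<port>/invocations

payload = {
    "dataframe_split": {
        "columns": ["feature_1", "feature_2"],   # placeholder feature names
        "data": [[0.5, 1.2], [1.0, -0.3]],       # placeholder input rows
    }
}

response = requests.post(SCORING_URL, json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [...]}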
MLFlow - Model Serving Platforms
- MLflow's native serving runtime (MLServer) is supported by both KServe and Seldon Core, which are Kubernetes-native serving platforms
- KServe has example code and more documentation and detail from the MLflow side - Develop ML model with MLflow and deploy to Kubernetes
- Partner documentation from KServe - MLFlow - KServe Documentation Website
- Seldon Core has no example code on the MLflow side, but its partner documentation is also rich - MLflow Server — seldon-core documentation
The preferred option here, when deploying ML models in the MLflow ecosystem, is to use MLServer as the serving runtime and KServe as the serving platform, both of which are natively supported by MLflow.
MLFlow - Deploying MLFlow model to Kubernetes
Kubernetes Model Deploy: Packaging and Dependencies
The prerequisite for deploying a model to Kubernetes is packaging it as an MLflow Model, as described in Packaging and Dependencies (linked above).
This is already done at the end of our model training process.
An MLflow Model already packages your model and its dependencies, so MLflow can create either a virtual environment (for local deployment) or a Docker container image containing everything needed to run the model. We therefore do not need to bundle the dependencies separately.
Once the model is ready, deploying it to KServe can be done in two ways (methods linked in Resources): either using a Docker image or using a model URI.
Summarised steps for the Docker-image-based approach (a Python sketch of step 4 follows the list):
1. Install MLServer using
pip install mlflow[extras]
2. Install KServe on the Kubernetes cluster (KServe install linked above)
3. Test the model serving locally:
mlflow models serve -m runs:/{run_id_for_your_best_run}/model -p 1234 --enable-mlserver
4. Building the model Docker image is as simple as:
mlflow models build-docker -m runs:/{run_id_for_your_best_run}/model -n {your_dockerhub_user_name}/{mlflow-model-name} --enable-mlserver
5. Push the image to the Docker registry
6. Write a deployment configuration YAML file (a KServe InferenceService)
7. Deploy to the Kubernetes cluster using kubectl
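For reference, step 4 can also be driven from Python instead of the CLI. This is an untested sketch using MLflow's Python API; the run ID and image name are placeholders.

"""Build an MLServer-enabled serving image from Python (placeholder run ID and image name)."""
import mlflow.models

MODEL_URI = "runs:/{run_id_for_your_best_run}/model"  # placeholder, same as the CLI examples

# Python equivalent of `mlflow models build-docker ... --enable-mlserver`
mlflow.models.build_docker(
    model_uri=MODEL_URI,
    name="{your_dockerhub_user_name}/{mlflow-model-name}",  # placeholder image name
    enable_mlserver=True,
)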
If using the model-URI approach, we need to specify the model URI in a remote storage format, e.g. s3://xxx or gs://xxx. By default, MLflow stores the model in the local file system, so you need to configure MLflow to store the model in remote storage. Please refer to Artifact Store (linked in Resources) for setup instructions.
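A minimal, untested sketch of pointing MLflow at remote storage, assuming an S3 bucket named my-mlflow-artifacts with working AWS credentials; the experiment name and bucket path are placeholders:

"""Store a model in an S3-backed artifact store so its URI is remote (hypothetical bucket name)."""
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

# Create an experiment whose artifacts live in S3 instead of the local file system
experiment_id = mlflow.create_experiment(
    "k8s-serving-demo",                                   # placeholder experiment name
    artifact_location="s3://my-mlflow-artifacts/mlflow",  # placeholder bucket/prefix
)

with mlflow.start_run(experiment_id=experiment_id):
    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])  # stand-in model
    mlflow.sklearn.log_model(model, artifact_path="model")
    # Resolves to an s3:// URI that can be used as the remote model URI
    print(mlflow.get_artifact_uri("model"))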
MLFlow - Summary
- Supports PyTorch and scikit-learn models natively - MLFlow models - built-in model flavors
- Models are already in a supported MLflow Models format when training completes
- Easy integration with Kubernetes is provided via KServe or Seldon Core
- Dependencies are already taken care of
- Easy serving and deployment process
2. Model Serving using FastAPI
FastAPI - Model Serving Runtime
- Manually create a model serving runtime using FastAPI
- Containerise the resulting FastAPI app to expose model endpoints
- Consistent custom endpoint URLs need to be created manually
FastAPI - Model Serving Platforms
- Once the container is ready, any serving platform such as Seldon Core or KServe (open source) can be used
The preferred option here is to use FastAPI to bundle the dependencies and the correct model version into a web serving app, and then use KServe as the serving platform to deploy it.
FastAPI - Deploying MLFlow model to Kubernetes
1. Export the Model from MLFlow
- Load the model from MLflow so that it can be used by the FastAPI application.
- For PyTorch models: mlflow.pytorch.load_model(model_uri, dst_path=None)
- For sklearn models: mlflow.sklearn.load_model(model_uri, dst_path=None)
- MLflow supports several model flavours, and even custom models can be loaded using the mlflow.pyfunc API
2. Build an application to serve the model
"""An example bare minimum FastAPI application."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import numpy as np
import mlflow
app = FastAPI()
MODEL_URI = "model_uri"
class PredictionRequest(BaseModel):
data: list
# Load the MLFlow model at startup
model = mlflow.pytorch.load_model(MODEL_URI)
@app.post("/predict")
def predict(request: PredictionRequest):
try:
input_data = np.array(request.data)
predictions = model(input_data)
return {"predictions": predictions.tolist()}
except Exception as err:
raise HTTPException(status_code=500, detail=str(err))
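Once the app is running (for example via uvicorn main:app), the endpoint can be exercised with a small client script. The payload shape simply mirrors the PredictionRequest model above; host, port, and input rows are placeholders.

import requests

# uvicorn's default port is 8000; use port 80 when running the container described below
response = requests.post(
    "http://localhost:8000/predict",
    json={"data": [[0.5, 1.2], [1.0, -0.3]]},  # placeholder input rows
)
print(response.json())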
3. Containerize the FastAPI Application
- List the requirements in a requirements.txt file or using poetry
- Create a Docker image for the FastAPI application with the requirements preinstalled
- Ensure the relevant port is exposed and the app is started when the container launches, using something similar to:
EXPOSE 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
4. Deploy the Docker Container to Kubernetes
- Use Kubernetes to deploy and manage the container, for example using the same KServe method mentioned above
FastAPI - Summary
- Using web frameworks for building API endpoints is the most straightforward way to serve models
- Not only MLflow models, any model can be served using this approach
- More control over preprocessing and postprocessing
- Time consuming to create consistent APIs
- Dependencies need to be captured separately to run the model
- The model and its specific versions need to be added to the serving app manually
- Hard to automate the serving process
- Involves manual containerisation and then deployment to Kubernetes
- Follows a deployment process similar to the other methods
3. BentoML for Model Serving
BentoML is an open-source model serving library for building performant and scalable AI applications with Python. It comes with everything you need for serving optimization, model packaging, and production deployment.
A “Bento” is an archive containing all the necessary components to package a model.
BentoML - Model Serving Runtime
- Manually create a model serving runtime container using the following steps
- Install BentoML and create a service
- Build the Bento with your model and service
- Containerize the Bento using bentoml containerize
- Push the Docker image to GHCR
- Model loading and inference-related logic needs to be implemented manually
BentoML - Model Serving Platforms
- BentoML recommends Bento Cloud (a paid service) to deploy models in a Kubernetes-native manner
- However, open-source serving platforms such as Seldon Core or KServe can be used instead
BentoML - Deploying MLFlow model to Kubernetes
1. Install BentoML and Dependencies
pip install bentoml
2. Create a BentoML Service
Create a BentoML service for your model, i.e. wrap the model with an API endpoint, similar to the FastAPI process above. A sample, untested snippet:
# sample_service.py (sample, not tested)
import bentoml
import numpy as np

# One-time step: import the MLflow model into the local BentoML model store.
# The URI can be e.g. "runs:/{run_id}/model" or a remote "s3://..." location.
MODEL_URI = "model_uri"

bentoml.mlflow.import_model(
    "mlflow_pytorch_mnist",
    MODEL_URI,
    signatures={"predict": {"batchable": True}},
)

@bentoml.service(
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class MNISTClassifier:
    def __init__(self) -> None:
        # Load the imported model from the BentoML model store as an MLflow pyfunc model
        self.model = bentoml.mlflow.load_model("mlflow_pytorch_mnist:latest")

    @bentoml.api
    def predict(self, data: list) -> dict:
        input_data = np.array(data)
        predictions = self.model.predict(input_data)
        return {"predictions": predictions.tolist()}
3. Build the Bento for the service (run from the project directory; bentoml build reads the build configuration, e.g. a bentofile.yaml, that points at sample_service.py):
bentoml build
4. Containerize the Bento - Use BentoML to containerize the built Bento:
bentoml containerize sample_model:latest
This command builds a Docker image for the BentoML service.
The Docker container can now be run locally (an example request against it follows):
docker run -p 3000:3000 sample_model:latest
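As a quick smoke test (assuming the @bentoml.api method above is exposed at /predict and accepts its arguments as JSON fields, which is the default behaviour in BentoML 1.2+), the running container can be queried like this:

import requests

# The api method name becomes the route; "data" matches the predict() parameter name
response = requests.post(
    "http://localhost:3000/predict",
    json={"data": [[0.0] * 784]},  # placeholder MNIST-sized input row
)
print(response.json())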
5. Deploy to Kubernetes with KServe
- Create a KServe InferenceService YAML file and deploy it to Kubernetes.
Here are the advantages and limitations of BentoML as per neptune.ai
Advantages:
- Ease of use: BentoML is one of the most straightforward frameworks to use. Since the release of 1.2, it has become possible to build a Bento with a few lines of code.
- ML Framework support: BentoML supports all the leading machine learning frameworks, such as PyTorch, Keras, TensorFlow, and scikit-learn.
- Concurrent model execution: BentoML supports fractional GPU allocation (linked in Resources). In other words, you can spawn multiple instances of a model on a single GPU to distribute the processing.
- Integration: BentoML comes with integrations for ZenML, Spark, MLflow, fast.ai, Triton Inference Server, and more.
- Flexibility: BentoML is “Pythonic” and allows you to package any pre-trained model that you can import with Python, such as Large Language Models (LLMs), Stable Diffusion, or CLIP.
- Clear documentation: The documentation is easy to read, well-structured, and contains plenty of helpful examples.
- Monitoring: BentoML integrates with ArizeAI and Prometheus metrics.
Key limitations:
- Requires extra implementation: As BentoML is “Pythonic,” you are required to implement model loading and inference methods on your own.
- Native support for high-performance runtime: BentoML runs on Python. Therefore, it is not as optimal as Tensorflow Serving or TorchServe, both of which run on backends written in C++ that are compiled to machine code. However, it is possible to use the ONNX Python API to speed up the inference time.
BentoML - Summary
- Supports multiple ML frameworks
- More control over preprocessing and postprocessing, similar to FastAPI
- Time consuming to create consistent APIs, but with pre- and post-processing included
- Optimised model serving Docker image
- Follows a deployment process similar to the other methods
Summary Table
Feature | MLFlow | FastAPI | BentoML |
---|---|---|---|
Ease of Implementation | Very easy | Easy | Easy |
Compatibility with scikit-learn and PyTorch | Fully compatible | Fully compatible | Fully compatible |
Integration with MLFlow | Native | Manual - fetch the model from MLFlow and use it within the FastAPI app | Has MLflow integration |
Dependency management | Yes | Manual | Yes, via the MLflow integration |
Additional components or things to consider | Install MLServer (pip install mlflow[extras]) and install KServe on the Kubernetes cluster | Manually bind dependencies and model versions into a container; install KServe on the Kubernetes cluster | Install BentoML (pip install bentoml); install KServe on the Kubernetes cluster; no need for a Yatai-based installation if BentoML is only used for serving runtime creation |
Integration with Kubernetes | Flawless integration with KServe and Seldon Core | Easy integration with any Kubernetes deployment platform once the container is ready | Flawless integration |
Recommended | Yes | Can be considered | Yes |
What is recommended?
It is better to start with the MLflow-based deployment. Once things are in place, or if time permits, or if we need to include preprocessing steps alongside model loading and inference, we can move to the BentoML-based approach.
This decision is mainly driven by the simplicity of the serving option that MLflow provides and the additional learning that BentoML requires. Beyond that, the complexities look similar based on this preliminary check.
What needs to be changed when moving from MLflow to BentoML later?
- A BentoML service needs to be written with a consistent endpoint
- Model versions and dependencies need to be packed correctly
- Create a Docker image using the BentoML CLI instead of the MLflow CLI
- Point the deployment services at the new Docker image
Currently, the preprocessing pipeline used in model training is only saved locally as an artefact during training. If we want to use the exact same preprocessing pipeline for inference, we need a way to log that artefact along with the model, pull it, and apply it before inference. Once we have clarity on how the deployment happens, this can be done without much difficulty.
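As a rough, untested sketch of that option (assuming the preprocessing pipeline is a picklable object; the experiment objects and artefact paths below are placeholders), the pipeline could be logged next to the model during training and downloaded again at inference time:

"""Log the preprocessing pipeline alongside the model and fetch it at inference (hypothetical names/paths)."""
import joblib
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder training objects standing in for our real pipeline and model
X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
preprocessing_pipeline = StandardScaler().fit(X)
model = LogisticRegression().fit(preprocessing_pipeline.transform(X), y)

# --- During training: log the pipeline next to the model ---
with mlflow.start_run() as run:
    joblib.dump(preprocessing_pipeline, "preprocessing.joblib")
    mlflow.log_artifact("preprocessing.joblib", artifact_path="preprocessing")
    mlflow.sklearn.log_model(model, artifact_path="model")

# --- At inference time: pull the pipeline and apply it before predicting ---
local_path = mlflow.artifacts.download_artifacts(
    run_id=run.info.run_id,
    artifact_path="preprocessing/preprocessing.joblib",
)
pipeline = joblib.load(local_path)
features = pipeline.transform([[1.5]])
print(model.predict(features))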
Questions:
1 - Decision between KServe or Seldon Core - which one is more suitable for our use case?
- A comparison - KServe vs. Seldon Core - Superwise ML Observability
- KServe is open source, whereas Seldon Core is expensive, and so is Bento Cloud.
Resources
- MLFlow serving - Deploy MLflow Model to Kubernetes — MLflow 2.15.1 documentation
- KServe - Home - KServe Documentation Website
- KServe using MLFlow models - MLFlow - KServe Documentation Website
- FastAPI vs Flask: Comparison Guide for Data Science Enthusiasts
- KServe installation - GitHub - kserve/kserve: Standardized Serverless ML Inference Platform on Kubernetes
- MLFlow BentoML - MLflow
- BentoML MLFlow dependency management - MLflow additional tips