
Model Serving

ℹ️

This spike lists and compares open-source model serving tools based on ease of implementation, compatibility with PyTorch/scikit-learn, and Kubernetes integration, and recommends the most suitable tool for our use case.

We are particularly looking for real-time or online inference rather than batch inference. Batch (offline) inference processes a large batch of input data at once rather than processing each input data point individually in real time.


Model Serving vs Model Deployment

To work from a standard set of definitions, the team sticks to the terminology used in this blog.

📓

Model Serving Runtime: Packaging a trained machine learning model into a container and setting up APIs so it can handle incoming requests. This allows the model to be used in a production environment, responding to data inputs with predictions (inference). BentoML, TorchServe, and TensorFlow Serving are examples.

Model Serving Platform: An environment designed to dynamically scale the number of model containers in response to incoming traffic. Tools like KServe, Bento Cloud, and Seldon Core are examples of serving platforms. They manage the infrastructure needed to deploy and scale models efficiently, responding to varying traffic without manual intervention.

Model Deployment: The process of integrating a packaged model into a serving platform and connecting it to the broader infrastructure, such as databases and downstream services. This ensures the model can access necessary data, perform its intended functions, and deliver inference results to consumers.

The terms ‘model serving’ and ‘model deployment’ are often loosely considered to have the same meaning, and some documents use them interchangeably.




Table of Contents

  1. MLFlow For Model Serving
  2. Model Serving using FastAPI
  3. BentoML for Model serving
  4. Summary Table
  5. Resources

1. MLFlow For Model Serving

MLflow supports a variety of model deployment targets, including local infrastructure, AWS SageMaker, Azure ML, Databricks, and Kubernetes, but here we focus on Kubernetes deployment.

MLFlow - Model Serving Runtime

MLFlow - Model Serving Platforms

ℹ️

The preferred option when deploying ML models in the MLFlow ecosystem is to use MLServer as the serving runtime and KServe as the serving platform, both of which are natively supported by MLflow.


MLFlow - Deploying MLFlow model to Kubernetes

Kubernetes Model Deploy: Packaging and Dependencies

KServe install


The prerequisite for deploying a model to Kubernetes is packaging it as an MLflow Model, as described in Packaging and Dependencies (linked above).

This is what we are already doing at the end of the model training process.

An MLflow Model already packages your model and its dependencies, so MLflow can create either a virtual environment (for local deployment) or a Docker container image containing everything needed to run your model. We therefore do not need to bundle the dependencies separately.
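For reference, below is a minimal sketch of how a model ends up packaged as an MLflow Model at the end of training; the dataset, model, and artifact path are illustrative, not our actual training code.

  """Minimal sketch: logging a model so MLflow packages it with its dependencies."""

  import mlflow
  import mlflow.sklearn
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression

  X, y = load_iris(return_X_y=True)
  model = LogisticRegression(max_iter=200).fit(X, y)

  with mlflow.start_run() as run:
      # log_model stores the model together with its environment files
      # (conda.yaml / requirements.txt), which is what makes build-docker possible later.
      mlflow.sklearn.log_model(model, artifact_path="model")
      print(f"Packaged model URI: runs:/{run.info.run_id}/model")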

Once we have the model ready, deploying it to KServe can be done in two ways (methods linked in Resources): using a Docker image or using a model URI.

Summarised steps for the Docker-image-based approach:

1. Install the MLServer using
pip install mlflow[extras]

2. Install KServe on the Kubernetes cluster (KServe install linked above)

3. Test the model serving locally (a request example is sketched after these steps),
mlflow models serve -m runs:/{run_id_for_your_best_run}/model -p 1234 --enable-mlserver

4. Creating the model Docker image is as simple as
mlflow models build-docker -m runs:/{run_id_for_your_best_run}/model -n {your_dockerhub_user_name}/{mlflow-model-name} --enable-mlserver

5. Push the image to a Docker registry

6. Write a deployment configuration YAML file (a KServe InferenceService manifest)

7. Deploy to the Kubernetes cluster using kubectl
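After step 3, the locally served model can be smoke-tested with a request like the one below. This is a sketch that assumes the standard MLflow /invocations scoring endpoint; the payload shape is illustrative and should be adjusted to the model's signature.

  """Smoke-test the locally served model from step 3 (sketch; payload shape is illustrative)."""

  import requests

  # The MLflow scoring protocol accepts JSON such as {"inputs": ...} on /invocations
  payload = {"inputs": [[0.1, 0.2, 0.3, 0.4]]}
  response = requests.post("http://127.0.0.1:1234/invocations", json=payload, timeout=10)
  response.raise_for_status()
  print(response.json())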

If using the model URI approach, we need to specify the model URI in a remote storage format, e.g. s3://xxx or gs://xxx. By default, MLflow stores the model on the local file system, so you need to configure MLflow to store the model in remote storage. Please refer to Artifact Store (linked in Resources) for setup instructions.
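Once the artifact store is remote, the value to use as the model URI can be read back from the tracking server. A small sketch, assuming the run has already been logged to an S3/GCS-backed artifact store (run ID placeholder as above):

  """Look up the remote URI of a logged model (sketch; run ID and bucket are illustrative)."""

  import mlflow

  client = mlflow.MlflowClient()
  run = client.get_run("{run_id_for_your_best_run}")
  # e.g. s3://my-bucket/mlflow-artifacts/<run_id>/artifacts/model
  print(f"{run.info.artifact_uri}/model")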


MLFlow - Summary


2. Model Serving using FastAPI

FastAPI - Model Serving Runtime

FastAPI - Model Serving Platforms

The preferred option here is to use FastAPI to combine the dependencies and the correct model version into a web serving app, and then use KServe as the serving platform to deploy the app.


FastAPI - Deploying MLFlow model to Kubernetes

Model serving and deployment using FastAPI is straightforward and similar to classical software deployments. The steps involved are as follows:

1. Export the Model from MLFlow
    - Load the model from MLFlow so that it can be loaded by the FastAPI application.
    - For PyTorch models - mlflow.pytorch.load_model(model_uri, dst_path=None)
    - For sklearn models - mlflow.sklearn.load_model(model_uri, dst_path=None)
    - MLflow supports several model flavours, and even custom models can be loaded using the mlflow.pyfunc API
2. Create a FastAPI Application
    Build an application to serve the model (a quick client-side test is sketched after these steps):
"""An example bare minimum FastAPI application."""
  
  from fastapi import FastAPI, HTTPException
  from pydantic import BaseModel
  import numpy as np
  import mlflow

  app = FastAPI()

  MODEL_URI = "model_uri"

  class PredictionRequest(BaseModel):
      data: list

  # Load the MLFlow model at startup
  model = mlflow.pytorch.load_model(MODEL_URI)

  @app.post("/predict")
  def predict(request: PredictionRequest):
      try:
          input_data = np.array(request.data)
          predictions = model(input_data)
          return {"predictions": predictions.tolist()}
      except Exception as err:
          raise HTTPException(status_code=500, detail=str(err))

3. Containerize the FastAPI Application
    - List the requirements in a requirements.txt file or manage them with Poetry
    - Create a Docker image for the FastAPI application with the requirements preinstalled
    - Ensure the relevant ports are exposed and the app starts when the container launches, using something similar to:
EXPOSE 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

4. Deploy the Docker Container to Kubernetes
    - Use Kubernetes to deploy and manage the container, likely using the same KServe method mentioned above
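The application from step 2 can also be exercised locally before containerizing, using FastAPI's TestClient (requires httpx). A sketch, assuming the app lives in main.py and an illustrative input shape:

  """Quick local test of the FastAPI app (sketch; module name and payload are illustrative)."""

  from fastapi.testclient import TestClient

  from main import app  # the FastAPI application shown in step 2

  client = TestClient(app)
  response = client.post("/predict", json={"data": [[0.1, 0.2, 0.3, 0.4]]})
  print(response.status_code, response.json())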


FastAPI - Summary


3. BentoML for Model Serving

BentoML is an open-source model serving library for building performant and scalable AI applications with Python. It comes with everything you need for serving optimization, model packaging, and production deployment.

A “Bento” is an archive containing all the necessary components to package a model.

BentoML - Model Serving Runtime

BentoML - Model Serving Platforms

BentoML - Deploying MLFlow model to Kubernetes


1. Install BentoML and Dependencies
pip install bentoml

2. Create a BentoML Service

Create a BentoML service for your model: essentially, wrap the model with an API endpoint, similar to the FastAPI approach above. A sample (untested) snippet:

# sample_service.py
"""A sample (untested) BentoML service wrapping an MLflow model."""

import bentoml
import mlflow
import numpy as np

# Import the MLflow model into the local BentoML model store.
# Assumes an active MLflow run context; otherwise pass a full runs:/ or s3:// model URI.
model_artefact_path = "model_uri"
model_uri = mlflow.get_artifact_uri(model_artefact_path)
bento_model = bentoml.mlflow.import_model(
    "mlflow_pytorch_mnist",
    model_uri,
    signatures={"predict": {"batchable": True}},
)

@bentoml.service(
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class MNISTClassifier:
    def __init__(self) -> None:
        # Load the imported model back as an MLflow pyfunc model
        self.model = bentoml.mlflow.load_model("mlflow_pytorch_mnist:latest")

    @bentoml.api
    def predict(self, data: list) -> dict:
        input_data = np.array(data)
        predictions = self.model.predict(input_data)
        return {"predictions": predictions.tolist()}

3. Build the Bento for the service (the service entry point, e.g. sample_service:MNISTClassifier, is declared in a bentofile.yaml):
bentoml build

4. Containerize the Bento - Use BentoML to containerize the built Bento:
bentoml containerize sample_model:latest

This command builds a Docker image for the BentoML service.
Now the Docker container can be run locally (a client-side call is sketched after these steps):
docker run -p 3000:3000 sample_model:latest

5. Deploy to Kubernetes with KServe
    Create a KServe InferenceService YAML file and deploy it to the Kubernetes cluster.
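Once the container from step 4 is running locally, the service can be called with BentoML's Python client. A sketch, assuming the BentoML 1.2+ client API and the predict endpoint defined in sample_service.py:

  """Call the locally running Bento (sketch; endpoint name and payload are illustrative)."""

  import bentoml

  client = bentoml.SyncHTTPClient("http://localhost:3000")
  # Endpoint names map to the @bentoml.api methods defined on the service
  result = client.predict(data=[[0.1, 0.2, 0.3, 0.4]])
  print(result)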

Here are the advantages and limitations of BentoML, as summarised by neptune.ai:

Advantages:
    Ease of use: BentoML is one of the most straightforward frameworks to use. Since the release of 1.2, it has become possible to build a Bento with a few lines of code.
    ML Framework support: BentoML supports all the leading machine learning frameworks, such as PyTorch, Keras, TensorFlow, and scikit-learn.
    Concurrent model execution: BentoML supports fractional GPU allocation (linked in Resources). In other words, you can spawn multiple instances of a model on a single GPU to distribute the processing.
    Integration: BentoML comes with integrations for ZenML, Spark, MLflow, fast.ai, Triton Inference Server, and more.
    Flexibility: BentoML is “Pythonic” and allows you to package any pre-trained model that you can import with Python, such as Large Language Models (LLMs), Stable Diffusion, or CLIP.
    Clear documentation: The documentation is easy to read, well-structured, and contains plenty of helpful examples.
    Monitoring: BentoML integrates with ArizeAI and Prometheus metrics.

Key limitations:
    Requires extra implementation: As BentoML is “Pythonic,” you are required to implement model loading and inference methods on your own.
    No native support for a high-performance runtime: BentoML runs on Python. Therefore, it is not as optimal as TensorFlow Serving or TorchServe, both of which run on backends written in C++ that are compiled to machine code. However, it is possible to use the ONNX Python API to speed up the inference time.


BentoML - Summary

Summary Table

| Feature | MLFlow | FastAPI | BentoML |
| --- | --- | --- | --- |
| Ease of implementation | Very easy | Easy | Easy |
| Compatibility with sklearn and PyTorch | Fully compatible | Fully compatible | Fully compatible |
| Integration with MLFlow | Native | Manual - fetch the model from MLFlow and use it locally within the FastAPI app | Has MLflow integration |
| Dependency management | Yes | Manual | Yes, via the MLflow integration |
| Additional components or things to consider | Install MLServer (pip install mlflow[extras]) and install KServe on the Kubernetes cluster | Manually bind dependencies and model versions into a container; install KServe on the Kubernetes cluster | Install BentoML via pip install bentoml; install KServe on the Kubernetes cluster; no need for the Yatai-based installation if BentoML is used only to create the serving runtime |
| Integration with Kubernetes | Flawless integration with KServe and Seldon Core | Easy integration with any Kubernetes deployment platform once the container is ready | Flawless integration |
| Recommended | Yes | Can be considered | Yes |


💭

What is recommended?

It is better to start with the MLFlow-based deployment. Once things are in place, if time permits, or if we find the need to include preprocessing steps along with model loading and inference, we could move to the BentoML-based approach.

This decision is mainly because of the simplicity of the serving option that MLFlow provides and the additional learning that BentoML requires. Other than that, the complexities look similar as per this preliminary check.

What needs to be changed when moving from MLFlow to BentoML later?




⚠️

Currently, the preprocessing pipeline used in model training is only saved locally during training as an artefact. If we want to use the exact same preprocessing pipeline for inference, we may need to look for an option to log that artefact along with the model, pull it, and apply it before inference. Once we have clarity on how the deployment will happen, this can be done without much difficulty.


Questions:

1. Decision between KServe and Seldon Core - which one is more suitable for our use case?

Resources

  1. MLFlow serving - Deploy MLflow Model to Kubernetes (MLflow 2.15.1 documentation)

  2. Deploying Models to KServe

  3. Kserve - Home - KServe Documentation Website

  4. Kserve using MLFlow models - MLFlow - KServe Documentation Website

  5. FastAPI

  6. FastAPI vs Flask: Comparison Guide for Data Science Enthusiasts

  7. KubeFlow serving options

  8. KServe installation - GitHub - kserve/kserve: Standardized Serverless ML Inference Platform on Kubernetes

  9. Best Tools For ML Model Serving

  10. MLFlow BentoML - MLflow

  11. BentoML MLFlow dependency management - MLflow additional tips

  12. BentoML fractional GPU allocation