
MLOps Best Practices

ℹ️

While MLOps best practices are technically covered in the Prerequisites section, their importance warrants separate elaboration here.


Within MLOps best practices, there are a number of areas where principles must be considered in order to implement an optimal AI/ML offering. These considerations are set out in the sections below:

Objective and Metric Best Practices

Before embarking on the design and implementation of an AI/ML offering, you must first have clearly defined business objectives.

To arrive at these objectives, you must:

When developing your metrics, ensure that the process put in place to meet your business goal is reviewed thoroughly and regularly, since automation should target the areas where the current process faces challenges.

The Deployment Service Life Cycle framework provided in this hub contains a table of considerations to help you adequately clarify your business objectives, resource constraints (funding, time, tangible and intangible resources) and AI/ML use cases.

Infrastructure Best Practices

The right infrastructure must be in place to support the model before you invest time in constructing it. A key best practice is to keep the model self-contained so that the infrastructure remains independent of it; this decoupling makes it easier to integrate additional features later on.

Key best practices when designing the infrastructure include selecting the right infrastructure components that align with your scope/requirements/constraints, deciding between cloud-based and on-premise infrastructure, and ensuring that the infrastructure is scalable.

The right components can be drawn from a range of containers, orchestration tools, software environments and CI/CD tools; these should be implemented step by step, following the flow of your ML pipeline. This hub offers a Horizon Scan to assist you with identifying the ideal tools for your infrastructure as it relates to GitOps.
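As a hedged illustration of that step-wise flow, the sketch below models a pipeline as an ordered sequence of stages whose outputs feed forward; the stage names and bodies are hypothetical stand-ins, not tied to any particular tool:

```python
# Minimal sketch: an ML pipeline as an ordered sequence of stages.
# Stage names and bodies are illustrative assumptions.

def ingest(data):
    # e.g. pull raw records from a store
    return list(data)

def preprocess(data):
    # e.g. drop records with missing values
    return [x for x in data if x is not None]

def train(data):
    # e.g. fit a model; here the "model" is just the mean of the data
    return sum(data) / len(data)

def run_pipeline(raw):
    """Execute each stage in order, feeding outputs forward."""
    stages = [ingest, preprocess, train]
    result = raw
    for stage in stages:
        result = stage(result)
    return result

model = run_pipeline([1, 2, None, 3])
print(model)  # mean of the cleaned data: 2.0
```

In a real pipeline each stage would typically run in its own container, with an orchestration tool enforcing the ordering.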

When deciding between cloud-based and on-premise infrastructure, three main points organisations should consider alongside their scope, requirements and constraints are whether their choice of infrastructure is:

Cloud-based architecture meets all three of these criteria, with cloud solution providers such as AWS, Azure and GCP offering pre-built, ML-specific infrastructure components.

While on-premise infrastructure can be costly when it comes to maintenance and scalability, it provides high levels of control and security over data, systems and software maintenance.

Ideally, the scalability of your infrastructure should be configured so that you can continue testing your model’s features without affecting the deployed model. A microservices architecture is well suited to this.
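One way to realise that separation is to serve the stable model and a candidate model as distinct services and route only a small share of traffic to the candidate. The sketch below is a hypothetical illustration of that routing pattern; the model functions and traffic share are assumptions:

```python
import random

# Hypothetical sketch: route a small fraction of requests to a candidate
# model service so new features can be tested without affecting the
# stable, deployed model.

def stable_model(x):
    return x * 2          # stand-in for the deployed model's prediction

def candidate_model(x):
    return x * 2 + 0.1    # stand-in for the model under test

def route(x, candidate_share=0.05, rng=random.random):
    """Send roughly candidate_share of traffic to the candidate service."""
    if rng() < candidate_share:
        return "candidate", candidate_model(x)
    return "stable", stable_model(x)

# With the random draw pinned high, traffic goes to the stable service:
print(route(3, rng=lambda: 0.9))  # ('stable', 6)
```

Because the two models live behind independent services, the candidate can be redeployed or rolled back without touching the stable one.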

Data Best Practices

The quality of the model is contingent on the data being properly processed before it is fed in. To ensure that your model is of high quality, you must consider:
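In practice, simple automated checks can enforce data quality before a batch reaches the model. The field names and thresholds below are hypothetical; the sketch only illustrates the idea of validating schema and completeness:

```python
# Illustrative data-quality checks on a batch of records.
# Field names and thresholds are assumptions for the sketch.

EXPECTED_FIELDS = {"age", "income"}

def validate_batch(records, max_missing_ratio=0.1):
    """Return a list of issues found; an empty list means the batch passes."""
    issues = []
    for i, rec in enumerate(records):
        if set(rec) != EXPECTED_FIELDS:
            issues.append(f"record {i}: unexpected schema {sorted(rec)}")
    missing = sum(1 for r in records if any(v is None for v in r.values()))
    if records and missing / len(records) > max_missing_ratio:
        issues.append(f"too many incomplete records: {missing}/{len(records)}")
    return issues

batch = [{"age": 34, "income": 52000}, {"age": None, "income": 41000}]
print(validate_batch(batch))  # flags the high share of incomplete records
```

Checks like these can run as an early pipeline stage, rejecting a batch before training or serving ever sees it.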

Model Best Practices

With the objectives, metrics, infrastructure and data ready or in place, the ideal model can then be chosen. Best practices surrounding model creation comprise:

Developing a robust model involves implementing appropriate validation, testing and monitoring processes for your model’s pipeline. It is also crucial that you have defined and created usable test cases (i.e. criteria for deciding on an optimal model based on chosen training metrics) for your model’s training.
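Such a test case can be as simple as a set of metric thresholds that a trained model must clear before it is considered for release; the metric names and thresholds below are illustrative assumptions:

```python
# Sketch of an acceptance test: a candidate model's evaluation metrics
# must clear predefined thresholds before it is considered for release.
# Metric names and threshold values are illustrative.

THRESHOLDS = {"accuracy": 0.90, "recall": 0.85}

def passes_acceptance(metrics, thresholds=THRESHOLDS):
    """True only if every tracked metric meets or beats its threshold."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

print(passes_acceptance({"accuracy": 0.93, "recall": 0.88}))  # True
print(passes_acceptance({"accuracy": 0.93, "recall": 0.80}))  # False
```

Encoding the criteria as code makes the "optimal model" decision repeatable rather than a judgement made afresh on each run.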

Your model’s training metrics can be developed and documented with platforms such as MLflow. Additionally, using data derived from serving your model (where retrievable) to train it makes the model easier to deploy: the model is trained on more representative inputs and so produces more accurate outputs, provided that data, model or concept drift does not arise.
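The pattern such platforms implement is run-scoped logging of parameters and per-step metrics. The stand-in below is a minimal pure-Python illustration of that pattern, not MLflow's actual API; all names here are invented for the sketch:

```python
# Minimal stand-in for an experiment tracker, illustrating the pattern
# that platforms like MLflow implement (this is NOT the MLflow API).

class RunTracker:
    def __init__(self, run_name):
        self.run_name = run_name
        self.params = {}
        self.metrics = {}          # metric name -> list of (step, value)

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step):
        self.metrics.setdefault(key, []).append((step, value))

    def best(self, key):
        """Best (highest) value recorded for a metric across all steps."""
        return max(v for _, v in self.metrics[key])

run = RunTracker("baseline")
run.log_param("learning_rate", 0.01)
for step, acc in enumerate([0.71, 0.78, 0.83]):
    run.log_metric("val_accuracy", acc, step)
print(run.best("val_accuracy"))  # 0.83
```

A real tracker additionally persists runs so that metric curves can be compared across experiments and over time.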

Code Best Practices

The code that is written must execute effectively at all stages of your pipeline. All relevant actors in your MLOps team (examples of actors can be found in the Skills, Roles and Tool Horizon Scan page of this hub) must be able to read, write or execute model code.

Whereas unit tests evaluate individual features, continuous integration tests the pipeline as a whole, ensuring that changes to the code will not break the model.
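That distinction can be made concrete with Python's built-in `unittest`: one test targets an individual feature transformation in isolation, while a second smoke-tests the pipeline end to end. Both functions under test are hypothetical examples:

```python
import unittest

# Hypothetical feature transformation and pipeline used for illustration.

def scale_feature(values):
    """Min-max scale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pipeline(values):
    """Toy end-to-end pipeline: scale the feature, then 'predict' the mean."""
    scaled = scale_feature(values)
    return sum(scaled) / len(scaled)

class TestFeature(unittest.TestCase):
    def test_scale_bounds(self):
        # Unit test: one feature transformation in isolation.
        self.assertEqual(scale_feature([2, 4, 6]), [0.0, 0.5, 1.0])

class TestPipeline(unittest.TestCase):
    def test_end_to_end(self):
        # CI-style smoke test: the whole pipeline runs and produces
        # a value in the expected range.
        self.assertTrue(0.0 <= pipeline([2, 4, 6]) <= 1.0)

if __name__ == "__main__":
    unittest.main(exit=False)
```

In a CI setup, the unit tests run on every commit, while the end-to-end test guards the pipeline as a whole before a change is merged.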

Best practices for writing your code include:

Resources

  1. MLOps.org

  2. AWS Well-Architected Framework Guide

  3. ML Best Practices: A Comprehensive List

  4. What is Transfer Learning