MLOps Best Practices
While MLOps best practices are technically covered in the Prerequisites section, their importance warrants separate elaboration here.
Within MLOps best practices, there are a number of areas where principles must be considered in order to implement an optimal AI/ML offering. The following considerations are expanded on below:
- Objective and Metric Best Practices
- Infrastructure Best Practices
- Data Best Practices
- Model Best Practices
- Code Best Practices
Objective and Metric Best Practices
Before embarking on the design and implementation of an AI/ML offering, you must first have clearly defined business objectives.
To arrive at these objectives, you must:
- Identify your ‘problem’ to ensure that the ML Model is necessary,
- Collect large amounts of data that align with your objective, and
- Develop clear and scalable metrics to measure success.
When developing your metrics, it is important that the process currently in place to meet your business goal is reviewed thoroughly and regularly, so that automation can be targeted at the areas where that process faces challenges.
The Deployment Service Life Cycle framework provided in this hub contains a table of considerations to adequately clarify your business objectives, resource constraints (funding, time, in/tangible resources), and AI/ML use cases.
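As a simple illustration of a clear, scalable success metric, the sketch below ties a model-level metric to a business threshold. The choice of precision as the metric and 0.85 as the threshold are illustrative assumptions, not values prescribed by this hub.

```python
# Minimal sketch: tying a model metric to a business-agreed threshold.
# The metric (precision) and the 0.85 threshold are illustrative assumptions.

def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are actually positive."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_positive:
        return 0.0
    return sum(1 for t in predicted_positive if t == positive) / len(predicted_positive)

BUSINESS_THRESHOLD = 0.85  # hypothetical target agreed with stakeholders

def meets_objective(y_true, y_pred):
    """Return True if the model satisfies the agreed success metric."""
    return precision(y_true, y_pred) >= BUSINESS_THRESHOLD

# Example usage with toy labels
print(meets_objective([1, 0, 1, 1], [1, 0, 1, 0]))  # True (precision = 1.0, above 0.85)
```

Defining the check as an explicit function makes the success criterion reusable and scalable: the same test can be run on every candidate model and every re-training cycle.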
Infrastructure Best Practices
The right infrastructure must be in place to support the model before you invest time in constructing it. Best practice here is to design the model to be self-contained so that the infrastructure remains independent of it; this way, additional features can be integrated later on.
Key best practices when designing the infrastructure include selecting the right infrastructure components that align with your scope/requirements/constraints, deciding between cloud-based and on-premise infrastructure, and ensuring that the infrastructure is scalable.
The right components can be drawn from a range of containers, orchestration tools, software environments and CI/CD tools; these should be implemented step by step, following the flow of your ML pipeline. This hub offers a Horizon Scan to help you identify the ideal tools for your infrastructure as it relates to GitOps.
When deciding between cloud-based and on-premise infrastructure, three main points organisations should consider alongside their scope, requirements and constraints are whether their choice of infrastructure is:
- cost effective in terms of time and funding,
- low-maintenance, and
- easily scalable.
Cloud-based architecture meets all three of these criteria, with cloud solution providers such as AWS, Azure and GCP offering pre-built, ML-specific infrastructure components.
While on-premise infrastructure can be costly when it comes to maintenance and scalability, it provides high levels of control and security over data, systems and software maintenance.
Ideally, the scalability of your infrastructure should be configured in such a way that it enables you to continue testing your model’s features without affecting the deployed model. An optimal approach for this is a microservices architecture.
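As a minimal illustration of this separation, the sketch below serves a trained model behind a small API so that the serving infrastructure stays independent of the model itself. The FastAPI framework, joblib serialisation and the model.joblib artefact name are assumptions chosen for the example, not requirements of this hub.

```python
# Minimal sketch of a model-serving microservice that keeps the infrastructure
# independent of the model. FastAPI, joblib and "model.joblib" are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# The service only knows how to load *an* artefact produced by the training
# pipeline; the model can be retrained and replaced without changing this code.
MODEL_PATH = "model.joblib"  # hypothetical artefact path
model = joblib.load(MODEL_PATH)

class PredictRequest(BaseModel):
    features: list[float]  # a single feature vector

@app.post("/predict")
def predict(request: PredictRequest) -> dict:
    """Return a prediction for one feature vector."""
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```

Running the service (for example with uvicorn) and swapping the artefact on retraining leaves the deployed model untouched while new features are tested elsewhere in the pipeline.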
Data Best Practices
The quality of the model is contingent on the quality of the data that is processed and fed into it. To ensure that your model is of high quality, you must consider:
- The quantity of your data, in that having large amounts to pre-process generally allows for better model performance. If you have minimal data available, transfer learning (building on models pre-trained on related data) is an alternative.
- Properly implementing data pre-processing, feature engineering, and data validation as part of the ML workflow and for re-training,
- The use of exploratory data analysis for sanity and validation checks, and the implementation of model logging and monitoring processes so you can identify when to stop the pipeline’s execution and address anomalies (a minimal validation sketch follows this list), as well as
- Documenting/storing the data’s features for use throughout the ML lifecycle.
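The sketch below illustrates a minimal data validation step that halts the pipeline when a sanity check fails. The column names, allowed ranges and missing-value tolerance are hypothetical and would come from your own feature documentation.

```python
# Minimal sketch of a data validation step that stops the pipeline on anomalies.
# Column names, ranges and the missing-value tolerance are hypothetical.
import pandas as pd

EXPECTED_COLUMNS = {"age": (0, 120), "income": (0, 1_000_000)}  # assumed schema
MAX_MISSING_FRACTION = 0.05  # assumed tolerance

def validate(df: pd.DataFrame) -> None:
    """Raise ValueError if the batch fails basic schema and range checks."""
    missing_cols = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing expected columns: {missing_cols}")

    for column, (lower, upper) in EXPECTED_COLUMNS.items():
        if df[column].isna().mean() > MAX_MISSING_FRACTION:
            raise ValueError(f"Too many missing values in '{column}'")
        out_of_range = ~df[column].dropna().between(lower, upper)
        if out_of_range.any():
            raise ValueError(f"Out-of-range values detected in '{column}'")

# The pipeline calls validate() before training or re-training; an exception
# here stops execution so the anomaly can be investigated.
```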
Model Best Practices
With the objectives, metrics, infrastructure and data ready or in place, the ideal model can then be chosen. Best practices surrounding model creation comprise:
- Developing a robust model,
- Developing and documenting model training metrics,
- Fine-tuning the model that will be served, and
- Monitoring and optimizing the model’s training.
Developing a robust model involves implementing appropriate validation, testing and monitoring processes for your model’s pipeline. It is also crucial that you have defined and created usable test cases (i.e. criteria for deciding on an optimal model based on chosen training metrics) for your model’s training.
Your model’s training metrics can be developed and documented with platforms such as MLflow. Additionally, using data derived from serving your model (where retrievable) to train it will make the model easier to deploy, as the model is then trained on more direct data and produces more accurate outputs (provided that data, model or concept drift does not arise).
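As a minimal sketch of documenting training metrics with MLflow, the example below logs the parameters and validation metrics of a single training run. The parameter names and metric values are placeholders; in practice they would come from your own training loop and chosen test cases.

```python
# Minimal sketch of logging training parameters and metrics with MLflow.
# Parameter names and metric values are placeholders for your own run.
import mlflow

with mlflow.start_run(run_name="baseline-model"):
    # Record the configuration used for this training run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # ... train the model here ...

    # Record the metrics used to compare candidate models against your test cases
    mlflow.log_metric("validation_precision", 0.87)
    mlflow.log_metric("validation_recall", 0.81)
```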
Code Best Practices
The code that is written must execute effectively at all stages of your pipeline. All relevant actors in your MLOps team (examples of actors can be found in the Skills, Roles and Tool Horizon Scan page of this hub) must be able to read, write or execute the model’s code.
Whereas unit tests evaluate individual features, continuous integration tests the pipeline as a whole to ensure that changes in the code do not break the model.
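The sketch below contrasts the two, assuming hypothetical build_features() and run_pipeline() functions from your own codebase: a pytest-style unit test checks one feature transformation in isolation, while an integration-style smoke test (run in CI) exercises the pipeline end to end.

```python
# Sketch of a unit test versus an integration-style test. build_features() and
# run_pipeline() are hypothetical stand-ins for functions in your own codebase.
from my_pipeline import build_features, run_pipeline  # hypothetical module

def test_build_features_imputes_missing_age():
    """Unit test: one feature transformation, one behaviour."""
    features = build_features({"age": None, "income": 42_000})
    assert features["age"] is not None  # e.g. imputed rather than propagated

def test_pipeline_end_to_end_smoke():
    """Integration test run in CI: the whole pipeline completes and yields a model."""
    result = run_pipeline(sample_size=100)
    assert result.model is not None
    assert result.metrics["validation_precision"] >= 0.0
```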
Best practices for writing your code include:
- Following naming conventions for variables,
- Ensuring quality in the form of readability, so that others can maintain and extend the code as requirements change,
- Writing productionised code,
- Deploying models in containers for easier integration, and
- Automating unit and integration tests wherever possible.