
Relevant Skills and Roles

Table of Contents

  1. Roles
  2. Skills
  3. Architecture Overview
  4. Design Decisions
  5. Horizon Scan
  6. Resources

Roles

The roles required to create and deploy an MLOps pipeline can vary, though our team consists of:

Skills

Required skills include knowledge of:

Architecture Overview

The components of an MLOps workflow, otherwise known as its architecture, can vary depending on the scope and constraints of a given project.

Generally, the following components make up the average MLOps workflow:

  1. Data store and retrieval
    • These can include databases for structured or unstructured data, data lakes, APIs, and flat files (e.g. .csv, .parquet)
  2. Data and feature engineering
    • This step involves the use of Directed Acyclic Graphs (DAGs) to group and automate tasks, organised with dependencies and relationships that dictate how they should run. DAGs can also carry scheduling intervals, timeouts, and similar settings (a minimal sketch follows this list).
  3. Model training and registry
    • This step involves the use of relevant tools for model versioning, storage, testing, and eventually model deployment subject to approval.
    • When different iterations of models are tested, the model, its data and its hyperparameters are stored for future reference. Model registries can also store information like metadata and the lineage of a given model.
  4. Model deployment
    • When the appropriate model is selected, it is then manually pushed to the model server to serve in production. Serving a model is the act of making it accessible to end users, typically via API.
  5. Model monitoring
    • Once a model is deployed, relevant tools are used to monitor its performance, detect occurrences of model drift, and collect logs
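
As an illustration of step 2, the sketch below shows how a simple data and feature engineering pipeline might be expressed as an Airflow DAG: two tasks with an explicit dependency, a daily schedule interval, and a run timeout. It assumes Apache Airflow 2.x, and the task names and function bodies are hypothetical placeholders rather than part of the team's pipeline.

```python
# Minimal feature-engineering DAG sketch (assumes Apache Airflow 2.x).
# The task names and function bodies are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_raw_data():
    # Placeholder: pull raw records from the data store (e.g. S3 or a database).
    pass


def build_features():
    # Placeholder: clean the raw data and compute model-ready features.
    pass


with DAG(
    dag_id="feature_engineering",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # timing interval
    dagrun_timeout=timedelta(hours=1),   # timeout
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_raw_data", python_callable=extract_raw_data)
    features = PythonOperator(task_id="build_features", python_callable=build_features)

    extract >> features  # dependency: extraction must finish before feature building
```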

Design Decisions

For the pre-built MLOps pipeline created by the team, a deploy-as-model approach was taken. As such, the architecture followed by the team comprises a model registry with human intervention for stage tags, model deployment, model monitoring, data storage and retrieval, and finally data and feature engineering.

It is important to note that the components and the workflow differ: the components themselves make up the architecture, while the workflow (the order in which the components are used to create an MLOps pipeline) can vary. For further insight into the typical structure of an MLOps workflow, refer to the resources listed at the end of this page.

Horizon Scan

Below are sets of comparisons for tools you can use for each component of your MLOps pipeline. The criteria for evaluation were derived from the team's requirements, which you may also wish to consider using for your own pipeline. Each component's horizon scan is accompanied by the team's design decision for that component.

Data Store and Retrieval

| Criteria | AWS S3 | APIs | Data Lakes | Databases |
|---|---|---|---|---|
| Description | Amazon Simple Storage Service is an object storage service allowing users to store and protect any amount of data for a range of use cases, such as data lakes, backup and restore, and big data analytics. | An application programming interface is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software. | A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. | A database is an organised collection of data managed by a database management system, the software that interacts with end users, applications, and the database itself to capture and analyse the data. |
| Scalability | Highly scalable without human intervention | Enables scalable communication between systems | Optimal for large amounts of structured and unstructured data without predefined schemas | NoSQL and cloud databases are designed for scaling; others less so |
| Cost effectiveness | Pay-as-you-go (PAYG) model | Costs can be reduced via modular system design and reuse of existing components/code | Often more cost effective than databases for storing large amounts of data | Cloud-based options are PAYG |
| Security | Built-in features including IAM policies and bucket policies, both of which can involve RBAC, keys, and encryption | Robust authentication (e.g. OAuth, API keys) and authorisation mechanisms can be implemented to control access to data | Security measures like access controls and encryption can be implemented | User authentication, RBAC, encryption |
| Accessibility | Data can be accessed from anywhere via HTTP/HTTPS, allowing smooth retrieval via APIs for features like real-time data access; S3 can also act as a centralised store | Supports real-time data retrieval and updates | Centralised repository making data accessible to different teams within an organisation | Optimised query engines and indexes for efficient data access; easy integration with most applications and BI tools |
| Flexibility | Agnostic to file types and storage formats; high fault tolerance (99.9%); accessible to different software systems via APIs | Allows different software systems to communicate regardless of underlying technology stack (via REST, SOAP) | Stores data in raw format across a wide range of file types | Supports various data models, e.g. relational, document, key-value |


Design decisions:
The team chose Amazon S3 as the data storage system. Digital Catapult already uses AWS for several other projects, and our technologists are comfortable with this technology.
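
As a minimal sketch of the data store and retrieval step with this choice, the snippet below uploads and downloads a dataset using boto3. The bucket name and object keys are hypothetical, and credentials are assumed to come from the usual AWS mechanisms (environment variables, an IAM role, etc.).

```python
# Minimal S3 store-and-retrieve sketch using boto3.
# "my-mlops-bucket" and the object keys are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# Store a local CSV of raw data in the bucket.
s3.upload_file("data/raw.csv", "my-mlops-bucket", "datasets/raw.csv")

# Retrieve it later, e.g. at the start of the feature-engineering pipeline.
s3.download_file("my-mlops-bucket", "datasets/raw.csv", "data/raw.csv")
```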

An evaluation of the feature store software the team considered, and the resulting decision, can be found in the Resources section of this page.


Data and Feature Engineering (Training) Pipeline

| Criteria | Apache Airflow | Prefect |
|---|---|---|
| Description | Airflow is an open-source workflow orchestration tool used for orchestrating distributed applications. It works by scheduling jobs across different servers or nodes using DAGs (Directed Acyclic Graphs). A DAG is the core concept of Airflow, collecting tasks together and organising them with dependencies and relationships that say how they should run. | Prefect reduces negative engineering by building a DAG structure with an emphasis on enabling positive engineering, providing an orchestration layer for the current data stack. |
| Cost effectiveness | Free | Paid |
| Flexibility | Supports the creation of dynamic workflows through DAGs, enabling users to define complex dependencies and task relationships; also viable for retraining, A/B testing, and CI/CD. | A Python package that makes it easier to design, test, operate, and construct complicated data applications. It has a user-friendly API that does not require configuration files or boilerplate, and allows for process orchestration and monitoring using industry best practices. |
| Scalability | Easy to set up via the official Helm chart. | To run Prefect, the official Helm chart requires additional configuration to be set up. |
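
For comparison with the Airflow sketch earlier, the snippet below expresses the same two steps as a Prefect flow. It assumes Prefect 2.x; the flow name, retry and timeout values, and task bodies are hypothetical placeholders.

```python
# Minimal feature-engineering flow sketch (assumes Prefect 2.x).
from prefect import flow, task


@task(retries=2, timeout_seconds=3600)
def extract_raw_data() -> list:
    # Placeholder: pull raw records from the data store.
    return []


@task
def build_features(raw: list) -> list:
    # Placeholder: compute model-ready features from the raw records.
    return []


@flow(name="feature-engineering")
def feature_engineering():
    raw = extract_raw_data()
    build_features(raw)


if __name__ == "__main__":
    feature_engineering()  # runs locally; a deployment would schedule this
```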


Model Training and Registry

| Criteria | MLflow | AWS SageMaker |
|---|---|---|
| Description | MLflow is a versatile, expandable, open-source platform for managing workflows and artifacts across the machine learning lifecycle. | Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. |
| Scalability | Supports distributed training and can scale with the underlying infrastructure, e.g. Kubernetes. | Fully managed and automatically scales to handle large datasets and complex models. |
| Cost effectiveness | Open source allows for more control over expenses; depends on the underlying infrastructure. | PAYG model; comes with automatic model tuning to optimise costs. |
| Flexibility | Modular design with components for tracking experiments, packaging artifacts into reproducible runs, and deploying models; scalable depending on the underlying infrastructure. | Supports multiple frameworks and offers built-in algorithms and Jupyter notebooks for development. |
| Accessibility | Platform-agnostic; supports multiple languages and frameworks. | Fully integrated with AWS, allowing for ease of utilisation of other AWS services. |
| Integrated features | Stage transition tags; model lineage; model file versioning; model packaging. | No stage transition tags; limited model lineage; model file versioning; limited model packaging. |
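
To make the "stage transition tags" and registry features above concrete, here is a minimal sketch of logging, registering, and promoting a model with MLflow. It assumes MLflow 2.x and a tracking server at a hypothetical URI; the experiment name, model name, and scikit-learn model are placeholders, and the final stage transition represents the human-in-the-loop approval step described under Design Decisions.

```python
# Minimal MLflow tracking and registry sketch (assumes MLflow 2.x).
# The tracking URI, experiment name, and model name are hypothetical.
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("demo-experiment")

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)                       # hyperparameters
    mlflow.log_metric("train_accuracy", model.score(X, y))  # metrics
    mlflow.sklearn.log_model(                                # model artifact + registration
        model, artifact_path="model", registered_model_name="demo-classifier"
    )

# Human-in-the-loop step: once reviewed, promote the new version to Staging.
client = MlflowClient()
latest = client.get_latest_versions("demo-classifier", stages=["None"])[0]
client.transition_model_version_stage("demo-classifier", latest.version, stage="Staging")
```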


Model Deployment and Serving

| Criteria | MLflow | BentoML | FastAPI |
|---|---|---|---|
| Description | MLflow is a versatile, expandable, open-source platform for managing workflows and artifacts across the machine learning lifecycle. | BentoML is a model serving framework for building AI applications with Python. It can be installed as a library with pip, or through Yatai for Kubernetes. Yatai is the Kubernetes deployment operator for BentoML. | FastAPI is a modern, fast (high-performance) web framework for building APIs with Python, based on standard Python type hints. |
| Model dependency management | Seamless | Yes, through MLflow integration | Manual |
| Compatibility with scikit-learn and PyTorch | Fully compatible with both | Fully compatible with both | Fully compatible with both |
| Flexibility | Flawless integration via KServe for Kubernetes | Same as MLflow, via BentoCloud | Seamless once a container is set up, through any Kubernetes deployment platform |
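
As a minimal sketch of serving, the snippet below wraps a model behind a FastAPI prediction endpoint. The model path and request schema are hypothetical, and the model is assumed to be a scikit-learn estimator saved with joblib (it could equally be loaded from the MLflow registry).

```python
# Minimal FastAPI model-serving sketch.
# "model/model.joblib" and the request schema are hypothetical placeholders.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model/model.joblib")  # artifact pulled from the registry


class PredictionRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
```

Run with, for example, `uvicorn main:app` (assuming the file is saved as main.py); the model is then accessible to end users via an HTTP API, as described in the Model deployment step above.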


Model Monitoring

| Criteria | Evidently AI | Prometheus & Grafana |
|---|---|---|
| Description | Evidently AI, a powerful open-source tool, simplifies ML monitoring by providing pre-built reports and test suites to track data quality, data drift, and model performance. | Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Grafana is a multi-platform, open-source analytics and interactive visualisation web application. It can produce charts, graphs, and alerts for the web when connected to supported data sources. |
| Hardware metrics | Yes | Limited, for both |
| Model performance in production | Yes | No, for both |
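
As a minimal sketch of the Prometheus side of monitoring, the snippet below uses the prometheus_client library to expose prediction counts and latency, which a Prometheus server can then scrape and Grafana can visualise. The metric names, port, and dummy model call are hypothetical placeholders.

```python
# Minimal model-monitoring sketch using the prometheus_client library.
# Metric names, the port, and the dummy model call are hypothetical.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")


def predict(features):
    with LATENCY.time():                        # record how long the call takes
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for the real model call
        PREDICTIONS.inc()                       # count every prediction served
        return 0


if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        predict([1.0, 2.0, 3.0])
        time.sleep(1)
```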


GitOps

| Criteria | Argo CD | Flux |
|---|---|---|
| Description | Argo CD is a declarative GitOps continuous delivery tool for Kubernetes. | Flux is a set of continuous and progressive delivery solutions for Kubernetes that are open and extensible. |
| Architecture | Standalone application with a built-in UI and dashboard. | Set of controllers that run within Kubernetes. |
| User interface | Yes | More CLI-centric |
| Security | Role-based access control (RBAC), single sign-on (SSO), and multi-user support. | RBAC, selective resource access, and SSO provisions. |



Resources

  1. Feature store software evaluation and decision
  2. Neptune.ai
  3. MLOps.org
  4. Databricks Docs
  5. Prometheus
  6. Grafana
  7. AWS
  8. Evidently AI
  9. BentoML
  10. MLflow
  11. Airflow
  12. Prefect
  13. What is an API?
  14. What is a data lake?
  15. What is a database?
  16. Argo vs Flux