Relevant Skills and Roles
Roles
The roles required to create and deploy an MLOps pipeline can vary, though our team consists of:
- Data scientists
- Data analysts
- DevOps engineers
- Machine Learning Engineers
- Domain experts/business translators
Skills
Required skills include knowledge of:
- Scripting languages (e.g. Python)
- Cloud solutions (e.g. AWS, Azure, GCP)
- CI/CD pipeline implementation and Infrastructure as Code (e.g. Terraform)
- Data stores (e.g. AWS S3)
- Machine learning algorithms (e.g. Logistic Regression) and frameworks (e.g. PyTorch, SK-Learn)
- DAG software (e.g. Airflow)
- Logging and monitoring tools (e.g. Evidently AI)
- Containerization (with Kubernetes, Docker)
Architecture Overview
The components of an MLOps workflow, otherwise known as its architecture, can vary depending on the scope and constraints of a given project.
Generally, the following components make up the average MLOps workflow:
- Data store and retrieval
- These can include databases for structured or unstructured data, data lakes, APIs and files (e.g. .csv, .parquet)
- Data and feature engineering
- This step involves the use of Directed Acyclic Graphs (DAGs) to group and automate tasks organised with dependencies and relationships dictating how they should run. These can come with timing intervals, timeouts, etc.
- Model training and registry
- This step involves the use of relevant tools for model versioning, storage, testing, and eventually model deployment subject to approval.
- When different iterations of models are tested, the model, its data and its hyperparameters are stored for future reference. Model registries can also store information like metadata and the lineage of a given model.
- Model deployment
- When the appropriate model is selected, it is then manually pushed to the model server to serve in production. Serving a model is the act of making it accessible to end users, typically via API.
- Model monitoring
- Once a model is deployed, relevant tools are used to monitor its performance, detect occurrences of model drift, and capture logs
Design Decisions:
For the pre-built MLOps pipeline created by the team, a deploy-as-model approach was taken. As such, the architecture followed by the team comprises a model registry with human intervention for stage tags, model deployment, model monitoring, data storage and retrieval, and finally data and feature engineering.
It is important to note that the components and the workflow differ: the components themselves comprise the architecture, while the workflow (the order in which the components are used to create an MLOps pipeline) can vary. For further insight into the typical structure of an MLOps workflow, refer to the resources listed in the Resources section of this page.
Horizon Scan
Below are sets of comparisons of tools you can use for each component of your MLOps pipeline. The evaluation criteria were derived from the team’s requirements, which you may also wish to consider for your own pipeline. Each component’s horizon scan is accompanied by the team’s design decision for that component.
Data Store and Retrieval
| | AWS S3 | APIs | Data Lakes | Databases |
|---|---|---|---|---|
Description | Amazon Simple Storage Service is an object storage service allowing users to store and protect any amount of data for a range of use cases, such as data lakes, backup and restore, and big data analytics | An application programming interface is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software. | A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files | A database is an organised collection of data or a type of data store based on the use of a database management system, the software that interacts with end users, applications, and the database itself to capture and analyse the data. |
Scalability | Highly scalable without human intervention | Enables scalable communication between systems | Optimal for large amounts of structured and unstructured data without predefined schemas | NoSQL and cloud databases are designed to scale; traditional relational databases less so |
Cost Effectiveness | PAYG model | Option to reduce costs via modular system design for reuse of existing components/code | Often more cost effective than DBs for storing large amounts of data | Cloud based - PAYG |
Security | Built-in features including IAM policies and bucket policies, both of which can involve RBAC, keys and encryption | Option to implement robust authentication (e.g. OAuth, API keys) and authorisation mechanisms to control access to data | Option to implement security measures like access controls and encryption | User authentication, RBAC, encryption |
Accessibility | Data can be accessed from anywhere via HTTP/HTTPS, allowing for smooth retrieval via APIs and features like real-time data access; S3 can also act as a centralised repository | Supports real-time data retrieval and updates | Centralised repository making data accessible to different teams within an organisation | Optimised query engines and indexes for efficient data access; easy integration with most applications and BI tools |
Flexibility | Agnostic to file types and other storage formats; high fault tolerance (99.9%); allows different software systems to communicate via APIs | Allows different software systems to communicate regardless of underlying technology stack (via REST, SOAP) | Store data in raw format across wide range of file types | Supports various data models eg relational, document, key-value |
Design decisions:
The team chose Amazon S3 as the data storage system. Digital Catapult is already using AWS for a few other projects and our technologists are comfortable with this technology.
An evaluation of the feature store software the team considered, and the option it chose, can be found in the Resources section of this page.
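To illustrate how the pipeline might read and write data in S3, below is a minimal boto3 sketch. The bucket name, object key and file paths are illustrative placeholders rather than the team’s actual configuration, and reading Parquet assumes pandas has a Parquet engine (e.g. pyarrow) installed.

```python
# Minimal sketch: storing and retrieving training data in S3 with boto3.
# Bucket name, object key, and file paths are placeholders.
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")  # credentials resolved via IAM role or local AWS config

BUCKET = "example-mlops-data"        # placeholder bucket name
KEY = "raw/training_data.parquet"    # placeholder object key

# Upload a local Parquet file to the bucket
s3.upload_file("training_data.parquet", BUCKET, KEY)

# Retrieve the object and load it into a DataFrame
response = s3.get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_parquet(BytesIO(response["Body"].read()))
print(df.shape)
```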
Data and Feature Engineering (Training) Pipeline
| | Apache Airflow | Prefect |
|---|---|---|
Description | Airflow is an open-source workflow orchestration tool used for orchestrating distributed applications. It works by scheduling jobs across different servers or nodes using DAGs (Directed Acyclic Graphs). A DAG is the core concept of Airflow, collecting tasks together and organising them with dependencies and relationships that say how they should run | Prefect reduces negative engineering by building a DAG structure with an emphasis on enabling positive engineering, providing an orchestration layer for the current data stack |
Cost Effectiveness | Free and open source | Open-source core, with a paid managed cloud offering (Prefect Cloud) |
Flexibility | Supports the creation of dynamic workflows through Directed Acyclic Graphs (DAGs), enabling users to define complex dependencies and task relationships. Also viable for retraining, A/B testing and CI/CD | Python package that makes it easier to design, test, operate, and construct complicated data applications. It has a user-friendly API that doesn’t require any configuration files or boilerplate. It allows for process orchestration and monitoring using best industry practices. |
Scalability | Easy to set up via the official Helm chart. | The official Helm chart requires additional configuration to be set up before Prefect can run. |
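To make the DAG concept concrete, here is a minimal Airflow sketch (written against the Airflow 2.x API) with two placeholder tasks and a single dependency between them; the task logic, schedule and retry settings are illustrative assumptions rather than the team’s actual pipeline.

```python
# Minimal sketch of a feature-engineering DAG in Airflow 2.x.
# Task names, schedule, and retry settings are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    print("Pulling raw data, e.g. from S3")  # placeholder task logic


def build_features():
    print("Cleaning data and computing features")  # placeholder task logic


with DAG(
    dag_id="feature_engineering",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    features = PythonOperator(task_id="build_features", python_callable=build_features)

    # Dependency: extraction must finish before feature engineering runs
    extract >> features
```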
Model Training and Registry
| | MLflow | AWS SageMaker |
|---|---|---|
Description | MLflow is a versatile, expandable, open-source platform for managing workflows and artifacts across the machine learning lifecycle | Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment |
Scalability | Supports distributed training and can scale with underlying infrastructure eg Kubernetes | Fully managed and automatically scales to handle large datasets and complex models. |
Cost Effectiveness | Open-source allows for more control over expenses; depends on underlying infrastructure | PAYG model; comes with automatic model tuning to optimise costs |
Flexibility | Modular design with components for tracking experiments, packaging artifacts into reproducible runs and deploying models; scalable depending on underlying infrastructure. | Supports multiple frameworks and offers built-in algorithms and Jupyter notebooks for development. |
Accessibility | Platform-agnostic; supports multiple languages and frameworks | Fully integrated with AWS, allowing for ease of utilisation of other AWS services |
Integrated Features | Stage transition tags; model lineage; model file versioning; model packaging | No stage transition tags; limited model lineage; model file versioning; limited model packaging |
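To show the tracking and registry features compared above, here is a minimal MLflow sketch (written against the MLflow 2.x API) that logs hyperparameters, a metric and a scikit-learn model, and registers the model; the tracking URI, experiment name and registered model name are placeholders.

```python
# Minimal sketch: logging and registering a model with MLflow 2.x.
# Tracking URI, experiment name, and model name are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://localhost:5000")  # placeholder tracking server
mlflow.set_experiment("example-experiment")

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

with mlflow.start_run():
    params = {"C": 1.0, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Log the model artifact and register it in the model registry
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="example-classifier",
    )
```

Stage transition tags (e.g. Staging, Production) can then be applied manually through the MLflow UI or client, matching the human-in-the-loop approach described in the design decisions above.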
Model Deployment and Serving
| | MLflow | BentoML | FastAPI |
|---|---|---|---|
Description | MLflow is a versatile, expandable, open-source platform for managing workflows and artifacts across the machine learning lifecycle | BentoML is a model serving framework for building AI applications with Python. It can be installed as a library with pip, or through Yatai for Kubernetes. Yatai is the Kubernetes deployment operator for BentoML. | FastAPI is a modern, fast (high-performance), web framework for building APIs with Python based on standard Python type hints. |
Model Dependency Management | Seamless | Yes, through MLflow integration | Manual |
Compatibility with SKLearn and PyTorch | Fully compatible with both | Fully compatible with both | Fully compatible with both |
Flexibility | Flawless integration via KServe for Kubernetes | Same as MLflow, via BentoCloud. | Seamless once a container is set up, through any Kubernetes deployment platform. |
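As a minimal sketch of serving a model via an API with FastAPI, the example below loads a pickled scikit-learn model and exposes a single prediction endpoint; the model path, request schema and endpoint name are illustrative placeholders.

```python
# Minimal sketch: serving a scikit-learn model with FastAPI.
# Model path, request schema, and endpoint name are placeholders.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="example-model-server")
model = joblib.load("model.joblib")  # placeholder path to a trained model


class PredictRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
```

The app can be run locally with, for example, `uvicorn main:app --port 8000`, and containerised for deployment to Kubernetes.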
Model Monitoring
| | Evidently.AI | Prometheus & Grafana |
|---|---|---|
Description | Evidently.ai, a powerful open-source tool, simplifies ML monitoring by providing pre-built reports and test suites to track data quality, data drift, and model performance | Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud; it collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Grafana is a multi-platform, open-source analytics and interactive visualisation web application that can produce charts, graphs, and alerts for the web when connected to supported data sources. |
Hardware metrics | Yes | Limited (for both) |
Model performance in production | Yes | No (for both) |
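To illustrate drift detection with Evidently, here is a minimal sketch written against the evidently 0.4.x report API (the API differs in newer releases); the file paths are placeholders, with the reference dataset standing in for training data and the current dataset for recent production data.

```python
# Minimal sketch: generating a data drift report with Evidently 0.4.x.
# File paths are placeholders; reference = training data, current = production data.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.read_parquet("reference.parquet")  # placeholder path
current = pd.read_parquet("current.parquet")      # placeholder path

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# Save an HTML report for review, or export the metrics for alerting
report.save_html("data_drift_report.html")
drift_results = report.as_dict()
```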
GitOps
| | Argo CD | Flux |
|---|---|---|
Description | Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes. | Flux is a set of continuous and progressive delivery solutions for Kubernetes that are open and extensible. |
Architecture | Standalone application with a built-in UI and dashboard. | Set of controllers that run within Kubernetes. |
User interface | Yes | More CLI-centric |
Security | Role-Based Access Control (RBAC) function, single sign-on (SSO), and multi-user support. | RBAC function, selective resource access, and SSO provisions. |