Relevant Skills and Roles
Roles
The roles required to create and deploy an MLOps pipeline can vary, though our team consists of:
- Data scientists
- Data analysts
- DevOps engineers
- Machine Learning Engineers
- Domain experts/business translators
Skills
Required skills include knowledge of:
- Scripting languages (e.g. Python)
- Cloud solutions (e.g. AWS, Azure, GCP)
- CI/CD pipeline implementation and Infrastructure as Code (e.g. Terraform)
- Data stores (e.g. AWS S3)
- Machine learning algorithms (e.g. Logistic Regression) and frameworks (e.g. PyTorch, SK-Learn)
- DAG software (e.g. Airflow)
- Logging and monitoring tools (e.g. Evidently AI)
- Containerization (with Kubernetes, Docker)
Architecture Overview
The components of an MLOps workflow, otherwise known as its architecture, can vary depending on the scope and constraints of a given project.
Generally, the following components make up the average MLOps workflow:
- Data store and retrieval
- These can include databases for structured or unstructured data, data lakes, APIs and files (e.g. .csv, .parquet)
- Data and feature engineering
- This step involves the use of Directed Acyclic Graphs (DAGs) to group and automate tasks organised with dependencies and relationships dictating how they should run. These can come with timing intervals, timeouts, etc.
- Model training and registry
- This step involves the use of relevant tools for model versioning, storage, testing, and eventually model deployment subject to approval.
- When different iterations of models are tested, the model, its data and its hyperparameters are stored for future reference. Model registries can also store information like metadata and the lineage of a given model.
- Model deployment
- When the appropriate model is selected, it is then manually pushed to the model server to serve in production. Serving a model is the act of making it accessible to end users, typically via API.
- Model monitoring
- Once a model is deployed, relevant tools are used to monitor its performance, detect occurrences of model drift, and capture logs
Design Decisions:
For the pre-built MLOps pipeline created by the team, a deploy-as-model approach was taken. As such, the architecture followed by the team comprises a model registry with human intervention for stage tags, model deployment, model monitoring, data storage and retrieval, and finally data and feature engineering.
It is important to note that the components and the workflow differ: the components themselves comprise the architecture, while the workflow (the order in which the components are used to create an MLOps pipeline) can vary. For further insight into the typical structure of an MLOps workflow, refer to the resources listed in the Resources section of this page.
Horizon Scan
Below are sets of comparisons of tools you can use for each component of your MLOps pipeline. The evaluation criteria were derived from the team’s requirements, which you may also wish to consider for your own pipeline. Each component’s horizon scan is accompanied by the team’s design decision for that component.
Data Store and Retrieval
| | AWS S3 | APIs | Data Lakes | Databases |
|---|---|---|---|---|
Description | Amazon Simple Storage Service is an object storage service allowing users to store and protect any amount of data for a range of use cases, such as data lakes, backup and restore, and big data analytics | An application programming interface is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software. | A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files | A database is an organised collection of data or a type of data store based on the use of a database management system, the software that interacts with end users, applications, and the database itself to capture and analyse the data. |
Scalability | Highly scalable without human intervention | Enables scalable communication between systems | Optimal for large amounts of structured and unstructured data without predefined schemas | NoSQL and cloud databases are designed to scale; traditional relational databases less so |
Cost Effectiveness | PAYG model | Option to reduce costs via modular system design for reuse of existing components/code | Often more cost effective than DBs for storing large amounts of data | Cloud based - PAYG |
Security | Built-in features including IAM policies and bucket policies, both of which can involve RBAC, keys and encryption | Option to implement robust authentication (e.g. OAuth, API keys) and authorisation mechanisms to control access to data | Option to implement security measures like access controls and encryption | User authentication, RBAC, encryption |
Accessibility | Data can be accessed from anywhere via HTTP/HTTPS, allowing for smooth retrieval via APIs and features like real-time data access; S3 can also act as a centralised repository | Supports real-time data retrieval and updates | Centralised repository making data accessible to different teams within an organisation | Optimised query engines and indexes for efficient data access; easy integration with most applications and BI tools |
Flexibility | Agnostic to file types and other storage formats; high fault tolerance (99.9%); allows different software systems to communicate via APIs | Allows different software systems to communicate regardless of underlying technology stack (via REST, SOAP) | Store data in raw format across wide range of file types | Supports various data models eg relational, document, key-value |
Design decisions:
The team chose Amazon S3 as the data storage system. Digital Catapult is already using AWS for a few other projects and our technologists are comfortable with this technology.
An evaluation of the feature store software the team considered, and the option it chose, can be found in the Resources section of this page.
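To illustrate how the pipeline might read and write data in S3, below is a minimal boto3 sketch. The bucket name, object key and file paths are illustrative placeholders rather than the team’s actual configuration, and reading Parquet assumes pandas has a Parquet engine (e.g. pyarrow) installed.

```python
# Minimal sketch: storing and retrieving training data in S3 with boto3.
# Bucket name, object key, and file paths are placeholders.
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")  # credentials resolved via IAM role or local AWS config

BUCKET = "example-mlops-data"        # placeholder bucket name
KEY = "raw/training_data.parquet"    # placeholder object key

# Upload a local Parquet file to the bucket
s3.upload_file("training_data.parquet", BUCKET, KEY)

# Retrieve the object and load it into a DataFrame
response = s3.get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_parquet(BytesIO(response["Body"].read()))
print(df.shape)
```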
Data and Feature Engineering (Training) Pipeline
| | Apache Airflow | Prefect |
|---|---|---|
Description | Airflow is an open-source workflow orchestration tool used for orchestrating distributed applications. It works by scheduling jobs across different servers or nodes using DAGs (Directed Acyclic Graphs). A DAG is the core concept of Airflow, collecting tasks together and organising them with dependencies and relationships that say how they should run | Prefect reduces negative engineering by building a DAG structure with an emphasis on enabling positive engineering, providing an orchestration layer for the current data stack |
Cost Effectiveness | Free and open source | Open-source core, with a paid managed cloud offering (Prefect Cloud) |
Flexibility | Supports the creation of dynamic workflows through Directed Acyclic Graphs (DAGs), enabling users to define complex dependencies and task relationships. Also viable for retraining, A/B testing and CI/CD | Python package that makes it easier to design, test, operate, and construct complicated data applications. It has a user-friendly API that doesn’t require any configuration files or boilerplate. It allows for process orchestration and monitoring using best industry practices. |
Scalability | Easy to set up via the official Helm chart. | The official Helm chart requires additional configuration to be set up before Prefect can run. |
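To make the DAG concept concrete, here is a minimal Airflow sketch (written against the Airflow 2.x API) with two placeholder tasks and a single dependency between them; the task logic, schedule and retry settings are illustrative assumptions rather than the team’s actual pipeline.

```python
# Minimal sketch of a feature-engineering DAG in Airflow 2.x.
# Task names, schedule, and retry settings are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    print("Pulling raw data, e.g. from S3")  # placeholder task logic


def build_features():
    print("Cleaning data and computing features")  # placeholder task logic


with DAG(
    dag_id="feature_engineering",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    features = PythonOperator(task_id="build_features", python_callable=build_features)

    # Dependency: extraction must finish before feature engineering runs
    extract >> features
```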
Model Training and Registry
| | MLflow | AWS SageMaker |
|---|---|---|
Description | MLflow is a versatile, expandable, open-source platform for managing workflows and artifacts across the machine learning lifecycle | Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment |
Scalability | Supports distributed training and can scale with underlying infrastructure eg Kubernetes | Fully managed and automatically scales to handle large datasets and complex models. |
Cost Effectiveness | Open-source allows for more control over expenses; depends on underlying infrastructure | PAYG model; comes with automatic model tuning to optimise costs |
Flexibility | Modular design with components for tracking experiments, packaging artifacts into reproducible runs and deploying models; scalable depending on underlying infrastructure. | Supports multiple frameworks and offers built-in algorithms and Jupyter notebooks for development. |
Accessibility | Platform-agnostic; supports multiple languages and frameworks | Fully integrated with AWS, allowing for ease of utilisation of other AWS services |
Integrated Features | Stage transition tags; model lineage; model file versioning; model packaging | No stage transition tags; limited model lineage; model file versioning; limited model packaging |
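To show the tracking and registry features compared above, here is a minimal MLflow sketch (written against the MLflow 2.x API) that logs hyperparameters, a metric and a scikit-learn model, and registers the model; the tracking URI, experiment name and registered model name are placeholders.

```python
# Minimal sketch: logging and registering a model with MLflow 2.x.
# Tracking URI, experiment name, and model name are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://localhost:5000")  # placeholder tracking server
mlflow.set_experiment("example-experiment")

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

with mlflow.start_run():
    params = {"C": 1.0, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Log the model artifact and register it in the model registry
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="example-classifier",
    )
```

Stage transition tags (e.g. Staging, Production) can then be applied manually through the MLflow UI or client, matching the human-in-the-loop approach described in the design decisions above.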
Model Deployment and Serving
| | MLflow | BentoML | FastAPI |
|---|---|---|---|
Description | MLflow is a versatile, expandable, open-source platform for managing workflows and artifacts across the machine learning lifecycle | BentoML is a model serving framework for building AI applications with Python. It can be installed as a library with pip, or through Yatai for Kubernetes. Yatai is the Kubernetes deployment operator for BentoML. | FastAPI is a modern, fast (high-performance), web framework for building APIs with Python based on standard Python type hints. |
Model Dependency Management | Seamless | Yes, through MLflow integration | Manual |
Compatibility with SKLearn and PyTorch | Fully compatible with both | Fully compatible with both | Fully compatible with both |
Flexibility | Flawless integration via KServe for Kubernetes | Same as MLflow, via BentoCloud. | Seamless once a container is set up, through any Kubernetes deployment platform. |
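As a minimal sketch of serving a model via an API with FastAPI, the example below loads a pickled scikit-learn model and exposes a single prediction endpoint; the model path, request schema and endpoint name are illustrative placeholders.

```python
# Minimal sketch: serving a scikit-learn model with FastAPI.
# Model path, request schema, and endpoint name are placeholders.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="example-model-server")
model = joblib.load("model.joblib")  # placeholder path to a trained model


class PredictRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
```

The app can be run locally with, for example, `uvicorn main:app --port 8000`, and containerised for deployment to Kubernetes.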
Model Monitoring
| | Evidently.AI | Prometheus & Grafana |
|---|---|---|
Description | Evidently.ai, a powerful open-source tool, simplifies ML monitoring by providing pre-built reports and test suites to track data quality, data drift, and model performance | Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud; it collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Grafana is a multi-platform, open-source analytics and interactive visualisation web application that can produce charts, graphs, and alerts for the web when connected to supported data sources. |
Hardware metrics | Yes | Limited (for both) |
Model performance in production | Yes | No (for both) |
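To illustrate drift detection with Evidently, here is a minimal sketch written against the evidently 0.4.x report API (the API differs in newer releases); the file paths are placeholders, with the reference dataset standing in for training data and the current dataset for recent production data.

```python
# Minimal sketch: generating a data drift report with Evidently 0.4.x.
# File paths are placeholders; reference = training data, current = production data.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.read_parquet("reference.parquet")  # placeholder path
current = pd.read_parquet("current.parquet")      # placeholder path

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# Save an HTML report for review, or export the metrics for alerting
report.save_html("data_drift_report.html")
drift_results = report.as_dict()
```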
GitOps
| | Argo CD | Flux |
|---|---|---|
Description | Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes. | Flux is a set of continuous and progressive delivery solutions for Kubernetes that are open and extensible. |
Architecture | Standalone application with a built-in UI and dashboard. | Set of controllers that run within Kubernetes. |
User interface | Yes | More CLI-centric |
Security | Role-Based Access Control (RBAC) function, single sign-on (SSO), and multi-user support. | RBAC function, selective resource access, and SSO provisions. |