Training Pipeline

Table of Contents

Introduction
Evaluation
Team Implementation
Apache Airflow
Prefect
Resources

1. Introduction

This page examines the team’s evaluation of two tools for implementing the training pipeline: Airflow and Prefect. It also links to the repositories for each sub-component of the team’s training pipeline; each repository includes setup instructions.

The team chose Airflow as its DAG implementation software because it is:

Widely used in the community, as indicated by the repository’s GitHub stars (a rough proxy for a project’s popularity and perceived quality)
Actively contributed to by more users, as reflected in the repository’s fork count
Straightforward to set up: Helm Chart for Apache Airflow — helm-chart Documentation

By contrast, the official Prefect Helm chart requires additional configuration before Prefect can run: Welcome to Prefect

Note: Kubeflow does not have an official Helm chart.

2. Evaluation

Apache Airflow

Airflow is an open-source workflow orchestration tool for distributed applications. It schedules jobs across different servers or nodes using DAGs (Directed Acyclic Graphs). The DAG is Airflow’s core concept: it collects Tasks together and organises them with dependencies and relationships that specify how they should run.
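To make the idea of a DAG concrete, here is a minimal sketch that uses only Python’s standard library (`graphlib`, Python 3.9+). It is not Airflow’s API. The task names are hypothetical, loosely based on the pipeline stages linked under Implementation, and the sketch only shows how declared dependencies determine a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical task names, loosely based on the pipeline stages on this page.
# Each key maps to the set of tasks that must finish before it can start.
dependencies = {
    "ingest_data": set(),
    "train_model": {"ingest_data"},
    "monitor_drift": {"train_model"},
}


def execution_order(deps):
    """Return one valid run order for the given dependency mapping.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is what makes the graph "acyclic" in practice.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

An orchestrator such as Airflow applies the same principle, then adds scheduling, retries, and distribution of tasks across workers on top of it.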

Features

Free
Easy Helm chart setup
Supports the creation of dynamic workflows through Directed Acyclic Graphs (DAGs), enabling users to define complex dependencies and task relationships; also viable for retraining, A/B testing, or CI/CD
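As a rough illustration of what a dynamic workflow means for A/B testing, the sketch below builds a dependency graph in code from a list of model variants, then groups tasks into batches that could run in parallel. It uses plain Python (`graphlib`, Python 3.9+) rather than Airflow’s API. The variant names, task names, and helper functions are all assumptions made for this example.

```python
from graphlib import TopologicalSorter


def build_ab_pipeline(variants):
    """Build a dependency mapping with one training task per model variant."""
    deps = {"ingest_data": set()}
    for name in variants:
        # Every variant trains on the same ingested data.
        deps[f"train_{name}"] = {"ingest_data"}
    # The comparison step waits for every variant to finish training.
    deps["compare_models"] = {f"train_{name}" for name in variants}
    return deps


def parallel_batches(deps):
    """Group tasks into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


# Hypothetical variant names for an A/B comparison.
batches = parallel_batches(build_ab_pipeline(["model_a", "model_b"]))
print(batches)
```

Adding a third variant to the list changes the graph without any other code edits, which is the property that makes code-defined DAGs convenient for retraining and A/B experiments.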

Prefect

Prefect aims to reduce negative engineering by building a DAG structure, with an emphasis on enabling positive engineering through an orchestration layer for the current data stack.

Features

Paid tiers for the managed Prefect Cloud service (the core library is open source)
To run Prefect, the official Helm chart requires additional configuration to be set up.
A Python package that makes it easier to design, test, operate, and build complex data applications. It has a user-friendly API that doesn’t require any configuration files or boilerplate, and it supports process orchestration and monitoring following industry best practices.

3. Implementation

Each link below points to one of our repositories, one per process. Each repository includes a README that explains how to set up that process:

➡️ Testing DAGs on Local Kind Cluster

➡️ Data Ingestion DAG

➡️ Model Training DAG

➡️ Drift Monitoring DAG
