
Data Versioning

Table of Contents

  1. Introduction
  2. General Comparison
  3. Data Versioning
  4. Where the Data is Stored
  5. How to Version
  6. Potential Options
  7. Verdict
  8. Resources

Introduction

This page details the investigation the team conducted to determine which data version control tool is best suited for our MLOps pipeline and integrates well with our existing technology and tool stack.

The tools compared are MLFlow and DVC (Data Version Control) - the team’s repository for this component can be found here.

1. General Comparison

1.1 DVC

DVC (Data Version Control) is an open-source version control system specifically designed to handle large datasets and machine learning models. It focuses on data versioning, reproducibility, and collaboration.

DVC provides a way to manage data and machine learning models in a version-controlled manner, similar to how Git handles code.

1.2 MLFlow

MLFlow is an open-source platform designed to manage the end-to-end machine learning lifecycle, including experimentation, reproducibility, and deployment. It is particularly well-suited for tracking and managing machine learning experiments.

MLflow offers the ability to track datasets associated with model training events. The metadata associated with a dataset can be stored using the mlflow.log_input() API.

1.3 General Comparison Table:

| Feature | MLflow | DVC |
| --- | --- | --- |
| Primary Focus | Experiment tracking and model management | Data versioning and pipeline management |
| Data Versioning | Limited, relies on external tools | Comprehensive, designed for large datasets |
| Experiment Tracking | Yes, detailed tracking and comparison | Basic, primarily through Git |
| Pipeline Management | Basic, experimental | Advanced, tracks data and code changes |
| Model Management | Advanced, with deployment options | Basic, through versioning |
| Integration | Integrates well with various ML tools | Integrates with Git and various storage backends |
| Reproducibility | Focus on reproducibility of experiments | Strong focus, tracks the entire pipeline along with data and model |
| Ease of Use | User-friendly interface, extensive documentation, simple Python API | Requires understanding of Git, more command-line oriented |

2. Data Versioning

2.1 DVC


DVC versions data in the same way Git versions code: each Git commit points to lightweight .dvc metafiles, which in turn pin exact versions of the data, features, and models kept in the DVC cache or remote.

An image showing the different versions of data, features, and models. Source: DVC

2.2 MLFlow

MLflow does not version data in the Git sense; instead, it records dataset metadata (name, digest, schema, and source) against each experiment run through the mlflow.data API, so the exact data used by a run can be looked up and reloaded later.

3. Where the Data is Stored

3.1 DVC

DVC remotes are distributed storage locations for datasets and ML models. They are similar to Git remotes, but for cached assets. This optional feature is typically used to share or back up copies of all or some of your data. Several types are supported: Amazon S3, Google Drive, SSH, HTTP, and local file systems, among others.

3.2 MLFlow

MLflow Tracking needs the following components:

1. Backend store

Experiment metadata, including the run ID, start and end time, parameters, and metrics, is stored in the backend store. MLflow supports two types of backend storage: file-system-based (e.g. local files) and database-based (e.g. SQLite, PostgreSQL). The default is a local file store under ./mlruns.
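For example, a minimal sketch of pointing the client at a database-backed store (the SQLite file name and the logged names/values are illustrative, not part of our setup):

import mlflow

# Use a SQLite database file as the backend store instead of the default ./mlruns
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("data-versioning-demo")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("data_version", "v1.0")  # illustrative parameter
    mlflow.log_metric("rmse", 0.42)           # illustrative metric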

2. Artifact store

The artifact store persists the typically large artifacts for each run, such as model weights (e.g. a pickled scikit-learn model or a PyTorch model) and data files (e.g. CSV or Parquet files). MLflow stores artifacts in a local directory (./mlruns) by default, but also supports other storage options such as Amazon S3 and Azure Blob Storage.
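As an illustration, a hedged sketch of sending an experiment's artifacts to S3 (the experiment name, bucket path, and file path are placeholders):

import mlflow

# Create an experiment whose artifacts go to S3 instead of the local ./mlruns directory
experiment_id = mlflow.create_experiment(
    "wine-quality",                                       # hypothetical experiment name
    artifact_location="s3://mybucket/mlflow-artifacts",   # hypothetical bucket path
)

with mlflow.start_run(experiment_id=experiment_id):
    # Upload a local data file to this run's artifact store under "data/"
    mlflow.log_artifact("data/train.csv", artifact_path="data")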

4. How to Version

4.1 DVC

DVC works alongside Git and uses Git-like commands to version data.

  1. Install DVC: pip install dvc

  2. Initialize DVC from a git repo: dvc init

  3. Configure remote (s3): dvc remote add -d myremote s3://mybucket/path/to/dvcstore

  4. Add data files or directories: dvc add training_data test_data data.csv

  5. Commit changes: git add *.dvc .gitignore and git commit -m "Add data files to DVC"

  6. Push data to the remote: dvc push

  7. Tag repo for the data version [optional]: git tag -a v1.0 -m "Version 1.0" and git push origin v1.0

  8. When pulling data from the repository: dvc pull or dvc pull <file>.dvc


ℹ️

The above can be done in the same repo where the training happens, which couples the data tightly to the training. Doing it outside the training repo instead makes the data easily retrievable from anywhere else, for example for generating test results or measuring data drift.
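As an illustration of that decoupled use, a minimal sketch with DVC's Python API (the repository URL, file name, and tag are placeholders): any project can read a specific tagged version of the data without cloning the data repo.

import io

import pandas as pd
import dvc.api

# Read data.csv as it existed at tag v1.0 of the data repository
csv_text = dvc.api.read(
    "data.csv",
    repo="https://github.com/our-org/data-ingestion",  # hypothetical data repo
    rev="v1.0",                                         # Git tag marking the data version
)
df = pd.read_csv(io.StringIO(csv_text))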

4.2 MLFlow

As mentioned earlier, MLflow doesn't really version data, but it helps to track the data associated with each experiment run.

MLflow uses two major interfaces for data tracking: Dataset, which holds metadata about a logged dataset (name, digest, schema, and profile), and DatasetSource, which records where the data came from (e.g. a URL, path, or table) so it can be reloaded later.


Detailed steps:

1 - Create a dataset


import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

dataset_source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(dataset_source_url)

# Construct an MLflow PandasDataset from the Pandas DataFrame,
# and specify the web URL as the source
dataset: PandasDataset = mlflow.data.from_pandas(df, source=dataset_source_url)

2 - Log the dataset


with mlflow.start_run():
    # Log the dataset to the MLflow Run. Specify the "training" context 
    # to indicate that the dataset is used for model training
    mlflow.log_input(dataset, context="training")

3 - Retrieve the dataset when needed


# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset

# Load the dataset's source, which downloads the content from the source URL
# to the local filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()


📓
The above way of tracking the data is tightly coupled to the training. If the data (say, the test data) needs to be used outside the training repo, it can be retrieved using the run ID from MLflow.
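A minimal sketch of that retrieval from outside the training repo, assuming the run ID is known (the tracking URI and run ID are placeholders):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # hypothetical tracking server

run = mlflow.get_run("0123456789abcdef")  # run ID recorded during training (placeholder)
dataset_info = run.inputs.dataset_inputs[0].dataset

# Re-download the data from its original source to the local filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()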

An alternative to using mlflow.data might be to use MLflow's log_artifact to log the data itself with every run. However, this does not track differences between versions; it simply pushes the entire dataset used for the run to the artifact store.
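For completeness, a hedged sketch of that alternative (paths are illustrative): the full file is uploaded on each run and later pulled back unchanged by run ID, with no diff between versions.

import mlflow

with mlflow.start_run() as run:
    # Push the entire test set used for this run to the artifact store
    mlflow.log_artifact("data/test.csv", artifact_path="data")

# Later, e.g. from another repo, fetch the exact copy back by run ID
local_dir = mlflow.artifacts.download_artifacts(
    run_id=run.info.run_id, artifact_path="data"
)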

5. Potential Options

5.1 Using MLFlow

Use MLflow only to log the data used for each experiment run. This requires an artifact store to be set up (which is needed anyway for storing the models).

Pros:

Cons:

5.2 Use DVC within the training repo

This method can be used to track the data used for experiment runs.

Add the data directories or files to DVC, and add the DVC metadata files to Git from within the training repo. The train, validation, and test splits need to be tracked here.

Pros:

Cons:

5.3 Use DVC for data versioning during ingestion, with train, eval, and test splits

With this approach the workflow is as follows:

  1. Use dvc in the data ingestion repo (with version tagging to the repo to indicate the data version)

  2. Do ETL related preprocessing of the data as usual in the repo

  3. Split the data into train, val, and test sets and add them separately to DVC

  4. Push the data tracked by DVC to a remote (dvc push)

  5. Git commit, tag, and push the metadata (.dvc) files

  6. Clone or use the data ingestion repo as a submodule in the model training repo or wherever data is needed

  7. Check out the data version that is needed

  8. Pull the data from the remote to local

  9. Train as usual

Pros:

Cons:

Verdict

Based on the comparison of the potential options and our use case, the team decided on DVC as the data versioning tool for our pipeline. DVC's features fit this component better than MLflow's, which revolve more around tracking data than around actual version control.

Resources

  1. DVC official documentation - Home

  2. MLFlow data official documentation - mlflow.data

  3. DVC remote - remote

  4. MLFlow Backend store - Backend Stores

  5. MLFlow Artifact store - Artifact Stores

  6. MLFlow Tracking server - MLflow Tracking Server