4

Experiment Tracking, Model Management, and Dataset Versioning

In this chapter, we will introduce a set of useful tools for experiment tracking, model management, and dataset versioning, which allow you to manage deep learning (DL) projects effectively. The tools we will be discussing in this chapter can help us track many experiments and interpret the results more efficiently, which naturally leads to a reduction in operational costs and a shorter development cycle. By the end of the chapter, you will have hands-on experience with the most popular tools and be able to select the right set of tools for your project.

In this chapter, we’re going to cover the following main topics:

  • Overview of DL project tracking
  • DL project tracking with Weights & Biases
  • DL project tracking with MLflow and DVC
  • Dataset versioning – beyond Weights & Biases, MLflow, and DVC

Technical requirements

You can download the supplemental material for this chapter from this book’s GitHub repository at https://github.com/PacktPublishing/Production-Ready-Applied-Deep-Learning/tree/main/Chapter_4.

Overview of DL project tracking

Training DL models is an iterative process that consumes a lot of time and resources. Therefore, keeping track of all experiments and consistently organizing them can prevent us from wasting our time on unnecessary operations such as training similar models repeatedly on the same set of data. In other words, having well-documented records of all model architectures and their hyperparameter sets, as well as the version of data used during experiments, can help us derive the right conclusion from the experiments, which naturally leads to the project being successful.

Components of DL project tracking

The essential components of DL project tracking are experiment tracking, model management, and dataset versioning. Let’s look at each component in detail.

Experiment tracking

The concept behind experiment tracking is simple: store the description and the motivation of each experiment so that we don’t run another set of experiments for the same purpose. Overall, effective experiment tracking saves us operational costs and allows us to derive the right conclusion from a minimal set of experimental results. One of the basic approaches for effective experiment tracking is adding a unique identifier to each experiment. The information we need to track for each experiment includes project dependencies, the definition of the model architecture, the parameters used, and evaluation metrics. Experiment tracking also includes visualizing ongoing experiments in real time and being able to compare a set of experiments intuitively. For example, if we can check the train and validation losses at every epoch as the model gets trained, we can identify overfitting more quickly, saving some resources. Also, by comparing the results and the set of changes made between two experiments, we can understand how the changes affect the model performance.
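
As a simple illustration, the following is a tool-agnostic sketch of the kind of record you might keep per experiment; the field names and values are made up for this example and would normally be filled in by your training script:

import json
import uuid
from datetime import datetime, timezone

# Illustrative per-experiment record; adapt the fields to your project
experiment_record = {
   "id": str(uuid.uuid4()),  # unique identifier for this experiment
   "created_on": datetime.now(timezone.utc).isoformat(),
   "description": "baseline CNN without data augmentation",
   "dependencies": {"tensorflow": "2.9.1"},
   "model": {"architecture": "simple_cnn", "num_layers": 4},
   "hyperparameters": {"learning_rate": 0.001, "batch_size": 128, "epochs": 50},
   "dataset_version": "v1",
   "metrics": {"train_loss": 0.21, "val_loss": 0.34},
}

with open("experiment_record.json", "w") as f:
   json.dump(experiment_record, f, indent=2)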

Model management

Model management goes beyond experiment tracking as it covers the full life cycle of a model: dataset information, artifacts (any data generated from training a model), the implementation of the model, evaluation metrics, and pipeline information (such as development, testing, staging, and production). Model management allows us to quickly pick up the model of interest and efficiently set up the environment in which the model can be used.

Dataset versioning

The last component of DL project tracking is dataset versioning. In many projects, datasets change over time. Changes can come from data schemas (blueprints of how the data is organized), file locations, or even from filters applied to the dataset that alter the meaning of the underlying data. Many datasets found in the industry are structured in a complex way and are often stored in multiple locations in various data formats. Therefore, changes can be more dramatic and harder to track than you might anticipate. As a result, keeping a record of the changes is critical for reproducing consistent results throughout the project.

Dataset tracking can be summarized as follows: a set of data stored as an artifact should become a new version of the artifact whenever the underlying data is modified. In addition, every artifact should have metadata that consists of important information about the dataset: when it was created, who created it, and how it differs from the previous version.

For example, a dataset with dataset versioning should be formulated as follows. The dataset should have a timestamp in its name:

dataset_<timestamp>
> metadata.json
> img1.png
> img2.png
> img3.png

As mentioned previously, the metadata should contain key information about the dataset:

{
   "created_by": "Adam",
   "created_on": "2022-01-01",
   "labelled_by": "Bob",
   "number_of_samples": 3
}

Please note that the set of information that’s tracked by metadata may be different for each project.

Tools for DL project tracking

DL tracking can be achieved in various ways, from simple notes in a text file or a spreadsheet, through keeping the information on GitHub or dedicated web pages, to self-built platforms and external tools. Model and data artifacts can be stored as is, or more sophisticated methods can be applied to avoid redundancy and increase efficiency.

The field of DL project tracking is growing fast, and new tools are introduced continuously. As a result, selecting the right tool for the underlying project is not an easy task. We must consider both business and technical constraints. While pricing is the most obvious one, other constraints can be introduced by the existing development setup: the new tool should integrate easily with the existing tools, and the resulting infrastructure must be easy to maintain. It is also important to consider the engineering competence of the MLOps team. With that in mind, the following list would be a good starting point when you’re selecting a tool for your project.

  • TensorBoard (https://www.tensorflow.org/tensorboard):
    • An open source visualization tool developed by the TensorFlow team
    • A standard tool for tracking and visualizing the experimental results
  • Weights & Biases (https://wandb.ai):
    • A cloud-based service with an effective and interactive dashboard for visualizing and organizing the experimental results
    • The server can be run locally or hosted in a private cloud
    • It provides an automated hyperparameter-tuning feature called Sweeps
    • Free for personal projects. Pricing is based on the tracking hours and storage space
  • Neptune (https://neptune.ai):
    • An online tool for monitoring and storing the artifacts from machine learning (ML) experiments
    • It can easily be integrated with the other ML tools
    • It’s known for its powerful dashboard which summarizes the experiments in real time
  • MLflow (https://mlflow.org):
    • An open source platform that offers end-to-end ML life cycle management
    • It supports both Python and R-based systems. It is often used in combination with Data Version Control (DVC)
  • SageMaker Studio (https://aws.amazon.com/sagemaker/studio/):
    • A web-based visual interface for managing ML experiments set up with SageMaker
    • The tool allows users to efficiently build, train, and deploy models by providing simple integrations to the other useful features of AWS
  • Kubeflow (https://www.kubeflow.org):
    • An open source platform designed by Google for end-to-end ML orchestration and management
    • It is also designed for deploying ML systems to various development and production environments efficiently
  • Valohai (https://valohai.com):
    • A DL management platform designed for automatic machine orchestration, version control, and data pipeline management
    • It is not free software as it is designed for enterprise use
    • It is gaining popularity for being technology agnostic and having a responsive support team

Out of the various tools, we will cover the two most commonly used settings: Weights & Biases and MLflow combined with DVC.

Things to remember

a. The essential components of DL tracking are experiment tracking, model management, and dataset versioning. Recent DL tracking tools often have user-friendly dashboards that summarize the experimental results.

b. The field is growing and there are many tools with different advantages. Selecting the right tool involves understanding both business and technical constraints.

First, let’s look at DL project tracking with Weights & Biases (W&B).

DL project tracking with Weights & Biases

W&B is an experiment management platform that provides versioning for models and data.

W&B provides an interactive dashboard that can be embedded in Jupyter notebooks or used as a standalone web page. Its simple Python API also makes integration straightforward. Furthermore, its features focus on simplifying DL experiment management: logging and monitoring model and data versions, hyperparameter values, evaluation metrics, artifacts, and other related information.

Another interesting feature of W&B is its built-in hyperparameter search feature called Sweeps (https://docs.wandb.ai/guides/sweeps). Sweeps can easily be set up using the Python API, and the results and models can be compared interactively on the W&B web page.
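
The following is a minimal sketch of a Sweeps setup; the search space, the train function, and the logged val_loss value here are illustrative placeholders rather than code from the book’s repository:

import wandb

# Illustrative search space; adjust it to your model's hyperparameters
sweep_config = {
   "method": "random",  # can also be "grid" or "bayes"
   "metric": {"name": "val_loss", "goal": "minimize"},
   "parameters": {
       "learning_rate": {"min": 0.0001, "max": 0.01},
       "batch_size": {"values": [32, 64, 128]},
   },
}

def train():
   # Each agent call starts a new run whose config is chosen by the sweep
   with wandb.init() as run:
       config = run.config
       # ... build and train a model using config.learning_rate, etc. ...
       run.log({"val_loss": 0.5})  # placeholder value for illustration

sweep_id = wandb.sweep(sweep_config, project="example-DL-Book")
wandb.agent(sweep_id, function=train, count=5)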

Finally, W&B lets you create reports that summarize and organize a set of experiments intuitively (https://docs.wandb.ai/guides/reports).

Overall, the key functionalities of W&B can be summarized as follows:

  • Experiment tracking and management
  • Artifact management
  • Model evaluation
  • Model optimization
  • Collaborative analysis

W&B is a subscription-based service, but personal accounts are free of charge.

Setting up W&B

W&B has a Python API that provides simple integration methods for many DL frameworks, including TensorFlow and PyTorch. The logged information, such as projects, teams, and the list of runs, is managed and visible online or on a self-hosted server.

The first step of setting up W&B is to install the Python API and log into the W&B server. You must create an account beforehand through https://wandb.ai:

pip install wandb

wandb login

Within your Python code, you can register a single experiment that will be called run-1 through the following line of code:

import wandb
run_1 = wandb.init(project="example-DL-Book", name="run-1") 

More precisely, the wandb.init function creates a new wandb.Run instance named run_1 within a project called example-DL-Book. If a name is not provided, W&B will generate a random two-word name for you. If the project name is empty, W&B will put your run into the Uncategorized project. All the parameters of wandb.init are listed at https://docs.wandb.ai/ref/python/init, but we would like to introduce the ones that you will mostly interact with (see the short example after this list):

  • id sets a unique ID for your run
  • resume allows you to resume an experiment without creating a new run
  • job_type allows you to assign your run to a specific type such as training, testing, validation, exploration, or any other name that can be used for grouping the runs
  • tags gives you additional flexibility for organizing your runs
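
The following illustrative call combines these parameters; the id and tag values are made up for this example:

import wandb

run = wandb.init(
   project="example-DL-Book",
   name="run-2",
   id="run-2-unique-id",  # a unique ID; reuse it with resume to continue a run
   resume="allow",        # resume the run with this ID if it already exists
   job_type="training",
   tags=["baseline", "chapter-4"],
)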

When the wandb.init function is triggered, information about the run will start appearing on the W&B dashboard. You can monitor the dashboard on the W&B web page or directly in the Jupyter notebook environment, as shown in the following screenshot:

Figure 4.1 – The W&B dashboard inside a Jupyter notebook environment

When the run is created, you can start logging information; the wandb.log function allows you to log any data you want. For example, you can log loss during training by adding wandb.log({"custom_loss": custom_loss}) to the training loop. Similarly, you can log validation loss and any other details that you want to keep track of.
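
As a minimal sketch, the loop below logs dummy loss values at each epoch; in practice, you would log the losses computed by your training and validation code:

import math
import wandb

run = wandb.init(project="example-DL-Book", name="run-1")
for epoch in range(10):
   # Replace these dummy values with your real training/validation losses
   custom_loss = math.exp(-0.3 * epoch)
   val_loss = math.exp(-0.25 * epoch) + 0.05
   wandb.log({"custom_loss": custom_loss, "val_loss": val_loss}, step=epoch)
run.finish()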

Interestingly, W&B made this process even simpler by providing built-in logging functionalities for DL models. At the time of writing, you can find integrations for most frameworks, including Keras, PyTorch, PyTorch Lightning, TensorFlow, fast.ai, scikit-learn, SageMaker, Kubeflow, Docker, Databricks, and Ray Tune (for details, see https://docs.wandb.ai/guides/integrations).

wandb.config is an excellent place to track model hyperparameters. For any artifacts from experiments, you can use the wandb.log_artifact method (for more details, see https://docs.wandb.ai/guides/artifacts). When logging an artifact, you need to define a file path and then assign the name and type of your artifact, as shown in the following code snippet:

wandb.log_artifact(file_path, name='new_artifact', type='my_dataset')

Then, you can reuse the artifact that’s been stored, as follows:

run = wandb.init(project="example-DL-Book")
artifact = run.use_artifact('example-DL-Book/new_artifact:v0', type='my_dataset')
artifact_dir = artifact.download()

So far, you have learned how to set up wandb for your project and log metrics and artifacts of your choice individually throughout training. As mentioned previously, wandb also provides automatic logging for many DL frameworks. In this chapter, we will take a closer look at the W&B integrations for Keras and PyTorch Lightning (PL).

Integrating W&B into a Keras project

In the case of Keras, integration can be achieved through the WandbCallback class. The complete version can be found in this book’s GitHub repository:

import wandb
from wandb.keras import WandbCallback
from tensorflow import keras
from tensorflow.keras import layers

# Passing the hyperparameters to wandb.init ensures they are logged with the run
wandb.init(
   project="example-DL-Book",
   name="run-1",
   config={
       "learning_rate": 0.001,
       "epochs": 50,
       "batch_size": 128
   })
config = wandb.config

model = keras.Sequential()
# ... add layers and compile the model here (see the book's GitHub repository) ...

logging_callback = WandbCallback(log_evaluation=True)
model.fit(
   x=x_train, y=y_train,
   epochs=config["epochs"],
   batch_size=config["batch_size"],
   verbose='auto',
   validation_data=(x_valid, y_valid),
   callbacks=[logging_callback])

As described in the previous section, key information about the models gets logged and becomes available on the W&B dashboard. You can monitor losses, evaluation metrics, and hyperparameters. Figure 4.2 shows the sample plots that are generated automatically by W&B through the preceding code:

Figure 4.2 – Sample plots generated by W&B from logged metrics

Integrating W&B into a PL project is similar to integrating W&B into a Keras project.

Integrating W&B into a PyTorch Lightning project

For a project based on PL, W&B provides a custom logger and hides most of the boilerplate code. All you need to do is instantiate the WandbLogger class and pass it to the Trainer instance through the logger parameter:

import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="example-DL-Book")
trainer = Trainer(logger=wandb_logger)

class LitModule(pl.LightningModule):
   def __init__(self, *args, **kwargs):
       super().__init__()
       self.save_hyperparameters()
   def training_step(self, batch, batch_idx):
       ...
       loss = ...  # compute the training loss here
       self.log("train/loss", loss)
       return loss

A detailed explanation of the integration can be found at https://pytorch-lightning.readthedocs.io/en/stable/extensions/generated/pytorch_lightning.loggers.WandbLogger.html.

Things to remember

a. W&B is an experiment management platform that helps in tracking different versions of models and data. It also supports storing configurations, hyperparameters, data, and model artifacts while providing experiment tracking in real time.

b. W&B is easy to set up. It provides a built-in integration feature for many DL frameworks, including TensorFlow and PyTorch.

c. W&B can be used to perform hyperparameter tuning/model optimization.

While W&B has been dominating the field of DL project tracking, the combination of MLflow and DVC is another popular setup for a DL project.

DL project tracking with MLflow and DVC

MLflow is a popular framework that supports tracking technical dependencies, model parameters, metrics, and artifacts. The key components of MLflow are as follows:

  • Tracking: It keeps track of result changes every time the model runs
  • Projects: It packages model code in a reproducible way
  • Models: It organizes model artifacts for convenient future deployments
  • Model Registry: It manages the full life cycle of an MLflow model
  • Plugins: It provides flexible plugins so that MLflow can easily be integrated with other DL frameworks and tools

As you may have already noticed, there are some similarities between W&B and MLflow. However, in the case of MLflow, every experiment is linked with a set of Git commits. Git does not prevent us from saving datasets, but it has significant limitations when datasets are large, even with an extension built for large files (Git LFS). Thus, MLflow is commonly combined with DVC, an open source version control system that solves Git’s limitations.

Setting up MLflow

MLflow can be installed using pip:

pip install mlflow

Similar to W&B, MLflow also provides a Python API that allows you to track hyperparameters (log_param), evaluation metrics (log_metric), and artifacts (log_artifacts):

import os
import mlflow
from mlflow import log_metric, log_param, log_artifacts
log_param("epochs", 30)
log_metric("custom", 0.6)
log_metric("custom", 0.75) # metrics can be updated
if not os.path.exists("artifact_dir"):
   os.makedirs("artifact_dir")
with open("artifact_dir/test.txt", "w") as f:
   f.write("simple example")
log_artifacts("artifact_dir")

The experiment definition can be initialized and tagged with the following code:

exp_id = mlflow.create_experiment("DLBookModel_1")
exp = mlflow.get_experiment(exp_id)
with mlflow.start_run(experiment_id=exp.experiment_id, run_name='run_1') as run:
   # logging starts here
   mlflow.set_tag('model_name', 'model1_dev')

MLflow has provided a set of tutorials that introduce its APIs: https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html.

Now that you are familiar with the basic usage of MLflow, we will describe how it can be integrated for Keras and PL projects.

Integrating MLflow into a Keras project

First, let’s take a look at Keras integration. Logging the details of a Keras model using MLflow can be achieved through the log_model function:

history = keras_model.fit(...)
mlflow.keras.log_model(keras_model, model_dir)

The mlflow.keras and mlflow.tensorflow modules provide a set of APIs for logging various information about Keras and TensorFlow models, respectively. For additional details, please look at https://www.mlflow.org/docs/latest/python_api/index.html.
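
For instance, assuming an MLflow version where autologging for TensorFlow/Keras is available, a single autolog() call records parameters, metrics, and the trained model without explicit log_* calls; keras_model, x_train, and the other variables below are the same placeholders used in the earlier snippets:

import mlflow
import mlflow.tensorflow

# Autologging captures parameters, metrics, and the trained model from fit()
mlflow.tensorflow.autolog()
with mlflow.start_run(run_name="keras_autolog_run"):
   history = keras_model.fit(
       x_train, y_train,
       validation_data=(x_valid, y_valid),
       epochs=10)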

Integrating MLflow into a PyTorch Lightning project

Similar to how W&B supports PL projects, MLflow also provides an MLFlowLogger class. This can be passed to a Trainer instance for logging the model details in MLflow:

import pytorch_lightning as pl 
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import MLFlowLogger
mlf_logger = MLFlowLogger(experiment_name="example-DL-Book", tracking_uri="file:./ml-runs")
trainer = Trainer(logger=mlf_logger)
class DLBookModel(pl.LightningModule):
   def __init__(self):
       super(DLBookModel, self).__init__()
       ...
   def training_step(self, batch, batch_nb):
       ...
       loss = ...  # compute the training loss here
       self.log("train_loss", loss, on_epoch=True)
       return loss

In the preceding code, we have passed an instance of MLFlowLogger to replace the default logger of PL. The tracking_uri argument controls where the logged data goes.

Other details about PyTorch integration can be found on the official website: https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.loggers.mlflow.html.

Setting up MLflow with DVC

To use DVC to manage large datasets, you need to install it using a package manager such as pip, conda, or brew (for macOS users):

pip install dvc

All the installation options can be found at https://dvc.org/doc/install.

Managing datasets using DVC requires a set of commands to be executed in a specific order:

  1. The first step is to set up a Git repository with DVC:

    git init

    dvc init

    git commit -m 'initialize repo'

  2. Now, we need to configure the remote storage for DVC:

    dvc remote add -d myremote /tmp/dvc-storage

    git commit .dvc/config -m "Added local remote storage"

  3. Let’s create a sample data directory and fill it with some sample data:

    mkdir data

    cp example_data.csv data/

  4. At this stage, we are ready to start tracking the dataset. We just need to add our file to DVC. This operation will create an additional file, example_data.csv.dvc. In addition, the example_data.csv file gets added to .gitignore automatically so that Git no longer tracks the original file:

    dvc add data/example_data.csv

  5. Next, you need to commit and upload the example_data.csv.dvc and .gitignore files. We will tag our first dataset as v1:

    git add data/.gitignore data/example_data.csv.dvc

    git commit -m 'data tracking'

    git tag -a 'v1' -m 'test_data'

    dvc push

  6. After using the dvc push command, our data will be available on remote storage. This means we can remove the local version. To restore example_data.csv, you can simply call dvc pull:

    dvc pull data/example_data.csv.dvc

  7. When example_data.csv is modified, we need to add and push again to update the version on remote storage. We will tag the modified dataset as v2:

    dvc add data/example_data.csv

    git add data/example_data.csv.dvc

    git commit -m 'data modification description'

    git tag -a 'v2' -m 'modified test_data'

    dvc push

After executing these commands, you will have two versions of the same dataset being tracked by Git and DVC: v1 and v2.
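
If you later need the first version of the dataset back in your working copy, a common pattern (a sketch, assuming the v1 tag created previously) is to check out the corresponding Git tag and let DVC synchronize the data files:

git checkout v1

dvc checkout

Checking out a newer tag or your working branch, followed by another dvc checkout, brings back the latest version.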

Next, let’s look at how MLflow can be combined with DVC:

import mlflow
import dvc.api
import pandas as pd
data_path='data/example_data.csv'
repo='/Users/BookDL_demo/'
version='v2'
data_url = dvc.api.get_url(path=data_path, repo=repo, rev=version)
# this will fetch the right version of our data file
data = pd.read_csv(data_url)
# log important information using mlflow
mlflow.start_run()
mlflow.log_param("data_url", data_url)
mlflow.log_artifact(...)

In the preceding code snippet, mlflow.log_artifact can be used to save additional information about the experiment, such as a summary of specific columns of the dataset.

Overall, we can run multiple experiments through MLflow with different versions of the dataset tracked by DVC. Similar to W&B, MLflow also provides a web page where we can compare our experiments. All you need to do is type the following command in the terminal:

mlflow ui

This command will start a web server hosting a web page on http://127.0.0.1:5000. The following screenshot shows the MLflow dashboard:

Figure 4.3 – The MLflow dashboard; new runs will be populated at the bottom of the page

Things to remember

a. MLflow can track dependencies, model parameters, metrics, and artifacts. It is often combined with DVC for efficient dataset versioning.

b. MLflow can easily be integrated with DL frameworks, including Keras, TensorFlow, and PyTorch.

c. MLflow provides an interactive visualization where multiple experiments can be analyzed at the same time.

So far, we have learned how to manage DL projects with W&B, as well as with MLflow combined with DVC. In the next section, we will introduce popular tools for dataset versioning.

Dataset versioning – beyond Weights & Biases, MLflow, and DVC

Throughout this chapter, we have seen how datasets can be managed by DL project-tracking tools. In the case of W&B, we can use artifacts, while in the case of MLflow and DVC, DVC runs on top of a Git repository to track different versions of datasets, thereby solving the limitations of Git.

Are there any other methods and/or tools that are useful for dataset versioning? The simple answer is yes, but again, the more precise answer depends on the context. To make the right choice, you must consider various aspects including cost, ease of use, and integration difficulty. In this section, we will mention a few tools that we believe are worth exploring if dataset versioning is one of the critical components of your project:

  • Neptune (https://docs.neptune.ai) is a metadata store for MLOps. Neptune artifacts allow versioning to be conducted on datasets that are stored locally or in the cloud.
  • Delta Lake (https://delta.io) is an open source storage layer that runs on top of a data lake. Delta Lake works with Apache Spark APIs and uses distributed processing to improve throughput and efficiency (a brief versioning sketch follows this list).
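
The following is a brief sketch of Delta Lake’s versioning ("time travel") feature, assuming a PySpark environment with the delta-spark package installed; the table path is made up for this example:

from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extension
spark = (SparkSession.builder
        .appName("delta-versioning-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())

# Every write to a Delta table creates a new table version automatically
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/example_table")

# "Time travel": read the table as it looked at an earlier version
old_df = (spark.read.format("delta")
         .option("versionAsOf", 0)
         .load("/tmp/delta/example_table"))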

Things to remember

a. There are many data versioning tools on the market. To select the right tool, you must consider various aspects including cost, ease of use, and integration difficulty.  

b. Tools such as W&B, MLflow, DVC, Neptune, and Delta Lake can help you with dataset versioning.

With that, we have introduced popular tools for dataset versioning. The right tool differs project by project. Therefore, you must evaluate the pros and cons of each tool before integrating one into your project.

Summary

Since DL projects involve many iterations of model training and evaluation, efficiently managing experiments, models, and datasets can help the team reach its goal faster. In this chapter, we looked at the two most popular settings for DL project tracking: W&B, and MLflow integrated with DVC. Both settings provide built-in support for Keras and PL, two of the most popular DL frameworks. We have also spent some time describing tools that put more emphasis on dataset versioning: Neptune and Delta Lake. Please keep in mind that you must evaluate each tool thoroughly to select the right one for your project.

At this point, you are familiar with the frameworks and processes for building a proof of concept and training the necessary DL model. Starting from the next chapter, we will discuss how to scale up by moving individual components of the DL pipeline to the cloud.
