In this chapter, you will learn how to use MLflow to create a local environment in which you can develop your machine learning project with the different features MLflow provides. This chapter focuses on machine learning engineering, and one of the most important roles of a machine learning engineer is to build an environment where model developers and practitioners can work efficiently. We will also walk through a hands-on example of using a workbench to accomplish specific tasks.
Specifically, we will look at the following topics in this chapter:
For this chapter, you will need the following prerequisites:
The latest version of Docker Compose installed. If you don’t already have it installed, please follow the instructions at https://docs.docker.com/compose/install/.
A data science workbench is an environment to standardize the machine learning tools and practices of an organization, allowing for rapid onboarding and development of models and analytics. One critical machine learning engineering function is to support data science practitioners with tools that empower and accelerate their day-to-day activities.
In a data science team, the ability to rapidly test multiple approaches and techniques is paramount. Every day, new libraries and open source tools are created. It is common for a project to need more than a dozen libraries in order to test a new type of model. This multitude of libraries, if not curated correctly, can cause bugs or incompatibilities in the model.
Data is at the center of a data science workflow. Having clean datasets available for developing and evaluating models is critical. Given the abundance of huge datasets, specialized big data tooling is often necessary to process the data. Data arrives in multiple formats and at multiple velocities for analysis or experimentation, and through multiple mediums: files, cloud storage, or REpresentational State Transfer (REST) application programming interfaces (APIs).
Data science is mostly a collaborative craft; sharing models and processes among team members is part of the workflow. Invariably, one pain point that emerges from that activity is the cross-reproducibility of model development jobs among practitioners. Data scientist A shares a training script for a model that assumes version 2.6 of a library, but data scientist B is using a version 2.8 environment. Tracing and fixing the issue can take hours in some cases. If this problem occurs in a production environment, it can become extremely costly to the company.
When iterating over a model, for instance, each run contains multiple parameters that can be tweaked to improve it. Maintaining traceability of which parameters yielded a specific performance metric, such as accuracy, can be problematic if we don't store the details of each experiment in a structured manner. Going back to a specific batch of settings that produced a better model may be impossible if we only keep the latest settings during the model development phase.
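To make this concrete, independent of MLflow, here is a minimal sketch of what "storing experiment details in a structured manner" means: each run keeps its parameters alongside the metric they produced, so the best configuration can always be recovered. The parameter names and values below are purely illustrative:

```python
# Hypothetical illustration: record every run's parameters together with
# the metric they produced, so no configuration is ever lost.
runs = [
    {"params": {"learning_rate": 0.1, "n_estimators": 50}, "accuracy": 0.81},
    {"params": {"learning_rate": 0.05, "n_estimators": 200}, "accuracy": 0.87},
    {"params": {"learning_rate": 0.01, "n_estimators": 100}, "accuracy": 0.84},
]

# At any point, recover the settings behind the best metric observed so far.
best_run = max(runs, key=lambda run: run["accuracy"])
print(best_run["params"])
```

An experiment tracker such as MLflow does exactly this bookkeeping for you, persistently and across team members.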
The need to iterate quickly can cause many frustrations when translating prototype code to a production environment, where it must execute in a reliable manner. For instance, if you develop a new trading model on a Windows machine with easy access to graphics processing units (GPUs) for inference, your engineering team member may decide to reuse the existing Linux infrastructure without GPU access. This leads to a situation where an algorithm that runs locally in 30 seconds ends up taking 5 hours in production, impacting the final outcome of the project.
It is clear that a data science department risks systemic technical pain if issues related to the environment and tools are not addressed upfront. To summarize, we can list the following main points as described in this section:
A data science workbench addresses the pain points described in this section by creating a structured environment where a machine learning practitioner can be empowered to develop and deploy their models reliably, with reduced friction. A no-friction environment will allow highly costly model development hours to be focused on developing and iterating models, rather than on solving tooling and data technical issues.
After having delved into the motivation for building a data science workbench for a machine learning team, we will next start designing the data science workbench based on known pain points.
In order to address the common frictions of model development described in the previous section, we need to provide data scientists and practitioners with a standardized environment in which they can develop and manage their work. An environment that comes with a curated set of starting tools and frameworks allows data scientists to rapidly jump-start a project.
The data scientist and machine learning practitioner are at the center of the workbench: they should have a reliable platform that allows them to develop and add value to the organization, with their models at their fingertips.
The following diagram depicts the core features of a data science workbench:
In order to think about the design of our data science workbench and based on the diagram in Figure 3.1, we need the following core features in our data science workbench:
Important note
In this section, we will implement the foundations of a data science workbench from scratch with MLflow, with support primarily for local development. There are a couple of very opinionated and feature-rich options provided by cloud providers, such as Amazon Web Services (AWS) SageMaker, Google AI, and Azure Machine Learning (Azure ML).
Machine learning engineering teams are free to choose the technologies that best fit the use cases of the teams they serve.
The following steps demonstrate a good workflow for development with a data science workbench:
a) Data: This will contain all the data assets of your current project
b) Notebooks: To hold all the iterative development notebooks with all the steps required to produce the model
c) Model: A folder that contains the binary model or a reference to models, potentially in binary format
d) Source Code: A folder to store the structured code component of the code and reusable libraries
e) Output: A folder for any specific outputs of the project—for instance, visualizations, reports, or predictions
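The layout above can be scaffolded in a few lines of standard Python. This is a minimal sketch: the folder names mirror the list above (with `source_code` standing in for the Source Code folder), and a temporary directory is used here only so the snippet is self-contained:

```python
import os
import tempfile

# Folder names follow the project layout described above (illustrative names).
PROJECT_FOLDERS = ["data", "notebooks", "model", "source_code", "output"]

# A temporary directory stands in for the real project root.
project_root = tempfile.mkdtemp(prefix="ds_project_")
for folder in PROJECT_FOLDERS:
    os.makedirs(os.path.join(project_root, folder), exist_ok=True)

created = sorted(os.listdir(project_root))
print(created)
```

In practice, a workbench would generate this scaffold automatically (for example, through a Makefile target or a project template) so every project starts from the same structure.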
Establishing a data science workbench provides a tool for acceleration and democratization of machine learning in the organization, due to standardization and efficient adoption of machine learning best practices.
We will start our workbench implementation with sensible components that are widely used in industry.
We will have the following components in the architecture of our development environment:
Our data science workbench design can be seen in the following diagram:
Figure 3.2 illustrates the layout of the proposed components that will underpin our data science workbench.
The usual workflow of the practitioner, once the environment is up and running, is to develop their code in Jupyter and run their experiments with MLflow support. The environment will automatically route to the right MLflow installation configured to the correct backend, as shown in Figure 3.2.
Important note
Our data science workbench, as defined in this chapter, is a complete local environment. As the book progresses, we will introduce cloud-based environments and link our workbench to shared resources.
A sample layout of the project is available in the following GitHub folder:
https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/tree/master/Chapter03/gradflow
You can see a representation of the general layout of the workbench in terms of files here:
├── Makefile
├── README.md
├── data
├── docker
├── docker-compose.yml
├── docs
├── notebooks
├── requirements.txt
├── setup.py
├── src
├── tests
└── tox.ini
The main elements of this folder structure are outlined here:
We will now move on to using our own development environment for a stock-prediction problem, based on the framework we have just built.
In this section, we will set up a new project with the workbench. Follow the instructions step by step to start up your environment and use the workbench for the stock-prediction project.
Important note
It is critical that all packages/libraries listed in the Technical requirements section are correctly installed on your local machine to enable you to follow along.
Next, we will explore your own development environment, based on the design shown in the previous section. Please execute the following steps:
make
$ docker ps
The following screenshot presents three Docker containers: the first for Jupyter, the second for MLflow, and the third for the PostgreSQL database. The status should show Up x minutes:
The usual ports used by your workbench are as follows: Jupyter serves on port 8888, MLflow on port 5000, and PostgreSQL on port 5432.
In case any of the containers fail, you should check whether the ports are already in use by other services. If this is the case, you will need to stop the conflicting services.
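A quick way to check whether a port is already taken before troubleshooting the containers is the following stdlib sketch; the `port_in_use` helper is ours, not part of the workbench:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # meaning a service is already bound to that port.
        return sock.connect_ex((host, port)) == 0

# The three ports the workbench expects: Jupyter, MLflow, PostgreSQL.
for service, port in {"jupyter": 8888, "mlflow": 5000, "postgres": 5432}.items():
    print(service, "port", port, "in use:", port_in_use(port))
```

If a port reports in use before you start the workbench, another service holds it and should be stopped first; after `make`, all three should report in use.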
Check your Jupyter Notebooks environment at http://localhost:8888, as illustrated in the following screenshot:
You should have a usable environment, allowing you to create new notebook files in the specified folder.
Check your MLflow environment at http://localhost:5000, as illustrated in the following screenshot:
Figure 3.5 shows your experiment tracker environment in MLflow that you will use to visualize your experiments running in MLflow.
Run a sample experiment in MLflow by running the notebook file available in /notebooks/mlflow_sample.ipynb, as illustrated in the following screenshot:
The code in Figure 3.6 imports MLflow and creates a dummy experiment manually, on the second line, using mlflow.set_experiment('mlflow_experiment').
The with mlflow.start_run() statement is responsible for starting and tearing down the run in MLflow.
In the three following lines, we log a couple of string-type test parameters, using the mlflow.log_param function. To log numeric values, we will use the mlflow.log_metric function.
Finally, we also log the entire file that executed the function to ensure traceability of the model and the code that originated it, using the mlflow.log_artifact("mlflow_example.ipynb") function.
Check the sample runs, to confirm that the environment is working correctly. You should go back to the MLflow user interface (UI) available at http://localhost:5000 and check if the new experiment was created, as shown in the following screenshot:
Figure 3.7 displays the additional parameters that we used on our specific experiment and the specific metric named i that is visible in the Metrics column.
Next, you should click on the experiment created to have access to the details of the run we have executed so far. This is illustrated in the following screenshot:
Apart from details of the metrics, you also have access to the mlflow_example notebook file at a specific point in time.
At this stage, you have your environment running and working as expected. Next, we will update it with our own algorithm; we’ll use the one we created in Chapter 2, Your Machine Learning Project.
Let’s update the notebook file that we created in Chapter 2, ML Problem Framing, and add it to the notebook folder on your local workbench. The code excerpt is presented here:
import random

import mlflow

class RandomPredictor(mlflow.pyfunc.PythonModel):
    def __init__(self):
        pass

    def predict(self, context, model_input):
        return model_input.apply(lambda column: random.randint(0, 1))
Under the notebook folder in the notebooks/stockpred_randomizer.ipynb file, you can follow along with the integration of the preceding code excerpt in our recently created data science workbench. We will proceed as follows:
You can see on the left pane of your notebook environment that a new folder was created alongside your files to store your models. This folder will store the Conda environment and the pickled/binarized Python function of your model, as illustrated in the following screenshot:
Figure 3.14 demonstrates the creation of a random input pandas DataFrame and the use of loaded_model to predict over the input vector. We will run the experiment with the name stockpred_experiment_days_up, logging as a metric the number of days on which the market was up for each model run, as follows:
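Independent of the MLflow and pandas machinery, the Days Up metric is simply the sum of the model's binary predictions: every 1 counts as a day on which the market was predicted to be up. A minimal stand-alone sketch, with a fixed seed so the run is reproducible:

```python
import random

random.seed(42)  # fixed seed so this sketch is reproducible

# Simulate the random model's 0/1 output over 30 trading days.
predictions = [random.randint(0, 1) for _ in range(30)]

# The value logged to MLflow as the Days Up metric.
days_up = sum(predictions)
print(days_up)
```

In the workbench, this value would be recorded with mlflow.log_metric inside the run, so each run's Days Up figure appears in the Metrics column of the UI for comparison.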
To check the last runs of the experiment, you can look at http://localhost:5000 and check that the new experiment was created, as illustrated in the following screenshot:
You can now compare multiple runs of our algorithm and see differences in the Days Up metric, as illustrated in the following screenshot. You can then delve deeper into any run that you would like more details about:
In Figure 3.16, you can clearly see the logged details of our run—namely, the artifact model and the Days Up metric.
In order to tear down the environment properly, you must run the following command in the same folder:
make down
In this chapter, we introduced the concept of a data science workbench and explored some of the motivation behind adopting this tool as a way to accelerate our machine learning engineering practice.
We designed a data science workbench, using MLflow and adjacent technologies based on our requirements. We detailed the steps to set up your development environment with MLflow and illustrated how to use it with existing code. In later sections, we explored the workbench and added to it our stock-trading algorithm developed in the last chapter.
In the next chapter, we will focus on experimentation to improve our models with MLflow, using the workbench developed in this chapter.
In order to further your knowledge, you can consult the documentation in the following links: