When you implement a data science project, you should consider that your model could age due to various factors, such as concept drift or data drift. For this reason, your project will probably need constant updates.
In the previous chapter, you learned the fundamental principles related to DevOps, which allow you to move the project from the test phase to the production phase. However, this is not enough to deal with constant updates efficiently. In fact, you may need to develop an automatic or semi-automatic procedure that allows you to pass from the build/test phase to the production phase easily without too many manual interventions.
In this chapter, you will review the basic concepts behind Continuous Integration and Continuous Delivery (CI/CD), two strategies that permit you to easily and automatically update your code and move from building/testing to production efficiently. You will also learn how you can implement CI/CD using GitLab, a very popular platform for software management. Then, you will configure GitLab to work with Comet. Finally, you will see a practical example that will help you get familiar with the described concepts.
The chapter is organized as follows:
Before moving on to the first step, let’s install the software needed to run the code implemented in this chapter.
The examples described in this chapter use the following software/tools:
We will use Python 3.8, which you can download from the official website at https://www.python.org/downloads/.
The examples described in this chapter use the following Python packages:
We described the first two packages and how to install them in Chapter 1, An Overview of Comet. Please refer back to that for further details on installation.
A Git client is a command-line tool that allows you to communicate with a Source Control System (SCS). In this chapter, you will use the Git client to interact with GitLab. You can install it as follows:
For macOS users, open a terminal and run the following commands:
xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
For more details, you can read the Homebrew official documentation, available at the following link: https://brew.sh/.
brew install git
For Ubuntu Linux users, you can open a terminal and run the following commands:
sudo apt-add-repository ppa:git-core/ppa
sudo apt-get update
sudo apt-get install git
For Windows users, you can download it from the following link: https://git-scm.com/download/win. Once you have downloaded it, you can install it by following the guided procedure.
For all users, you can verify whether Git works by running the following command in a terminal:
git --version
For more details on how to install Git, you can read the Git official documentation, available at the following link: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git.
Now that you have installed all of the software and tools needed to run the examples described in this chapter, we can move on to the first topic, which is introducing the concept of CI/CD.
DevOps best practices permit us to build a pipeline that connects the development phase with the operations phase through different steps. In the previous chapter, you learned how to deploy your Data Science project for the first time. You implemented all the steps manually by building a Docker image and then deploying it in Kubernetes. However, this manual procedure does not scale if you update your software daily.
To automate the integration between the development and operation phases, we should introduce two new concepts, which are CI/CD and SCS.
This section is organized as follows:
Let’s start from the first point, which is an overview of CI/CD.
When using software in production, whether a generic app or a machine learning prediction service, you (or other users) may find bugs in the code or may want to add new features. For this reason, you should be able to define an automatic procedure that permits you to update your software continuously. In this case, you could use the CI/CD strategy.
CI involves the automation of the building and testing phases every time there is a change in the code. CD deploys the software built during the CI phase to a production-like environment. Usually, this phase requires a manual approval step to move the software to production. A more advanced mechanism involves Continuous Deployment, which transforms the manual approval step into an automatic process.
To perform CI/CD, you need an SCS that keeps track of all software changes and makes the communication between the Development and Operation phases possible, as shown in the following figure:
Figure 7.1 – The role of an SCS in DevOps
The preceding figure shows the DevOps life cycle, which was already described in Chapter 6, Integrating Comet into DevOps, where the releasing step has been substituted by the SCS. In practice, the SCS is a repository that stores the software and keeps track of its updates. The development team stores the software in an SCS and updates it regularly. The operations team downloads the software from the SCS, and when they approve it, the software moves to production. Thanks to the presence of the SCS, the CI/CD procedure is automatic.
Now that you are familiar with the concept of CI/CD, we can investigate the concept of an SCS in more detail and how it works.
An SCS, also known as a Version Control System (VCS), enables you to store your code, keep track of the code history, merge code changes, and return to a previous code version if needed. In addition, an SCS permits you to share your code with your team and work on your code locally until it is ready. Since an SCS is a centralized source for your code, you can use it to easily build the DevOps life cycle.
The following figure shows the basic architecture of an SCS:
Figure 7.2 – The basic architecture of an SCS
An SCS is composed of a central server repository, which stores the code and keeps track of all of its history, as follows:
Before making any change to your code, you should always bring your working copy up to date: from the server repository through the PULL command, and from your local repository through the UPDATE command.
The described mechanism works well when multiple developers are working together because each developer works on their local machine, and when the code is ready, they save the changes to the server repository. Through an SCS, you can push your commits to other developers as well as pull their commits to your local repository.
Typically, when you want to propose some changes to your code, you do not push them to the main repository. Instead, you create a new branch that contains your changes. A branch is a version of the repository diverging from the main project because it contains some proposals for changes. Using a branch permits you to work on your code independently from its stable version available in the main repository. The operation of combining a branch with the main repository is called merging.
Now that you have learned the main concepts behind an SCS, we can move on to the next point, which is the CI/CD workflow.
The following figure shows the typical CI/CD workflow:
Figure 7.3 – A typical CI/CD workflow
Let’s suppose that you want to make some changes to your code that is already hosted on an SCS. You should do the following:
Many platforms implement the CI/CD workflow. In the next section, we will review GitLab, one of the most popular platforms in the industry.
GitLab is an SCS that permits you to store your code through a versioning system. In addition, it provides you with all of the tools to implement a CI/CD workflow. The GitLab platform is available at the following link: https://gitlab.com/. You can get started with GitLab by creating an account at the following link: https://gitlab.com/users/sign_up.
The section is organized as follows:
Let’s start with the first point, which is creating/modifying a GitLab project.
Basic operations on a GitLab project include the following steps:
Let’s investigate each step separately, starting from the first one: creating a new project.
You can create a new project in GitLab as follows:
Figure 7.4 – The address to copy to clone a GitLab project in your local file system
The address starts with https://gitlab.com/ and contains the path to your project group.
git clone <PASTE_HERE_YOUR_COPIED_ADDRESS>
At this point, you have a working copy of your empty GitLab project in your local filesystem. You can start editing it.
if __name__ == "__main__":
    print('Hello World!')
git add helloWorld.py
We use the add command to add a new file to the working copy.
git commit -m "my message"
git push origin main
The keyword main indicates that you are saving the changes to the main branch. The git command might prompt you to set the upstream, which is the default remote branch for the current local branch. You can set the upstream branch directly through the -u option as follows:
git push -u origin main
So far, you have worked with the main branch. However, when you want to add new features to a stable project, it is better to work on a separate branch. So, let’s see how to create a new branch.
Let’s suppose that now you want to change the original text printed by the script as follows:
if __name__ == "__main__":
    print('Hi!')
Now you want to save the changes to a new branch of the repository. You can proceed as follows:
git checkout -b new_branch
You can use the checkout keyword to create a new branch and the -b option to set the branch name.
git add helloWorld.py
git commit -m "modified greeting string"
git push origin new_branch
If you access the GitLab dashboard, you can see that the system has created a new branch as shown in the following figure:
Figure 7.5 – The new branch created in GitLab after the git push command
The created branch remains pending until you create a merge request. Let’s investigate how to do that.
Figure 7.5 shows that the GitLab platform proposes to create a merge request of the branch (the green rectangle at the top of the figure, with the blue button). The following are the steps to create a merge request:
Figure 7.6 – Approving a merge request
Once the merge request is approved, all of the changes you performed in the branch are moved to the main branch.
Now that you have learned the basic concepts of creating and modifying a project in GitLab, we can move on to the next step, which is exploring GitLab's internal structure.
To store the content of a project, GitLab uses the following three main concepts:
Let’s investigate each concept separately.
A blob (short for Binary Large Object) represents the content of a file, without its metadata or file name. Each blob is identified by the SHA-1 hash computed over its content, so different files with the same content share the same blob. Git (and therefore GitLab) stores all of the blobs in a local directory called .git/objects within the project. Whenever you add a new file to the repository through the git add command, a new blob object is added to the .git/objects directory, provided that the same content is not already stored there.
Blobs are organized in trees, which represent directories. Every time you run a git commit command, GitLab creates a new tree in the .git/objects directory. The tree objects contain the following:
The following piece of code shows an example of tree content:
100644 blob ce013625030ba8dba906f756967f9e9ca394464a
file1.txt
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
file2.txt
The preceding tree contains two blobs named file1.txt and file2.txt.
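To make the relationship between content and blob ID concrete, the identifiers in the tree listing above can be recomputed with a few lines of Python: a Git object ID is the SHA-1 of a small header (the object type and size) followed by the raw content. The two hashes shown happen to be the well-known IDs of a file containing hello\n and of an empty file:

```python
import hashlib

def git_object_id(content, obj_type="blob"):
    """Compute a Git object ID: SHA-1 over '<type> <size>\\0<content>'."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# The two blob IDs from the tree listing above:
print(git_object_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
print(git_object_id(b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Running git hash-object on the same files produces the same identifiers, which is why identical content is stored only once.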
When you run the git commit command, GitLab also adds a commit object in the .git/objects directory. The commit object points to the tree object, as shown in the following piece of code:
tree dca98923d43cd634f4359f8a1f897bf585100cfe
author Author Name <author's email address> 1656927878 +0200
committer Author Name <author's email address> 1656927878 +0200
My commit message
You can see the reference to the tree, the author who performed the commit, and the commit message (My commit message in the example).
If you access the .git/objects directory, you can check the type of the object as follows:
git cat-file -t 7fdc
The output is one of the three object types: blob, tree, or commit. Here, 7fdc is an abbreviation of an object's hash.
git cat-file -p 7fdc
The -p argument prints the content of the object.
Now that you have learned the GitLab internal structure, we can move on to the next point, which is exploring GitLab concepts for CI/CD.
GitLab defines the following concepts, which implement the CI/CD workflow:
To configure the CI/CD workflow in GitLab, you should add .gitlab-ci.yml to your project. The presence of a .gitlab-ci.yml file in your repository automatically triggers a GitLab runner that executes the scripts defined in the jobs.
To define the stages of your CI/CD workflow, you can use the stages keyword in the .gitlab-ci.yml file, as shown in the following piece of code:
stages:
  - build
  - test
  - deploy
In the preceding example, we defined three stages: build, test, and deploy. These stages are executed in sequential order, starting from the first stage to the last one.
To define a job, you can use a generic name, followed by the stage to which the job belongs, and the scripts to run, as shown in the following example:
my-job:
  stage: build
  script:
    - echo "Hello World!"
In the preceding example, we defined a job named my-job that belongs to the build stage and simply prints to screen the sentence "Hello World!".
By default, a GitLab runner runs all of the jobs in the same stage concurrently. This process is also called a basic pipeline. However, you can customize the order of jobs, regardless of the stage they belong to, by using the needs keyword within the job definition. For example, if you want to specify that job2 must be run after job1, you can write the following code:
job1:
  stage: build
  script:
    - echo "Hello World from Job1!"

job2:
  needs:
    - job1
  stage: build
  script:
    - echo "Hello World from Job2!"
We use the needs keyword for job2 to specify its dependence on job1.
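The effect of needs is to turn the pipeline into a directed acyclic graph of jobs: a job starts only when everything it needs has finished. The following toy Python scheduler (an illustration of the ordering rule, not GitLab code) sketches this behavior:

```python
def execution_order(jobs):
    """Toy topological ordering: each job runs only after all the jobs
    listed in its 'needs' have already run."""
    order, done = [], set()
    while len(order) < len(jobs):
        ready = [job for job, needs in jobs.items()
                 if job not in done and all(n in done for n in needs)]
        if not ready:
            raise ValueError("cyclic 'needs' dependencies")
        for job in sorted(ready):  # sorted only for a deterministic output
            order.append(job)
            done.add(job)
    return order

# Mirrors the .gitlab-ci.yml example above: job2 needs job1.
print(execution_order({"job1": [], "job2": ["job1"]}))  # ['job1', 'job2']
```

Without the needs entry, both jobs would be ready immediately and could run concurrently, which is exactly the default behavior within a stage.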
Now that you have learned the basic GitLab concepts for CI/CD, we can implement a practical example.
Let’s suppose that now you want to implement a CI/CD workflow that is triggered every time you push a change to a branch of your repository. The idea is to implement the CI/CD workflow illustrated in Figure 7.3.
As a use case, we can extend the example described in the Creating/modifying a GitLab project section by adding a new script to the test_models repository. The script simply builds a linear regression model on the well-known diabetes dataset and calculates the mean squared error metric.
To configure the CI/CD workflow for this project, you should perform the following steps:
Let’s start with the first step, which is writing the main script.
The script named linear_regression.py loads the well-known diabetes dataset provided by the scikit-learn library, trains it with a training set, and then calculates the Mean Squared Error (MSE) metric.
The following piece of code shows the code that implements the preceding steps:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test,y_pred)
print(f"MSE: {mse}")
The script simply prints the MSE value. Since the script uses the scikit-learn library, we also include a requirements.txt file in the project repository. This file simply contains the scikit-learn library name.
Now you can perform the following steps:
git add linear_regression.py
git add requirements.txt
git commit -m "added linear regression"
git push origin main
Once the push operation terminates, you should be able to see the files on your GitLab dashboard.
Now that you have written the main script, we can implement the CI/CD workflow by following the next step, which is configuring a runner.
First, we need to set up a runner, which is a process that executes jobs. In this example, we will use the shared runners provided by GitLab. However, you can implement your own custom runner as described in the GitLab official documentation, available at the following link: https://docs.gitlab.com/runner/.
To configure a runner, you can proceed as follows:
Figure 7.7 – How to enable GitLab runners
Now that you have configured the runners, you can configure the .gitlab-ci.yml file.
We need to configure .gitlab-ci.yml as follows:
image: python:3.8
Since our script is written in Python, we only need to specify a Python base image.
stages:
  - run

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
$CI_PROJECT_DIR contains the current directory of the project.
cache:
  paths:
    - .cache/pip
    - venv/
The cache keyword tells the pipeline what to reuse between runs, and the paths keyword lists the files and directories to cache.
before_script:
  - pip install virtualenv
  - virtualenv venv
  - source venv/bin/activate
  - pip install -r requirements.txt
run-job:
  stage: run
  script:
    - python linear_regression.py
The job simply runs the linear_regression.py script.
If everything is okay, you should see an output similar to that shown in the following figure:
Figure 7.8 – The output of a successful pipeline
If you click on the View pipeline button, you can view all of the details related to the built pipeline, as shown in the following figure:
Figure 7.9 – Details related to a successful pipeline
At the bottom part of the screen, you can see the jobs contained in the pipeline. In our case, there is just one job named run-job. You can click on the run-job button to see the output of the console, as shown in the following figure:
Figure 7.10 – The output of run-job
You can clearly see the output of the python linear_regression.py command, which shows the calculated MSE.
Note
It is possible that your pipeline fails. In fact, the first time you run a CI/CD pipeline in GitLab, you need to validate your account by providing a valid credit card. Once you fill in the required information, you should be able to run the CI/CD pipeline.
Now that you have learned how to configure the .gitlab-ci.yml file, we can move on to the next step, which is creating a merge request.
This step involves modifying the linear_regression.py script, saving the changes to a new branch, and creating a merge request.
from math import sqrt
# train the model and calculate MSE
rmse = sqrt(mse)
print(f"RMSE: {rmse}")
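As a side note, the RMSE added above is simply the square root of the MSE the script already computes. A minimal, dependency-free sketch of the relationship (the numbers below are illustrative and have nothing to do with the diabetes dataset):

```python
from math import sqrt

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]

error = mse(y_true, y_pred)   # (0.25 + 0 + 2.25) / 3 = 0.8333...
print(f"MSE: {error}")
print(f"RMSE: {sqrt(error)}")  # RMSE is the square root of the MSE
```

The RMSE is often preferred in reports because it is expressed in the same unit as the target variable.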
git checkout -b test
git commit -m "modified linear regression"
git push origin test
You can see that we have pushed changes to the test branch.
Figure 7.11 – The message appearing after pushing to a new branch
You can click on the Create merge request button to trigger a merge request. This operation triggers the CI/CD pipeline.
Figure 7.12 – An overview of the merge request
You can see details on the pipeline, and you can decide to approve and merge the test branch with the main branch.
Now that you have learned how to build the CI/CD pipeline, we can move to the last step, which is creating a release.
When your project is ready to be published and downloaded by other people, you can create a release. A release is a snapshot of your project, which also includes the installation packages and the release notes. To create a release, your project must have at least one tag, which can be, for example, the version of your project.
A release includes archives of your source code in four different formats: .zip, .tar.gz, .tar.bz2, and .tar. The release also includes a JSON file that describes the release content, as shown in the following example:
{
  "release": {
    "id": 5269390,
    "name": "Test Models V1",
    "project": {
      "id": 35917223,
      "name": "test_models",
      "created_at": "2022-05-05T16:33:04.210Z",
      "description": ""
    },
    "tag_name": "v1",
    "created_at": "2022-07-04T19:45:04.219Z",
    "milestones": [],
    "description": "The first release of test models"
  }
}
The JSON file includes metadata related to the name of the release, the release date, a description, and so on.
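Because the release descriptor is plain JSON, any consumer can parse it programmatically. As a quick sketch, using a trimmed copy of the example above:

```python
import json

# A trimmed copy of the release descriptor shown above.
release_doc = """
{"release": {
    "name": "Test Models V1",
    "project": {"id": 35917223, "name": "test_models"},
    "tag_name": "v1",
    "created_at": "2022-07-04T19:45:04.219Z",
    "milestones": [],
    "description": "The first release of test models"
}}
"""

release = json.loads(release_doc)["release"]
print(release["name"], release["tag_name"])  # Test Models V1 v1
```

A script like this could, for example, check that the tag follows your versioning convention before publishing the release notes.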
GitLab provides different ways to create a release. In this section, we will describe the following two ways to create a release:
release_job:
  stage: release
  image: registry.gitlab.com/gitlab-org/release-cli:latest
  rules:
    - if: $CI_COMMIT_TAG
  release:
    tag_name: '$CI_COMMIT_TAG'
In this example, you build a release only if the commit operation includes a tag. You should also make sure that your release job has access to the release-cli, so you should add it to the PATH or use the image provided by the GitLab registry, as shown in the preceding example.
To add a tag to your commit operation, you can proceed as follows:
git tag my_tag
git push origin my_tag
For more details on the other ways to create a release, you can refer to the GitLab official documentation, available at the following link: https://docs.gitlab.com/ee/user/project/releases/.
Now that you have learned the basic concepts behind GitLab, we can investigate how to integrate Comet with GitLab.
Thanks to a collaboration between Comet and GitLab, Comet experiments are fully integrated with GitLab. You can integrate Comet and GitLab in two ways as follows:
Let’s investigate the two ways separately, starting with the first one.
The following figure shows how Comet can be integrated with the CI/CD pipeline:
Figure 7.13 – Integration of Comet in the CI/CD workflow
Let’s suppose that you have changed your code to support Comet experiments. If your code is written in Python, then you have imported the comet_ml library and used it to track your experiments. You can start the CI/CD workflow by creating a new branch for your project. As usual, you push code changes, and you build and run the code. This process also triggers a connection with the Comet platform. Then, the CI/CD workflow continues as described in the preceding section. When you review and approve the code, you should also consider the output of the Comet experiments to make sure that the proposed changes improve the model.
To make the integration between Comet and GitLab work, you should configure the Comet secrets in the GitLab project. You can proceed as follows:
Figure 7.14 – The configured variables in GitLab
We have configured both variables as protected. This means that only protected branches can access them. By default, the main branch is protected. You can set a branch as protected in the Protected Branches menu, which is available by following this path from the project menu: Settings → Repository → Protected Branches → Expand. When you want to protect a branch, you should select who can modify the branch. For example, you can choose the maintainers, as shown in the following figure:
Figure 7.15 – How to protect a branch in GitLab
Now that you have learned how to run Comet in a CI/CD pipeline, we can move on to the next step, which is using webhooks.
A webhook is an automated message sent from an application to notify someone that something happened. A webhook carries a payload (that is, a message) and is sent to a unique URL. You can configure Comet to work with webhooks to notify an external application that there was a change in a registered model.
The following figure shows a possible integration of Comet webhooks with the CI/CD workflow:
Figure 7.16 – A possible integration of Comet webhooks with the CI/CD pipeline
Let’s suppose that you have configured a webhook that notifies an external service whenever a change occurs in a model registered in the Comet Model Registry. When the external service receives a notification, it could notify the operations team to update the code, or it could trigger an automatic downloading of the new version of the model from Comet. In both cases, the produced action should update the code stored in the GitLab repository. Thus, you should create a new branch and follow the usual CI/CD workflow described in the preceding section.
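To make this concrete, the external service on the receiving end is just an HTTP endpoint that accepts a POST request with a JSON body. The following is a minimal sketch of such a receiver built with the Python standard library only; the modelName field mirrors the webhook payload configured later in this section, and the port choice is arbitrary (port 0 asks the operating system for a free one):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON payload sent with the notification.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # A real service would check the Authorization header here and,
        # for example, open a new branch or download the new model version.
        print(f"Webhook received for model: {payload.get('modelName')}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the console quiet during requests

# Bind to an ephemeral port and handle requests in a background thread.
server = HTTPServer(("127.0.0.1", 0), WebhookHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"Listening on port {server.server_address[1]}")
```

In production you would run such a receiver behind HTTPS and validate the secret token before acting on the notification.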
Comet webhooks are part of the Comet REST APIs that you learned in Chapter 6, Integrating Comet into DevOps.
To configure a webhook in Comet, proceed as follows:
https://www.comet.ml/api/rest/v2/webhooks/config
payload = {
    "workspaceName": <YOUR_COMET_WORKSPACE>,
    "modelName": <YOUR MODEL NAME>,
    "webhookUrls": [{
        "url": <URL to the external service>,
        "header": {
            "Authorization": <secret token>,
            "Other": "other_info"
        }
    }]
}
You should define the workspace name and the name of the model in the Comet Model Registry whose changes should trigger the notification. Under the webhookUrls keyword, you can specify as many external services as you want. For each service, you should specify the secret token used to access it.
import requests

url = "https://www.comet.ml/api/rest/v2/webhooks/config"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"{<YOUR_COMET_API_KEY>}"
}
response = requests.post(url, headers=headers, json=payload)
Once you have configured the Comet webhook, every time there is a change in the stage of your model, a notification will be sent to your external service.
Now that you have learned how to integrate Comet with GitLab, let’s move on to a practical example that shows how to integrate Docker with the CI/CD workflow.
Let’s see how to integrate Docker with the GitLab CI/CD workflow through a practical example. As a use case, let’s use the example implemented in Chapter 6, Integrating Comet into DevOps, in the Implementing Docker section. The example built an application that tested four different classification models (random forest, decision tree, Gaussian Naive Bayes, and k-nearest neighbors), which classified the cut of diamonds into two categories (Gold and Silver) based on some input parameters. In addition, the application logged all of the models in Comet.
You can download the full code of the example from the GitHub repository of the book available at the following link: https://github.com/PacktPublishing/Comet-for-Data-Science/tree/main/07.
In that example, we built a Docker image with the code and we ran it. In this section, the idea is to wrap that example in a CI/CD workflow and run it in GitLab. In other words, the idea is to build and run the Docker image automatically within the CI/CD workflow, without writing all the commands manually. To do so, proceed with the following steps:
git add <file_name>
The copied files include the following:
git commit -m "initial import"
git push origin main
image: docker:latest

services:
  - docker:dind

stages:
  - build
  - run
During the build stage, we will build the Docker image, and during the run stage, we will wrap it in a container, and we will run the code in the container.
variables:
  CONTAINER_TEST_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG
  CONTAINER_RELEASE_IMAGE: $CI_REGISTRY_IMAGE:latest
CONTAINER_TEST_IMAGE contains the name of the produced image during the build stage. We build it from some system variables available in GitLab. For more details on these system variables, you can check the GitLab official documentation available at the following link: https://docs.gitlab.com/ee/ci/variables/predefined_variables.html.
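As an aside, $CI_COMMIT_REF_SLUG is roughly the branch name made safe for use in image tags and URLs. The following sketch approximates how GitLab derives it (lowercase, non-alphanumeric characters replaced with -, truncated to 63 characters, dashes trimmed from the ends); it is an approximation for illustration, not GitLab's exact implementation:

```python
import re

def ref_slug(ref):
    """Approximate GitLab's CI_COMMIT_REF_SLUG computation."""
    slug = re.sub(r"[^a-z0-9]", "-", ref.lower())[:63]
    return slug.strip("-")  # a slug never starts or ends with '-'

print(ref_slug("feature/New_Model"))  # feature-new-model
print(ref_slug("main"))               # main
```

This is why pushing a branch named feature/New_Model produces an image tagged feature-new-model rather than the raw branch name.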
before_script:
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
We use the docker login command described in Chapter 6, Integrating Comet into DevOps; the difference is that now we log in to the GitLab registry, using the username and password provided by GitLab's predefined variables.
job-build:
  stage: build
  script:
    - docker build --pull -t $CONTAINER_TEST_IMAGE .
    - docker push $CONTAINER_TEST_IMAGE
Simply, the job builds the Docker image and saves it to the GitLab registry.
job-run:
  stage: run
  script:
    - docker pull $CONTAINER_TEST_IMAGE
    - docker run --rm -e COMET_API_KEY -e COMET_PROJECT -e COMET_WORKSPACE $CONTAINER_TEST_IMAGE
The job pulls the image from the registry and runs it in a container. As for the run command defined in Chapter 6, Integrating Comet into DevOps, we pass each environment variable separately.
If everything works as expected, you should be able to see your results in Comet. In addition, you can access the produced image by selecting your project, then Packages & Registries | Container Registry, as shown in the following figure:
Figure 7.17 – The Container Registry in GitLab with the saved image
Now that the CI/CD pipeline is configured, every time you make a change in your code and you push it to GitLab, your code is automatically deployed as a Docker container.
You have just learned how to build and run a CI/CD workflow in GitLab!
Throughout this chapter, you have reviewed some advanced concepts related to DevOps, with a focus on the CI/CD workflow. You have also learned some basic concepts related to the GitLab platform and how to use them to implement the CI/CD workflow. Then, you have learned how to integrate Comet with GitLab. Finally, you have implemented a practical example that wraps the Docker building and running processes in the CI/CD pipeline. The CI/CD workflow is very important when deploying an application because it permits you to make the software releasing process easy, quick, and automatic.
In the next chapter, we will review some basic concepts related to machine learning in general, and how to use Comet to build and run a complete machine learning example.