Chapter 16
Amazon SageMaker

WHAT'S IN THIS CHAPTER

  • Introduction to the Amazon SageMaker service
  • Create an Amazon SageMaker notebook instance
  • Train a Scikit-learn model locally on the notebook instance
  • Train a Scikit-learn model on a dedicated instance
  • Use Amazon SageMaker's built-in algorithms
  • Create prediction endpoints

Amazon SageMaker is a fully managed web service that provides the ability to explore data, engineer features, and train machine learning models on AWS cloud infrastructure using Python code. Once you have trained your model, you can deploy the model to a cluster of dedicated compute instances and use the deployed model to get predictions one item at a time or create batch predictions on entire datasets.

Amazon SageMaker provides implementations of a number of cloud-optimized versions of machine learning algorithms such as XGBoost, factorization machines, and PCA (Principal Component Analysis), as well as the ability to create your own algorithms based on popular frameworks such as Scikit-learn, Google TensorFlow, and Apache MXNet. In this chapter, you will learn to use Amazon SageMaker to train and deploy machine learning models based on Scikit-learn and built-in algorithms.

  • NOTE   Amazon SageMaker is only free for the first two months after you have signed up for a free-tier AWS account. After that period, you will be charged for compute resources.

Key Concepts

In this section, you learn some of the key concepts you will encounter while working with Amazon SageMaker.

Programming Model

You can interact with Amazon SageMaker using one of the following methods:

  • The Amazon SageMaker SDK for Python: This is a Python SDK that provides a selection of classes that can be used to interact with Amazon SageMaker for common tasks such as creating training jobs and deploying models to prediction instances. The Python SDK is object-oriented and provides a high-level API to Amazon SageMaker. By virtue of being a high-level API, the classes in this SDK provide a convenient mechanism to control Amazon SageMaker with fewer options than you would have were you to use the AWS boto3 SDK. This SDK is commonly used from Jupyter Notebooks to interact with AWS SageMaker.
  • The AWS boto3 SDK: This is a Python SDK that provides a mechanism to interact with several popular AWS services, not just Amazon SageMaker. The AWS boto3 SDK provides a lower-level interface to Amazon SageMaker and can be used to access Amazon SageMaker from an AWS Lambda function or from a Jupyter Notebook. A short sketch contrasting the two Python SDKs follows this list.
  • Language-specific SDK: Amazon provides a number of language-specific SDKs for programming languages such as Ruby and Java that can be used to interact with Amazon SageMaker.
  • AWS CLI: You can use the AWS command-line interface to interact with Amazon SageMaker over the command line.
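
The following minimal sketch (not from this chapter) contrasts the two Python entry points; the use of the default session bucket and the list_training_jobs call are purely illustrative choices:

import boto3
import sagemaker

# High-level SDK: a Session object wraps the lower-level clients and provides
# conveniences such as resolving a default Amazon S3 bucket for the account.
sagemaker_session = sagemaker.Session()
print(sagemaker_session.default_bucket())

# Low-level SDK: the boto3 client exposes the raw Amazon SageMaker API operations.
sm_client = boto3.client('sagemaker')
response = sm_client.list_training_jobs(MaxResults=10)
for job in response['TrainingJobSummaries']:
    print(job['TrainingJobName'], job['TrainingJobStatus'])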

Amazon SageMaker Notebook Instances

Amazon SageMaker allows you to launch EC2 instances in your account that come preconfigured with a Jupyter Notebook server; a number of common Python libraries such as the Amazon SageMaker SDK, the boto3 SDK, NumPy, Pandas, Scikit-learn, and Matplotlib; and a number of different preconfigured conda kernels. These EC2 instances are referred to as Amazon SageMaker notebook instances, and the Amazon SageMaker management console provides access to the Jupyter Notebook server running on the instance. Any files that you create on your notebook instance are stored in an Elastic Block Store (EBS) storage volume that is automatically provisioned when the EC2 instance is created.

These notebook instances allow you to perform common data science tasks such as exploration and feature engineering with NumPy, Pandas, and Matplotlib. You can also use Amazon SageMaker notebook instances to programmatically create training jobs, deploy models into production, and validate the deployed models.

Training Jobs

In order to train your model on Amazon SageMaker, you need to create a training job. Training jobs create dedicated compute instances in the AWS cloud that contain model-building code, load your training data from Amazon S3, execute the model-building code on your training data, and save the trained model to Amazon S3. When the training job is complete, the compute instances that were provisioned to support the training process will automatically be terminated. You will be billed for the compute costs required to train your model. Training jobs are usually created by using the high-level Amazon SageMaker SDK from a Jupyter Notebook file that is running on a notebook instance.

The compute instances that will be used to train your model are created from Docker images. Amazon SageMaker requires that model-building code and any runtime libraries are encapsulated in a Docker image with a specific filesystem structure. The compute capabilities of the instances used to train your model depend on the CPU, RAM, and GPUs allocated to the instance. Amazon SageMaker allows you to select from a number of different configurations. These configurations are called instance types and you can specify the instance type when creating the training job with the Amazon SageMaker SDK. You can get a list of available training instance types and their respective compute capabilities at https://aws.amazon.com/sagemaker/pricing/instance-types/.
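
As a concrete illustration of these moving parts, the following sketch (not taken from this chapter) creates a training job with the generic Estimator class from the Amazon SageMaker SDK; the image URI, IAM role, and bucket paths are hypothetical placeholders:

import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name='<algorithm-docker-image-uri>',      # Docker image containing the training code
    role='<execution-role-arn>',                    # IAM role assumed by the training instance
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',             # instance type chosen from the pricing page
    output_path='s3://<your-bucket>/output/',       # where the trained model is saved
    sagemaker_session=sagemaker.Session())

# Start the training job; the 'train' channel points to the training data in Amazon S3.
estimator.fit({'train': 's3://<your-bucket>/train/'})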

Prediction Instances

After your model has been trained using a training job, you will need to deploy it into production in order to use it. Deploying a model into production involves creating one or more compute instances, deploying your model onto these instances, and providing an API that can be used to make predictions using the deployed model. Unlike instances used to train your model, instances used to support predictions are not automatically terminated and you will be billed for the time that they are active. You can deploy your model using the Amazon SageMaker management console as well as the Amazon SageMaker SDK for Python. When you deploy your model, you will be able to specify the number and type of compute instances that you wish to provision. You can get a list of available prediction instance types and their respective compute capabilities at https://aws.amazon.com/sagemaker/pricing/instance-types/.

Prediction Endpoint and Endpoint Configuration

A prediction endpoint is an HTTPS REST API endpoint that can be used to get single predictions from a deployed model. The endpoint is secured using AWS Signature V4 authentication. An endpoint configuration ties together information on the location of a trained machine learning model, the type of compute instances, and the auto-scaling policy associated with the prediction endpoint. You need to create an endpoint configuration first and then use that endpoint configuration to deploy the model to a prediction endpoint. You cannot change an endpoint configuration while it is associated with an active prediction endpoint. Endpoint configurations and prediction endpoints can be created using both the Amazon SageMaker management console and the Amazon SageMaker SDK from a notebook instance.
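
To make this relationship concrete, the following hedged sketch uses the lower-level AWS boto3 SDK; the model name, configuration name, endpoint name, and instance settings are hypothetical and assume a model has already been created in your account:

import boto3

sm_client = boto3.client('sagemaker')

# 1. Create an endpoint configuration that references an existing model.
sm_client.create_endpoint_config(
    EndpointConfigName='my-endpoint-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-trained-model',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m4.xlarge'
    }])

# 2. Create the prediction endpoint from the configuration.
sm_client.create_endpoint(
    EndpointName='my-endpoint',
    EndpointConfigName='my-endpoint-config')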

Amazon SageMaker Batch Transform

A prediction endpoint will only provide the ability to make predictions on one observation at a time. If you need to make predictions on an entire dataset, you can use Amazon SageMaker Batch Transform to create a batch prediction job from a trained model. You can learn more about creating batch predictions with Amazon SageMaker at https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-batch-transform.html.
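
As a minimal illustration, the following sketch (not from this chapter) assumes you already have a trained estimator object named estimator and uses hypothetical S3 locations to create a batch transform job with the high-level SDK:

# Create a transformer from a previously trained estimator (assumption).
transformer = estimator.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://<your-results-bucket>/batch-output/')

# Run predictions on every line of a CSV file stored in Amazon S3.
transformer.transform(
    data='s3://<your-source-bucket>/iris_test.csv',
    content_type='text/csv',
    split_type='Line')

# Wait for the batch transform job to complete; the instances are then released.
transformer.wait()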

Data Channels

Training and validating a machine learning model require different sets of data. These are commonly referred to as the training set, the test set, and the validation set. In Amazon SageMaker, these different sets of data are referred to as channels.
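
With the high-level SDK, channel names are simply the keys of the dictionary passed to an estimator's fit() method. In the following sketch, the estimator object and bucket paths are assumptions:

# Each dictionary key becomes a data channel on the training instance and is
# exposed to training code as an SM_CHANNEL_<NAME> environment variable.
estimator.fit({
    'train': 's3://<your-bucket>/train/',
    'validation': 's3://<your-bucket>/validation/'
})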

Data Sources and Formats

Amazon SageMaker requires that training, test, and validation data is stored in Amazon S3 buckets. The format of the data can be either CSV files or protobuf recordIO files. The latter is the preferred format.
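
The following hedged sketch shows one way to convert a NumPy dataset to the protobuf recordIO format with a utility from the Amazon SageMaker SDK and upload it to Amazon S3; the arrays and the bucket/key names are hypothetical:

import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Hypothetical feature matrix and label vector.
X = np.random.rand(100, 4).astype('float32')
y = np.random.randint(0, 3, 100).astype('float32')

# Serialize the arrays to the protobuf recordIO format in memory.
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)

# Upload the serialized data to an Amazon S3 bucket.
boto3.resource('s3').Bucket('<your-bucket>').Object('train/data.recordio').upload_fileobj(buf)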

Built-in Algorithms

Amazon SageMaker includes cloud-optimized implementations of a number of popular machine learning algorithms. If you are working in a notebook instance, you can train models using these algorithms locally on the notebook instance using the high-level Amazon SageMaker Python library. If you want to train your models on a cluster of dedicated EC2 instances, you can create these instances from Python code in your notebook file using the Amazon SageMaker Python library and algorithm-specific Docker images provided by Amazon. At the time this chapter was written, Amazon SageMaker provided implementations of the following algorithms:

  • BlazingText Algorithm
  • DeepAR Forecasting Algorithm
  • Factorization Machines Algorithm
  • Image Classification Algorithm (based on the ResNet deep learning network)
  • IP Insights Algorithm
  • K-Means Algorithm
  • K-Nearest Neighbors Algorithm
  • LDA (Latent Dirichlet Allocation) Algorithm
  • Linear Learner Algorithm
  • NTM (Neural Topic Model) Algorithm
  • Object2Vec Algorithm
  • Object Detection Algorithm (based on VGG and ResNet deep-learning networks)
  • PCA (Principal Component Analysis) Algorithm
  • RCF (Random Cut Forest) Algorithm
  • Semantic Segmentation Algorithm
  • Sequence-to-Sequence Algorithm
  • XGBoost Algorithm

Amazon SageMaker notebook instances also include sample Jupyter Notebook files that demonstrate the use of these algorithms. You can find more information on each of these algorithms at the following link: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html.

Pricing and Availability

Amazon SageMaker is available on a pay-per-use model. You pay for the Amazon EC2 compute and Amazon S3 storage requirements for your Amazon SageMaker notebook instances, Amazon SageMaker cloud model-training instances, Amazon EC2 instances that support real-time prediction endpoints, and Amazon EC2 instances that are created on demand to support batch prediction operations. Some features of this service are included in the AWS free-tier account for a period of two months after the account is created. You can get more details on the pricing model at https://aws.amazon.com/sagemaker/pricing/.

Amazon SageMaker is not available in all regions. You can get information on the regions in which it is available from the following URL: https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/.

Creating an Amazon SageMaker Notebook Instance

In this section you will learn to use the Amazon SageMaker management console to create an Amazon SageMaker notebook instance and access the Jupyter Notebook server running on that instance from your web browser. Notebook instances are fully managed ML compute EC2 instances and relevant support infrastructure that allow the instance to be reachable across the Internet.

Amazon EC2 instances use IAM role-based security to access other resources in your AWS account. The resources most commonly accessed by Amazon SageMaker notebook instances are Amazon S3 buckets: buckets that contain model-training and validation data, and buckets into which trained model files are saved. In order for the code in your notebook instance to access another resource in your account, such as an Amazon S3 bucket, the Amazon EC2 instance needs a set of credentials that it can use to access the underlying Amazon S3 APIs. While IAM roles provide the high-level mechanism by which the Amazon EC2 instance obtains these credentials, behind the scenes the AWS IAM service uses another AWS service called Amazon STS (Security Token Service) to generate the actual credentials.

In most use cases involving IAM roles, you do not need to concern yourself with Amazon STS. However, because Amazon STS has both a global endpoint and region-specific endpoints, and Amazon SageMaker makes use of region-specific Amazon STS endpoints, you must ensure that the region-specific Amazon STS endpoint is enabled for the AWS region in which you create your notebook instance. If you fail to do this, the Amazon SageMaker management console will be unable to create your notebook instance.

To enable the region-specific Amazon STS endpoint that corresponds to the region in which you wish to use Amazon SageMaker, log in to the AWS management console using your root account credentials and navigate to the IAM management console. Click the Account Settings link in the menu on the left side of the IAM management console and scroll down to the section titled Security Token Service Regions (Figure 16.1).

Screenshot of enabling region-specific Amazon STS endpoints, with regions, status, and actions options.

FIGURE 16.1 Enabling region-specific Amazon STS endpoints

Ensure that the row corresponding to the region in which you intend to use Amazon SageMaker is listed as Active. If it is not active, you can use the Activate hyperlink in the same row to activate the endpoint. Once you have ensured that the region-specific Amazon STS is enabled, you can proceed to create an Amazon SageMaker notebook instance. While you can continue to use your root account credentials to access the Amazon SageMaker management console, it is recommended that you log out and log back in to the AWS management console using the dedicated sign-in link for your development IAM user account.

Once you have logged back in to the AWS management console, use the region selector to select a region where the Amazon SageMaker service is available. The screenshots in this section assume that the console is connected to the EU (Ireland) region. Select the Amazon SageMaker service from the Services drop-down list (Figure 16.2).

Screenshot of accessing the Amazon SageMaker management console.

FIGURE 16.2 Accessing the Amazon SageMaker management console

Navigate to the list of notebook instances in your Amazon SageMaker account by clicking the Notebook Instances link on the Amazon SageMaker dashboard. Click the Create Notebook Instance button to create a new Internet-facing ML compute EC2 instance in your account with a Jupyter Notebook server hosted on the instance (Figure 16.3).

Screenshot of navigating to the list of notebook instances.

FIGURE 16.3 Navigating to the list of notebook instances

You will be asked to provide a name for the notebook instance and given an opportunity to customize some aspects of the notebook instance. By default, Amazon SageMaker will create an ml.t2.medium EC2 instance in the default Virtual Private Cloud and subnet in your AWS region. Once the EC2 instance is created, Amazon SageMaker will install an Anaconda distribution on the instance with a number of common Python packages that are required for data science. Amazon SageMaker will also configure a number of conda environments on the EC2 instance that let you choose from a number of commonly used combinations of Python language versions and packages.

You can choose a different type of EC2 instance, but keep in mind that usage charges vary by instance type and the ml.t2.medium instance is the cheapest option. This chapter will use a notebook instance called kmeans-iris-flowers. Type this value in the name field for the notebook instance (Figure 16.4) and scroll down to the Permissions And Encryption section.

Screenshot of specifying the name of the new Amazon SageMaker notebook instance.

FIGURE 16.4 Specifying the name of the new Amazon SageMaker notebook instance

The Amazon EC2 instance that Amazon SageMaker will create for you uses IAM policies to govern what AWS resources in your account can be accessed by the code running on that instance. At the very least, the IAM role must provide access to one or more Amazon S3 buckets that will contain the data that you will use during data exploration, model building, and evaluation. Locate the IAM Role drop-down and choose the Create A New Role option (Figure 16.5).

Screenshot of creating a new IAM role for the Amazon SageMaker notebook instance.

FIGURE 16.5 Creating a new IAM role for the Amazon SageMaker notebook instance

A pop-up window will appear asking you whether you want the role to allow access to specific Amazon S3 buckets. In a production scenario, you should be very specific about the bucket names you allow access to; however, for the purposes of this chapter, select the Any S3 Bucket option and click the Create Role button to create the new IAM role (Figure 16.6).

Screenshot of specifying the permissions policy for the new IAM role for Amazon SageMaker.

FIGURE 16.6 Specifying the permissions policy for the new IAM role for Amazon SageMaker

It is worth noting that even if you do not specify any specific bucket names, the IAM role will be built with a policy document that will allow access to any Amazon S3 bucket with the word sagemaker in its name. It is also worth noting that you will need to ensure that the Amazon S3 buckets you access from the notebook instance are in the same region as your notebook instance; otherwise you will incur additional cross-region data transfer charges.

Once you click the Create Role button, the pop-up window will be dismissed and a new IAM role that begins with the words AmazonSageMaker-ExecutionRole will be created in your account, and selected in the IAM Role drop-down list (Figure 16.7).

Screenshot of new IAM role for Amazon SageMaker.

FIGURE 16.7 New IAM role for Amazon SageMaker

You can, if you wish, use the IAM management console at a later date to modify the permissions policy associated with this IAM role to control which AWS resources can be accessed from your notebook instance, as well as what actions can be performed on those resources. If you create additional notebook instances in the future, you can reuse this same role with them.

Leave the rest of the options on the page at their defaults. Scroll down to the bottom of the page and click the Create Notebook Instance button to create the hosted notebook. It may take a few minutes for AWS to create a new EC2 instance and install the relevant software on the instance. Once the notebook is ready, you will see it listed in your Amazon SageMaker management console with the status of “In Service” (Figure 16.8).

Screenshot of Amazon SageMaker management console showing the new notebook instance.

FIGURE 16.8 Amazon SageMaker management console showing the new notebook instance

It is worth noting that when the notebook instance is listed as In Service, you will be billed for the compute resources utilized to maintain the notebook instance, whether you use it or not. It is therefore recommended to stop or delete notebook instances when they are not needed.

The data you store on a notebook instance is kept on a general-purpose SSD volume, and you are billed for the associated storage costs. While stopping a notebook instance will ensure you are not billed for the compute resources required to host the instance, you will continue to be billed for the storage costs of the SSD volume. When you restart the notebook instance, Amazon SageMaker will provision new compute resources to host your notebook and attach the SSD volume to the new virtual server. The effect of this is that data you save on your notebook instance will persist between stopping and restarting the instance; however, you will incur storage costs even when the notebook instance is stopped. When a notebook instance is stopped, you can change instance settings such as the IAM role associated with the instance, the EC2 instance family used to support the instance, and so on.

A deleted instance does not consume any compute or storage resources, and any files that you have created on the notebook instance will be lost once the instance is deleted. You can use the Actions menu to manage the state of a notebook instance as well as edit the settings of stopped instances (Figure 16.9).

Screenshot of Amazon SageMaker notebook instance management.

FIGURE 16.9 Amazon SageMaker notebook instance management

When a notebook is listed as In Service, you can access the Jupyter Notebook server using the Open Jupyter link. Amazon SageMaker also allows you to use JupyterLab to manage your notebook files; JupyterLab provides an IDE-like environment in which you can create and manage your Python .ipynb notebook files. This book does not use JupyterLab.

Preparing Test and Training Data

Creating a machine learning solution using Amazon SageMaker requires that you first explore the training data, select your features, and prepare a test and training dataset. While you can download datasets onto your notebook instance and use them within your Python notebook files for feature engineering, the final data for model building and evaluation must reside in an Amazon S3 bucket, and both the AWS boto3 SDK for Python and the Amazon SageMaker SDK for Python provide capabilities to transfer data from your notebook instance to Amazon S3 buckets. Since an Amazon S3 bucket is a must-have item for using Amazon SageMaker, the first thing you need to do is create one or more Amazon S3 buckets in your account, in the same AWS region as the Amazon SageMaker service.

In this chapter, the examples assume you have created two buckets:

  • awsml-sagemaker-source
  • awsml-sagemaker-results

Since bucket names need to be globally unique, if you intend to replicate the examples in this chapter you will need to substitute references to these buckets with buckets from your account.

When it comes to data exploration and feature engineering, you can perform these operations using standard Python libraries such as NumPy, Pandas, and Scikit-learn from a Jupyter Notebook file running on your local computer or on a cloud-based Amazon SageMaker notebook instance. You may find a local notebook faster and more cost-effective for data exploration and visualization tasks. If you are using a cloud-based Amazon SageMaker notebook instance for data exploration and feature engineering, keep in mind that not only will you have to upload the data to Amazon S3, but you will also have to pay the compute costs for the notebook instance, remember to stop the instance when you are not using it, and deal with the latency between executing Python code on a remote server and having the results delivered to your web browser.

The machine learning model that will be built in this chapter is based on the popular Iris flowers dataset. To keep this chapter focused on model building and making predictions with Amazon SageMaker, you will not perform any feature engineering on the dataset. A copy of the Kaggle version of the Iris flowers dataset as well as files that contain the test and training sets are provided with the resources that accompany this chapter.

Log in to the AWS management console using the dedicated sign-in link for your development IAM user account. Navigate to the Amazon S3 management console and locate the awsml-sagemaker-source bucket (Figure 16.10).

Screenshot of accessing the Amazon S3 bucket that will contain the training and validation files.

FIGURE 16.10 Accessing the Amazon S3 bucket that will contain the training and validation files

Locate the datasets/iris_dataset/Kaggle folder in the resources that accompany this chapter and upload the iris_train.csv and iris_test.csv files to the Amazon S3 bucket (Figure 16.11).

Screenshot of uploading the pre-split training and test data files to the Amazon S3 bucket.

FIGURE 16.11 Uploading the pre-split training and test data files to the Amazon S3 bucket

Techniques to prepare training and test datasets have been covered in Chapter 5. A Jupyter Notebook file called PreparingDatasets.ipynb is included with the resources that accompany this chapter. This file contains the Python code that was used to generate the iris_train.csv and iris_test.csv files on an instance of Jupyter Notebook running locally on the author's computer.

Training a Scikit-Learn Model on an Amazon SageMaker Notebook Instance

In this section you will train and evaluate a machine learning model on an Amazon SageMaker notebook instance using an algorithm implemented by Scikit-learn. Training on the notebook instance implies that you only have at your disposal the vCPU and vRAM capacity of the EC2 instance that is hosting the notebook server. Amazon SageMaker also allows you to use your notebook instance to create a training job that will train your model on dedicated high-performance compute instances and deploy the trained model on a cluster of high-performance compute instances that can be accessed using a REST prediction endpoint. You will learn to train models on dedicated compute instances later in this chapter.

Training a model locally on a notebook instance is only feasible if you want to use an algorithm that is implemented by a Python library such as Scikit-learn, and are in the exploratory phase of your data science project where you work with a small dataset (or subset of a larger dataset) to pick the best algorithm and hyperparameter combinations. Training on your notebook instance is faster (and cheaper) because you do not have to wait or pay for the additional compute costs involved with dedicated training instances. Training on your notebook instance is not feasible if you are training a complex deep-learning model over a large training set. Such models will benefit from high-performance parallel training that is provided by a cluster of dedicated cloud-based instances.

The model that will be implemented in this section is a classification model using the k-means clustering algorithm. The k-means algorithm is a simple algorithm that attempts to assign input observations into k clusters. The algorithm is unsupervised and does not require pre-labeled instances; you specify the number of clusters and it assigns each member of the training dataset into one of the clusters purely based on the feature variables. It is worth noting that Amazon SageMaker also provides a built-in implementation of the k-means algorithm, but you cannot use the built-in implementation to train on your local notebook instance.

Clustering algorithms are used to find groupings in data, but they can also be used to implement instance-based learning systems. Once all the observations have been assigned to the k clusters, the mathematical centroid of each cluster can be computed and stored for future predictions. The distance between the cluster centroids and a new observation for which a prediction is desired can be computed, and the cluster corresponding to the closest centroid can be returned as the predicted result. There are some variations on the manner in which the target cluster is selected, but the general idea remains the same.
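
The following small sketch, using Scikit-learn and a made-up toy dataset (not the Iris dataset), illustrates the nearest-centroid idea described above:

import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points (made-up data).
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# The fitted centroids are stored on the model.
print(kmeans.cluster_centers_)

# Predicting a new observation returns the index of the closest centroid.
new_point = np.array([[8.1, 8.0]])
distances = np.linalg.norm(kmeans.cluster_centers_ - new_point, axis=1)
print(np.argmin(distances))            # index of the closest centroid
print(kmeans.predict(new_point)[0])    # same value, computed by Scikit-learn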

The model you will build in this section will use the Iris flowers dataset, which you uploaded to an Amazon S3 bucket in the previous section. Navigate to the list of notebook instances in your Amazon SageMaker account by clicking the Notebook Instances link on the Amazon SageMaker dashboard. Launch the kmeans-iris-flowers notebook instance and create a new Jupyter Notebook file using the conda_python3 kernel (Figure 16.12).

Screenshot of creating a new Jupyter Notebook on an Amazon SageMaker notebook instance.

FIGURE 16.12 Creating a new Jupyter Notebook on an Amazon SageMaker notebook instance

A new Jupyter Notebook file called Untitled.ipynb will be created for you on the Amazon SageMaker notebook instance. The new notebook file will also open automatically in a new browser tab. Change the title of the notebook to sklearn-local-kmeans-iris-flowers (see Figure 16.13).

Screenshot of changing the title of a Jupyter Notebook file.

FIGURE 16.13 Changing the title of a Jupyter Notebook file

Scikit-learn implements the k-means algorithm in a class called KMeans in the sklearn.cluster package. Using Scikit-learn to train machine learning models was covered in Chapter 5, and therefore this section will assume that you know how to use Scikit-learn. Type the following code in an empty notebook cell:

import boto3
import sagemaker
import io
 
import pandas as pd
import numpy as np
 
# load training and validation dataset from Amazon S3
s3_client = boto3.client('s3')
s3_bucket_name='awsml-sagemaker-source'
 
response = s3_client.get_object(Bucket='awsml-sagemaker-source', Key='iris_train.csv')
response_body = response["Body"].read()
df_iris_train = pd.read_csv(io.BytesIO(response_body), header=0, delimiter=",", low_memory=False)
 
response = s3_client.get_object(Bucket='awsml-sagemaker-source', Key='iris_test.csv')
response_body = response["Body"].read()
df_iris_test = pd.read_csv(io.BytesIO(response_body), header=0, index_col=False, delimiter=",", low_memory=False)
 
# separate training and validation dataset into separate features and target variables
# assume that the first column in each dataset is the target variable.
# training a k-means classifier does not require labelled data, and therefore
# df_iris_target_train will not be used.
df_iris_features_train = df_iris_train.iloc[:,1:]
df_iris_target_train = df_iris_train.iloc[:,0]
 
df_iris_features_test= df_iris_test.iloc[:,1:]
df_iris_target_test = df_iris_test.iloc[:,0]
 
# create a KMeans multi-class classifier.
from sklearn.cluster import KMeans
 
kmeans_model = KMeans(n_clusters=3)
kmeans_model.fit(df_iris_features_train)
 
# use the  model to create predictions on the test set
kmeans_predictions = kmeans_model.predict(df_iris_features_test)

This code reads the iris_train.csv and iris_test.csv files from the Amazon S3 bucket called awsml-sagemaker-source and loads the data in these files into Pandas dataframes. Since you are going to use the k-means clustering algorithm, you do not need the target values when training the classifier, and therefore the code trains a KMeans classifier on the contents of the df_iris_features_train dataset alone.

Once the model is trained, you use it to make predictions on the test dataset. The clusters returned by the model are integers 0, 1, and 2 and you can use the Python print() function to examine the predictions:

# print predicted classes
print (kmeans_predictions)
[0 1 2 1 1 1 1 2 1 1 2 0 1 0 2 0 0 2 2 2 1 0 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1
 2]

In the Iris dataset, the target attribute species is a categorical string. Since the k-means classifier did not use the labels associated with the training data, there is no direct relation between the cluster numbers 0, 1, and 2 as returned by the classifier and the Iris-setosa, Iris-versicolor, and Iris-virginica labels used in the Iris dataset. If you were to use the Python print() function to examine the labels associated with the test set, you would see that they are all strings:

# print expected classes
print (df_iris_target_test.values.ravel())
 
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica' 'Iris-setosa'
 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
 'Iris-setosa' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
 'Iris-virginica' 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica']

You could use the LabelEncoder class provided by Scikit-learn to convert the target categorical attribute from string values to integers. The following snippet converts the strings Iris-setosa, Iris-versicolor, and Iris-virginica into numbers 0, 1, 2 and computes the confusion matrix to get an idea of the performance of the classifier. However, this mapping was entirely your choice, and were you to select a different mapping, the numbers in the confusion matrix would change:

# compute confusion matrix if 
# Iris-setosa = 0
# Iris-versicolor = 1
# Iris-virginica = 2
 
# Convert target variables 'species' from strings into integers.
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
labelEncoder.fit(df_iris_target_test)
df_iris_target_test = labelEncoder.transform(df_iris_target_test)
 
from sklearn.metrics import confusion_matrix
cm_kmeans = confusion_matrix(df_iris_target_test, kmeans_predictions)

The confusion matrix indicates that the model classifies most observations correctly, except for five instances of the third class, Iris-virginica, that have been classified as the second class, Iris-versicolor.

Training a Scikit-Learn Model on a Dedicated Training Instance

In this section, you will use the AWS SageMaker SDK for Python from your notebook instance to create a training job that will create a dedicated training instance and train a Scikit-learn k-means classifier on the instance. The trained model will be stored in an Amazon S3 bucket, after which you will use another function provided by the Amazon SageMaker Python SDK to deploy the model to a dedicated prediction instance and create a prediction endpoint. Finally, you will use the AWS SageMaker SDK for Python to interact with this prediction endpoint from your Jupyter Notebook file to make predictions on data the model has not seen (the test set).

A dedicated training instance is an EC2 instance that is used to train a model from data in an Amazon S3 bucket and save the trained model to another Amazon S3 bucket. However, unlike a notebook instance, a dedicated training instance does not have a Jupyter Notebook server and is terminated automatically once the training process is complete. Dedicated training instances are created from Docker images. For built-in machine learning algorithms, Amazon SageMaker packages the algorithms in Docker images that include the necessary software to train a machine learning model. The location of input and output buckets, as well as any hyperparameters for model training, are usually specified as command-line arguments when the Docker container is instantiated from the image.

In addition to Docker images for built-in algorithms, Amazon SageMaker also provides Docker images that include standard machine learning libraries like Scikit-learn, Google TensorFlow, and Apache MXNet. Regardless of whether the Docker image contains an implementation of a built-in algorithm, or a general-purpose machine-learning library, the Docker images themselves are stored in Docker registries within each AWS region. You will need to provide the path to the Docker image when creating the training job. You can find the paths to the Docker registries for Scikit-learn Docker images in each supported region at https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-docker-containers-frameworks.html.

The dedicated EC2 training instances, once created, will need to assume an IAM role that will allow code running in the training instances access to other resources in your AWS account such as Amazon S3 buckets that contain the training data. Since the IAM role that is used to create your notebook instance has the correct permissions to access the relevant AWS resources, the most common approach is to use the same IAM role for the EC2 training instances.

The Docker image that allows you to create a Scikit-learn training instance does not come pre-packaged with code that builds any particular type of classifier. You can think of a Scikit-learn training instance as a barebones virtual machine that comes pre-packaged with Python, Scikit-learn, and a number of popular machine learning Python frameworks. To build a model from training data on a Scikit-learn training instance, you will need to write your model-building code in a Python script file and execute the file on the training instance. The model training script will have code that can access training data from an Amazon S3 bucket, train a model using Scikit-learn classes, and store the trained model in an Amazon S3 bucket.

To get started, launch the kmeans-iris-flowers notebook instance and use the Upload button to upload the sklearn-kmeans-training-script.py file provided with the resources that accompany this chapter (Figure 16.14).

Screenshot of uploading a file to a notebook instance.

FIGURE 16.14 Uploading a file to a notebook instance

The Python code within this file is presented here:

import argparse
import pandas as pd
import os
 
from sklearn.cluster import KMeans
from sklearn.externals import joblib
from sklearn.preprocessing import LabelEncoder
 
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
 
    # Read any hyperparameters
    parser.add_argument('--n_clusters', type=int, default=3)
 
    # Sagemaker specific arguments, use environment values for defaults.
    parser.add_argument('--model_dir', type=str, 
        default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, 
        default=os.environ.get('SM_CHANNEL_TRAIN'))
 
    args = parser.parse_args()
 
    # Read input file iris_train.csv .
    input_file = os.path.join(args.train, 'iris_train.csv')
    df_iris_train = pd.read_csv(input_file, header=0, engine="python")
 
    # Convert target variables 'species' from strings into integers.
    labelEncoder = LabelEncoder()
    labelEncoder.fit(df_iris_train['species'])
    df_iris_train['species'] = labelEncoder.transform(df_iris_train['species'])
 
    # Separate training and validation dataset into 
    # separate features and target variables. 
    #
    # Assume that the first column in each dataset is the target variable.
    #
    # k-means does not require labelled training data, therefore 
    # df_iris_target_train will not be used.
    df_iris_features_train = df_iris_train.iloc[:,1:]
    df_iris_target_train = df_iris_train.iloc[:,0]
 
    # Create a K-Means multi-class classifier.
    kmeans_model = KMeans(n_clusters=args.n_clusters)
    kmeans_model.fit(df_iris_features_train)
 
    # Save the model.
    joblib.dump(kmeans_model, os.path.join(args.model_dir, 
        "sklearn-kmeans-model.joblib"))
 
 
# deserializer.
def model_fn(model_dir):
    model = joblib.load(os.path.join(model_dir, "sklearn-kmeans-model.joblib"))
    return model

Let's briefly look at the contents of this script. When you create a training job using the SageMaker Python SDK, you provide the name of the file that contains the training script, as well as any hyperparameters required by the script. The k-means model we will build will expect a single hyperparameter called n_clusters, which will be injected into the constructor of the Scikit-learn KMeans class.

When the EC2 instance starts up, your script file will be installed as a Python module with the same name as the name of the script file and executed using a command-line statement similar to:

python -m <your_script> --<your_hyperparameters>

For example, if the training job were created using the script file listed earlier, and the value of the n_clusters hyperparameter is specified as 3, then the command-line statement that will be executed on the training instance to kick off the training process will be:

python -m sklearn-kmeans-training-script --n_clusters 3

Within your script, you place your model training code inside an if __name__ == '__main__': condition block. Placing the training code inside this if condition block ensures that the code is only executed when the module is executed as the main module from the command line, and not when the module is executed by importing it into another Python file with a standard Python import statement.

Within the model-training code, you can use an ArgumentParser object to read any command-line arguments that were passed to the module by Amazon SageMaker. The training job that you will create later in this section will pass the number of k-means clusters as a hyperparameter. The following snippet demonstrates how you can read command-line arguments in your script file:

parser = argparse.ArgumentParser()
 
# Expect an argument called n_clusters, apply a default value if 
# argument is missing.
parser.add_argument('--n_clusters', type=int, default=3)
 
# Parse the arguments
args = parser.parse_args()
 
# You can now access the n_clusters command-line argument
# using args.n_clusters

In addition to hyperparameters that can be used to instantiate Scikit-learn classes, your script will also need some additional Amazon SageMaker–specific information such as the location of the training dataset, and the location where the trained model should be saved. You could pass this information to the script as additional command-line arguments, or you could read some of this information from environment variables within your script.

When a training instance is created, Amazon SageMaker automatically sets up several environment variables, and these environment variables can also be accessed within the script. If your script needs an Amazon SageMaker–specific runtime value that has not been provided as a command-line argument, a common practice is to fall back to using the corresponding environment variable as the default. The full list of environment variables that are available to scripts running on training instances is available at https://github.com/aws/sagemaker-containers.

The most common environment variables you are likely to read are:

  • SM_OUTPUT_DATA_DIR: Contains a filesystem path on the training instance where your script can store temporary artifacts that are needed to create the machine learning model. Behind the scenes, this local filesystem path is mapped to an Amazon S3 bucket, and therefore the files will be written to Amazon S3.
  • SM_MODEL_DIR: Contains a filesystem path on the training instance where your script should store the trained model. Behind the scenes, this local filesystem path is mapped to an Amazon S3 bucket, and therefore the trained model will be written to Amazon S3.
  • SM_CHANNEL_TRAIN: Contains a filesystem path on the training instance from which your script can read the training data. The variable name follows the pattern SM_CHANNEL_<CHANNEL_NAME>, so the channel named train used in this chapter is exposed as SM_CHANNEL_TRAIN. Behind the scenes, this local filesystem path is mapped to an Amazon S3 bucket, and therefore the training data will be read from Amazon S3.

The following snippet demonstrates how you can use the ArgumentParser object to read Amazon SageMaker–specific runtime information from the command-line arguments, and fall back to using environment variables if the values have not been specified on the command line:

parser = argparse.ArgumentParser()
 
# SageMaker specific arguments. 
# Defaults are set in the environment variables.
parser.add_argument('--model-dir', type=str,   
    default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--train', type=str, 
    default=os.environ['SM_CHANNEL_TRAIN'])
 
# Parse the arguments
args = parser.parse_args()
 
# You can now access the S3 bucket where the training files are stored
# via args.train
 

The actual code to create a Scikit-learn k-means classifier is similar to the code that was presented in the previous section when you trained the classifier on the local notebook instance. The model-building code in the script file starts by loading the training data present in a CSV file called iris_train.csv in the Amazon S3 bucket referenced by args.train and loads the contents of this file into a Pandas dataframe:

    # Read input file iris_train.csv .
    input_file = os.path.join(args.train, 'iris_train.csv')
    df_iris_train = pd.read_csv(input_file, header=0, engine="python")
 

Next, the model-building code uses Scikit-learn's LabelEncoder class to convert the categorical attribute species from discrete strings to integers. Before conversion, the target attribute (species) can have one of the string values Iris-setosa, Iris-versicolor, or Iris-virginica. After conversion, the target attribute will have one of the integer values 0, 1, 2, respectively:

 
    # Convert target variables 'species' from strings into integers.
    labelEncoder = LabelEncoder()
    labelEncoder.fit(df_iris_train['species'])
    df_iris_train['species'] = labelEncoder.transform(df_iris_train['species'])
 

The features and target variables are then extracted from the df_iris_train dataframe into separate dataframe objects and a KMeans classifier is trained on the features. The n_clusters argument of the KMeans constructor is assigned a value of args.n_clusters, which is a hyperparameter read from the command line:

 
    # Separate training and validation dataset into 
    # separate features and target variables. 
    #
    # Assume that the first column in each dataset is the target variable.
    #
    # k-means does not require labelled training data, therefore 
    # df_iris_target_train will not be used.
 
    df_iris_features_train = df_iris_train.iloc[:,1:]
    df_iris_target_train = df_iris_train.iloc[:,0]
 
    # Create a K-Means multi-class classifier.
    kmeans_model = KMeans(n_clusters=args.n_clusters)
    kmeans_model.fit(df_iris_features_train)

After the model is trained, it is saved to the Amazon S3 bucket specified in args.model_dir using the following statement:

    # Save the model.
    joblib.dump(kmeans_model, os.path.join(args.model_dir, 
        "sklearn-kmeans-model.joblib"))

The model will be saved to a file called sklearn-kmeans-model.joblib. It is worth noting that the training instance does not save the model artifacts automatically. If you do not explicitly save the model, the training instance will terminate after model training is complete and you will not have access to the model, and therefore will not be able to create a prediction instance.

The script file must also contain a function called model_fn(), which by convention is a function that will be used to read the model artifacts to create a prediction instance:

# deserializer.
def model_fn(model_dir):
    model = joblib.load(os.path.join(model_dir, "sklearn-kmeans-model.joblib"))
    return model

Now that you have prepared a model-training script file, it is time to use a Jupyter Notebook file to create a training job. Create a new Jupyter Notebook file on your notebook instance using the conda_python3 kernel. Change the title of the notebook to sklearn-kmeans-iris-flowers and type the following code in an empty notebook cell:

import sagemaker
 
# Get a SageMaker-compatible role used by this Notebook Instance.
role = sagemaker.get_execution_role()
 
# get a SageMaker session object, that can be
# used to manage the interaction with the SageMaker API.
sagemaker_session = sagemaker.Session()
 
# train a Scikit-learn KMeans classifier on a dedicated instance
# send hyperparameter n_clusters = 3.
from sagemaker.sklearn.estimator import SKLearn
 
sklearn = SKLearn(entry_point='sklearn-kmeans-training-script.py', 
                  train_instance_type='ml.m4.xlarge', 
                  role=role, 
                  sagemaker_session=sagemaker_session, 
                  hyperparameters={'n_clusters': 3},
                  output_path='s3://awsml-sagemaker-results/')
 
sklearn.fit({'train': 's3://awsml-sagemaker-source/'})

Execute the code in the notebook cell to launch a training job that will result in SageMaker creating a dedicated ml.m4.xlarge EC2 instance from the default Docker image for Scikit-learn model training. You can specify a different instance type, but keep in mind that more powerful instance types have higher running costs associated with them. The SKLearn class is part of the AWS SageMaker Python SDK and provides a convenient mechanism to handle end-to-end training and deployment of Scikit-learn models. The constructor for the class accepts several arguments, including the path to the Python script file that contains your model-building code, the type of training instance that you wish to create, the IAM role that should be assigned to the new instance, and model-building hyperparameters that will be passed as command-line arguments to your script running on the instance. You can learn more about the parameters of the SKLearn class at https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html.

When you execute the code in a notebook cell, Amazon SageMaker will create a new dedicated training instance and execute your script file within the instance once the instance is ready. This can be a time-consuming process and may take several minutes. While the training process is underway, you will see various status messages printed below the notebook cell (Figure 16.15).

Screenshot of the training process where various status messages are printed below the notebook cell.

FIGURE 16.15 Using a notebook instance to create a training job

The training process is complete when you see lines similar to the following in the output:

2019-04-28 19:31:46,805 sagemaker-containers INFO     Reporting training SUCCESS
 
2019-04-28 19:31:54 Completed - Training job completed
Billable seconds: 24

After the training process is complete, Amazon SageMaker will automatically terminate the instance that was created to support the training. The model that was created will be saved to the awsml-sagemaker-results bucket in a folder that has the structure <job name>/output/. You can find the value of the job name by inspecting the job_name attribute of the SM_TRAINING_ENV variable in the log messages. You can also access the full path to the model artifact by examining the value of the module_dir attribute in the same log messages.

In the log messages from this training job, the file referenced in the module_dir attribute is s3://awsml-sagemaker-results/sagemaker-scikit-learn-2019-04-28-19-35-16-021/source/sourcedir.tar.gz, and not sklearn-kmeans-model.joblib as you had specified in your Python training script. This is because Amazon SageMaker compresses the model artifacts into a tar.gz file before uploading them to Amazon S3. If you were to download and extract the tar.gz file, you would find the sklearn-kmeans-model.joblib file inside it.
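
If you prefer not to dig through the log output, the estimator object also exposes the location of the compressed model artifact directly. A one-line sketch, assuming the sklearn estimator from the training cell above:

# S3 path of the compressed model artifact produced by the training job.
print(sklearn.model_data)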

You do not necessarily have to make a note of the location of the model or the job name listed in the messages below the notebook cell. You can also view a list of trained models from the Models menu item of the AWS SageMaker management console (Figure 16.16) and access the model artifact file from there.

Screenshot of a list of trained models from the models menu item of the AWS SageMaker management console.

FIGURE 16.16 List of trained models

Now that you have a trained model, you can deploy the model to one or more dedicated prediction instances and then either create an HTTPS API endpoint for creating predictions on one item at a time, or create a batch transform job to obtain predictions on entire datasets. Using batch transforms is not covered in this chapter; however, you can find more information at https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-batch-transform.html.

You can deploy models into prediction instances using several different methods, including using the AWS SageMaker Python SDK and the AWS management console. Regardless of the manner in which you choose to deploy the model, the process of deployment involves first creating an endpoint configuration object, and then using the endpoint configuration object to create the prediction instances and the HTTPS endpoint that can be used to get inferences from the deployed model. The endpoint configuration contains information on the location of the model file, the type and number of compute instances that the model will be deployed to, and the auto-scaling policy to be used to scale up the number of instances as needed.

If you use the high-level Amazon SageMaker Python SDK to deploy your model, the SDK provides a convenience function that takes care of creating the endpoint configuration, creation of compute resources, deployment of the model, and the creation of the HTTPS endpoint. If you use the lower-level AWS boto3 Python SDK you will need to perform the individual steps in sequence yourself. Keep in mind that prediction instances incur additional costs and are not automatically terminated after you have made predictions.

Execute the following code in an empty cell of the sklearn-kmeans-iris-flowers notebook to deploy the trained Scikit-learn model to a single ML compute instance and create an HTTPS endpoint:

# create a prediction instance
predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

This code assumes that the sklearn object has been created in the previous notebook cell where you trained the model. This is important because the deploy() function does not have an argument that lets you specify the path to the model file, but instead uses whatever model file is referenced within the sklearn object.

When you execute this code, Amazon SageMaker will take several minutes to create an endpoint configuration, create the compute instances, deploy your model onto the instances, and create an HTTPS endpoint. You can view the status of the process in the log messages printed below the notebook cell.

Once the prediction endpoint is created, the deploy() function will return an object of class SKLearnPredictor, which provides a convenience function called predict() that can be used to make predictions from the HTTPS endpoint. The prediction endpoint is Internet-facing and is secured using AWS Signature V4 authentication. You can access this endpoint in a variety of ways, including the AWS CLI, tools such as Postman, and the language-specific AWS SDKs. If you use the AWS CLI, or one of the AWS SDKs, authentication will be achieved behind the scenes for you. You can learn more about AWS Signature V4 at https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html.
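
As an illustration of accessing the endpoint with one of the AWS SDKs, the following hedged sketch uses the boto3 runtime client, which signs the request with AWS Signature V4 on your behalf; the single-row CSV payload and the assumption that the deployed container accepts text/csv input are illustrative:

import boto3

runtime = boto3.client('sagemaker-runtime')

# Send a single observation (four feature values) to the deployed endpoint.
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint,   # name of the endpoint created by deploy()
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2')

print(response['Body'].read())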

If you want to expose the prediction endpoint as a service to your consumers, you can create an endpoint on an Amazon API Gateway instance, and set up the API Gateway to execute an AWS Lambda function when it receives a request. You can then use one of the language-specific SDKs in the AWS Lambda function to interact with the prediction endpoint. The advantage of this approach is that your clients do not need to worry about AWS Signature V4 authentication, and an API Gateway can provide several features that are critical to managing the business-related aspects of a commercial API service, such as credential management, credential rotation, support for OIDC, API versioning, traffic management and so on.

If you are using the AWS SageMaker Python SDK from a notebook instance, you can use the predict() function of the SKLearnPredictor instance to make single predictions. Behind the scenes, the predict() function will use a temporary set of credentials associated with the IAM role assumed by the notebook instance to authenticate with the prediction endpoint. Execute the following code in an empty notebook cell to make predictions on the test dataset stored in the file iris_test.csv using the predict() function:

# load iris_test.csv from Amazon S3 and split the features 
# and target variables into separate dataframes.
import boto3
import sagemaker
import io
 
import pandas as pd
import numpy as np
 
# load training and validation dataset from Amazon S3
s3_client = boto3.client('s3')
s3_bucket_name='awsml-sagemaker-source'
 
response = s3_client.get_object(Bucket='awsml-sagemaker-source', Key='iris_test.csv')
response_body = response["Body"].read()
df_iris_test = pd.read_csv(io.BytesIO(response_body), header=0, index_col=False, delimiter=",", low_memory=False)
 
# Convert target variables 'species' from strings into integers.
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
labelEncoder.fit(df_iris_test['species'])
df_iris_test['species'] = labelEncoder.transform(df_iris_test['species'])
 
# separate validation dataset into separate features and target variables
# assume that the first column in each dataset is the target variable.
df_iris_features_test= df_iris_test.iloc[:,1:]
df_iris_target_test = df_iris_test.iloc[:,0]
 
 
# use the prediction instance to create predictions.
predictions = predictor.predict(df_iris_features_test.values)

You can use the Python print() function to examine the predictions:
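
# print predicted cluster indices
# (the exact values depend on the trained model)
print(predictions)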

To deactivate the prediction endpoint and terminate the associated prediction instances, you can, once again, use the high-level interface provided by the AWS SageMaker Python SDK from a free notebook cell.
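
A minimal sketch, assuming the sklearn estimator object created in the training cell earlier in this notebook, is to call its delete_endpoint() convenience method:

# Delete the HTTPS prediction endpoint and release the instances behind it.
sklearn.delete_endpoint()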

You can also use the AWS SageMaker management console to view the list of active prediction endpoints, deactivate a prediction endpoint and terminate the associated prediction instances, and create new prediction endpoints from endpoint configurations.

Training a Model Using a Built-in Algorithm on a Dedicated Training Instance

In the previous sections of this chapter you created Scikit-learn-based k-means classifiers and trained them both on the notebook instance and on a dedicated training instance. In this section you will use your notebook instance to train a k-means classifier on the Iris flowers dataset using Amazon SageMaker's built-in implementation of the k-means algorithm. Amazon SageMaker provides cloud-optimized implementations of several popular machine learning algorithms, and k-means is one of them. The steps involved in training a model with a built-in algorithm are similar to, and somewhat simpler than, the steps involved in training the model on a dedicated Scikit-learn instance. Amazon SageMaker provides Docker images for each built-in algorithm. These algorithm-specific Docker images are very similar to the generic Scikit-learn Docker image, except that you do not need to deploy any model-building code onto them. The code is already present in the image and is preconfigured to accept a standard set of inputs, such as hyperparameters and the location of the training data, from command-line arguments.

The general format of the path to an algorithm-specific Docker image is as follows:

<ecr_path>/<algorithm>:<tag>

The ecr_path portion refers to the path to the Amazon ECR Docker registry for the region in which you want to create the training instances, the algorithm portion is an identifier that identifies a specific algorithm, and the tag portion identifies a version of the Docker image. A full list of algorithm-specific Docker images is available at https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-Docker-registry-paths.html.

Using the information on this page at the time this chapter was written, the path to the latest version of the Docker image that contains the built-in k-means classifier, in the eu-west-1 region, is 438346466558.dkr.ecr.eu-west-1.amazonaws.com/kmeans:latest. It is worth noting that not all algorithms may be available in each AWS region.
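If you would rather resolve the registry path programmatically than look it up in the documentation, the SDK includes a helper function for this purpose. The following is a minimal sketch, assuming version 1.x of the AWS SageMaker SDK for Python (later versions expose equivalent functionality through sagemaker.image_uris.retrieve()):

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
 
# resolve the registry path of the built-in k-means image for
# the region in which the notebook instance is running.
region = boto3.Session().region_name
kmeans_image = get_image_uri(region, 'kmeans')
print (kmeans_image)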

If you are using the high-level interface exposed by the AWS SageMaker SDK for Python from your notebook instance, you do not need to explicitly specify the path to the Docker image. If, however, you are using some of the lower-level classes, or the AWS boto3 SDK, you will need to provide the path to the image. This chapter does not cover using the lower-level Estimator class or the low-level boto3 SDK; if you would like more information on training a model based on a built-in algorithm with the low-level SDK, visit https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model.html.

The rest of this section will look at using the high-level interface exposed by the AWS SageMaker SDK for Python to create a training job using a built-in algorithm from a notebook instance. To get started, launch the kmeans-iris-flowers notebook instance and create a new Jupyter Notebook file on your notebook instance using the conda_python3 kernel. Change the title of the notebook to kmeans-iris-flowers and type the following code in an empty notebook cell:

import sagemaker
import boto3
import io
import pandas as pd
import numpy as np
 
# Get a SageMaker-compatible role used by this Notebook Instance.
role = sagemaker.get_execution_role()
 
# get a SageMaker session object, that can be
# used to manage the interaction with the SageMaker API.
sagemaker_session = sagemaker.Session()
 
# create a training job to train a KMeans model using
# Amazon SageMaker's own implementation of the k-means algorithm
#
# set hyperparameter k = 3
from sagemaker import KMeans
 
input_location = 's3://awsml-sagemaker-source/iris_train.csv'
output_location = 's3://awsml-sagemaker-results'
 
kmeans_estimator = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.m4.xlarge',
                output_path=output_location,
                k=3)
 
 
# load training and validation dataset from Amazon S3
s3_client = boto3.client('s3')
s3_bucket_name='awsml-sagemaker-source'
 
response = s3_client.get_object(Bucket='awsml-sagemaker-source', Key='iris_train.csv')
response_body = response["Body"].read()
df_iris_train = pd.read_csv(io.BytesIO(response_body), header=0, delimiter=",", low_memory=False)
 
response = s3_client.get_object(Bucket='awsml-sagemaker-source', Key='iris_test.csv')
response_body = response["Body"].read()
df_iris_test = pd.read_csv(io.BytesIO(response_body), header=0, index_col=False, delimiter=",", low_memory=False)
 
# Convert target variables 'species' from strings into integers.
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
labelEncoder.fit(df_iris_train['species'])
labelEncoder.fit(df_iris_test['species'])
df_iris_train['species'] = labelEncoder.transform(df_iris_train['species'])
df_iris_test['species'] = labelEncoder.transform(df_iris_test['species'])
 
# separate training and validation dataset into separate features and target datasets
# assuming that the first column of the iris_train.csv and iris_test.csv files
# contains the target attribute.
#
# since training a k-means classifier does not require labelled training data,
# you will not make use of df_iris_target_train
 
df_iris_features_train= df_iris_train.iloc[:,1:]
df_iris_target_train = df_iris_train.iloc[:,0]
 
df_iris_features_test= df_iris_test.iloc[:,1:]
df_iris_target_test = df_iris_test.iloc[:,0]
 
 
# create a training job.
train_data = df_iris_features_train.values.astype('float32')
record_set = kmeans_estimator.record_set(train_data)
kmeans_estimator.fit(record_set)

Execute the code in the notebook cell to launch a training job. Amazon SageMaker will create a dedicated ml.m4.xlarge EC2 instance from the Docker image that contains the code for the k-means algorithm and kick off model training. Model training will take several minutes, during which time you will see log messages appear beneath the notebook cell (Figure 16.17). Once the model is trained, Amazon SageMaker will save it to the awsml-sagemaker-results Amazon S3 bucket and terminate the training instance.

FIGURE 16.17 Training a model based on a built-in algorithm using an AWS SageMaker notebook instance
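If you want to confirm where the trained model artifact was saved, you can inspect the estimator object after fit() returns. This is a minimal sketch, assuming the kmeans_estimator object from the previous cell:

# Amazon S3 location of the model artifact produced by the training job.
print (kmeans_estimator.model_data)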

Let's briefly examine some of the key aspects of this code snippet. You start by accessing the IAM role associated with the notebook instance using the following statement:

role = sagemaker.get_execution_role()

You then instantiate a KMeans class with the IAM role, the type of training instance you want, the number of instances, the location of the output data, and the number of clusters:

kmeans_estimator = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.m4.xlarge',
                output_path=output_location,
                k=3)

The KMeans class is part of the AWS SageMaker SDK for Python and provides a high-level interface to create a training job with the built-in k-means algorithm. You can learn more about the KMeans class at https://sagemaker.readthedocs.io/en/stable/kmeans.html.

You then proceed to load the training and test data from the iris_train.csv and iris_test.csv files in the awsml-sagemaker-source Amazon S3 bucket into Pandas dataframes and convert the categorical target attribute species into a number:

# load training and validation dataset from Amazon S3
s3_client = boto3.client('s3')
s3_bucket_name='awsml-sagemaker-source'
 
response = s3_client.get_object(Bucket='awsml-sagemaker-source', Key='iris_train.csv')
response_body = response["Body"].read()
df_iris_train = pd.read_csv(io.BytesIO(response_body), header=0, delimiter=",", low_memory=False)
 
response = s3_client.get_object(Bucket='awsml-sagemaker-source', Key='iris_test.csv')
response_body = response["Body"].read()
df_iris_test = pd.read_csv(io.BytesIO(response_body), header=0, index_col=False, delimiter=",", low_memory=False)
 
# Convert target variables 'species' from strings into integers.
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
labelEncoder.fit(df_iris_train['species'])
labelEncoder.fit(df_iris_test['species'])
df_iris_train['species'] = labelEncoder.transform(df_iris_train['species'])
df_iris_test['species'] = labelEncoder.transform(df_iris_test['species'])

Creating the training job is achieved by calling the fit() method on the KMeans instance. However, Amazon SageMaker's implementation of k-means prefers the training data to be specified in the protobuf recordIO format. You can learn more about this format at https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html.
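The record_set() helper used in the next snippet performs this conversion for you. If you ever need to prepare recordIO-protobuf data yourself, for example to stage it in Amazon S3 for a lower-level training job, the following is a minimal sketch, assuming version 1.x of the SDK and a hypothetical destination key in the awsml-sagemaker-source bucket:

import io
import boto3
import sagemaker.amazon.common as smac
 
# convert the float32 feature matrix into an in-memory
# protobuf recordIO buffer.
train_data = df_iris_features_train.values.astype('float32')
buffer = io.BytesIO()
smac.write_numpy_to_dense_tensor(buffer, train_data)
buffer.seek(0)
 
# upload the recordIO buffer to Amazon S3; the key is a
# hypothetical example.
s3_client = boto3.client('s3')
s3_client.upload_fileobj(buffer, 'awsml-sagemaker-source', 'kmeans/train/iris.recordio')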

The following lines of code split the training dataset into a dataframe that contains the features and one that contains the target labels. The feature dataframe is converted to a NumPy array and then to a protobuf recordIO buffer. The recordIO buffer is provided as input to the fit() method to kick off the model training job:

df_iris_features_train= df_iris_train.iloc[:,1:]
df_iris_target_train = df_iris_train.iloc[:,0]
 
df_iris_features_test= df_iris_test.iloc[:,1:]
df_iris_target_test = df_iris_test.iloc[:,0]
 
# create a training job.
train_data = df_iris_features_train.values.astype('float32')
record_set = kmeans_estimator.record_set(train_data)
kmeans_estimator.fit(record_set)

The KMeans class is known as an estimator class and inherits from a base class called EstimatorBase. The AWS SageMaker SDK for Python contains subclasses of EstimatorBase corresponding to the different built-in algorithms supported by AWS SageMaker. There is also a subclass called Estimator that can be used to train a model based on any of the built-in algorithms, and provides lower-level controls such as the ability to choose a specific Docker image. You can learn more about the Estimator class at https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.
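For reference, the following is a hedged sketch of what using the lower-level Estimator class with an explicit Docker image might look like, assuming version 1.x of the SDK; the output bucket and hyperparameter values mirror those used earlier in this section:

import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
 
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
 
# resolve the registry path of the built-in k-means image.
kmeans_image = get_image_uri(region, 'kmeans')
 
# create a generic estimator that references the k-means image explicitly.
generic_kmeans = Estimator(image_name=kmeans_image,
                role=role,
                train_instance_count=1,
                train_instance_type='ml.m4.xlarge',
                output_path='s3://awsml-sagemaker-results')
 
# the built-in k-means algorithm requires the number of clusters (k)
# and the number of features (feature_dim) as hyperparameters.
generic_kmeans.set_hyperparameters(k=3, feature_dim=4)

You would then call fit() on this estimator with the Amazon S3 location of training data in the protobuf recordIO or CSV format.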

When the model has finished training, execute the following code in an empty notebook cell to deploy the model to a prediction instance and create an HTTPS endpoint that can be used to make predictions:

# deploy the model to a prediction instance
# and create a prediction endpoint.
predictor = kmeans_estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

Deploying the model may take several minutes. Once the model is deployed, you can use it to make predictions on the Iris flowers test set:

test_data = df_iris_features_test.values.astype('float32')
 
predictions = predictor.predict(test_data)
print (predictions)

The prediction for each row in the test data is returned as a JSON object that contains information on the cluster label, and the distance of the row from the centroid of the cluster:

label {
  key: "closest_cluster"
  value {
    float32_tensor {
      values: 0.0
    }
  }
}
label {
  key: "distance:to_cluster"
  value {
    float32_tensor {
      values: 0.33853915333747864
    }
  }
}
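Each element of the predictions list is a protobuf record object. A minimal sketch, assuming the response structure shown above, of extracting the cluster label and the distance to the cluster centroid for each row of the test data:

# extract the cluster label and the distance to the cluster
# centroid from each prediction record.
cluster_labels = [
    record.label['closest_cluster'].float32_tensor.values[0]
    for record in predictions
]
distances = [
    record.label['distance_to_cluster'].float32_tensor.values[0]
    for record in predictions
]
 
print (cluster_labels)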

To terminate the prediction instance and the associated HTTPS endpoint, execute the following code in a notebook cell:

# terminate the prediction instance and associated
# HTTPS endpoint.
kmeans_estimator.delete_endpoint()
 

Summary

  • Amazon SageMaker is a fully managed web service that provides the ability to explore data, engineer features, and train machine learning models on AWS cloud infrastructure using Python code.
  • Amazon SageMaker provides implementations of a number of cloud-optimized versions of machine learning algorithms such as XGBoost, factorization machines, and PCA.
  • Amazon SageMaker also provides the ability to create your own algorithms based on popular frameworks such as Scikit-learn, Google TensorFlow, and Apache MXNet.
  • The Amazon SageMaker SDK for Python provides a convenient high-level, object-oriented interface for Python developers.
  • Amazon SageMaker notebook instances are EC2 instances in your account that come preconfigured with a Jupyter Notebook server, and a number of common Python libraries.
  • Training jobs create dedicated compute instances in the AWS cloud that contain model-building code, load your training data from Amazon S3, execute the model-building code on your training data, and save the trained model to Amazon S3.
  • When the training job is complete, the compute instances that were provisioned to support the training process will automatically be terminated.
  • Deploying a model into production involves creating one or more compute instances, deploying your model onto these instances, and providing an API that can be used to make predictions using the deployed model.
  • Prediction instances are not automatically terminated.
  • An endpoint configuration ties together information on the location of a trained machine learning model, type of compute instances, and an auto-scaling policy.
  • A prediction endpoint is an HTTPS REST API endpoint that can be used to get single predictions from a deployed model. The HTTPS endpoint is secured using AWS Signature V4 authentication.
  • If you need to make predictions on an entire dataset, you can use Amazon SageMaker Batch Transform to create a batch prediction job from a trained model.