To support a large number of fast-moving machine learning (ML) initiatives, many organizations decide to build enterprise ML platforms that support the full ML life cycle and a wide range of usage patterns, and that are automated and scalable. As a practitioner, I have often been asked to provide architecture guidance on how to build enterprise ML platforms. In this chapter, we will discuss the core requirements for enterprise ML platform design and implementation, covering topics such as workflow automation, infrastructure scalability, and system monitoring. You will learn about architecture patterns for building technology solutions that automate the end-to-end ML workflow and deployment at scale. We will also dive deep into other core enterprise ML architecture components, such as model training, model hosting, the feature store, and the model registry at enterprise scale.
Specifically, we will cover the following topics:

- Key requirements for an enterprise ML platform
- Building an enterprise ML platform on AWS
- Deep dive into the model training environment
- Deep dive into the model hosting environment
- The MLOps architecture and automation pipelines
- Monitoring an enterprise ML platform
- Service provisioning management
- Hands-on exercise – building an MLOps pipeline on AWS
Governance and security are also important topics for enterprise ML; we will cover them in greater detail in Chapter 11, ML Governance, Bias, Explainability, and Privacy. To get started, let's discuss the key requirements for an enterprise ML platform.
We will continue to use the AWS environment for the hands-on portion of this chapter. All the source code mentioned in this chapter can be found at https://github.com/PacktPublishing/The-Machine-Learning-Solutions-Architect-Handbook/tree/main/Chapter09.
To deliver business value with ML at scale, organizations need to be able to experiment quickly with different scientific approaches, ML technologies, and datasets. Once the ML models have been trained and validated, they need to be deployed to production with minimal friction. While an enterprise ML platform shares requirements with traditional enterprise software systems, such as scalability and security, it also poses many unique challenges, such as integration with the data platform and with high-performance computing infrastructure for large-scale model training. Now, let's talk about some specific enterprise ML platform requirements:
With that, we have talked about the key requirements of an enterprise ML platform. Next, let's discuss how AWS ML and DevOps services, such as SageMaker, CodePipeline, and Step Functions, can be used to build an enterprise-grade ML platform.
Building an enterprise ML platform on AWS starts with creating different environments to enable different data science and operations functions. The following diagram shows the core environments that normally make up an enterprise ML platform. From an isolation perspective, in the context of the AWS cloud, each environment in the following diagram is a separate AWS account:
As we discussed in Chapter 8, Building a Data Science Environment Using AWS ML Services, data scientists use the data science environment for experimentation, model building, and tuning. Once these experiments are completed, the data scientists commit their work to the proper code and data repositories. The next step is to train and tune the ML models in a controlled and automated environment using the algorithms, data, and training scripts that were created by the data scientists. This controlled and automated model training process will help ensure consistency, reproducibility, and traceability for model building at scale. The following are the core functionalities and technology options provided by the training, hosting, and shared services environments:
In addition to the core ML environments, there are other dependent environments, such as security, governance, monitoring, and logging, that are required in the enterprise ML platform:
With that, you have had an overview of the core building blocks of an enterprise ML platform. Next, let's dive deep into several core areas. Note that there are different patterns and services we can follow to build an ML platform on AWS. In this chapter, we will cover one of the enterprise patterns.
Within an enterprise, a model training environment is a controlled environment with well-defined processes and policies on how it is used and who can use it. Normally, it should be an automated environment that's managed by an MLOps team, though it can also be self-service enabled for direct use by data scientists.
Automated model training and tuning are the core capabilities of the model training environment. To support a broad range of use cases, a model training environment needs to support different ML and deep learning frameworks, training patterns (such as single-node and distributed training), and hardware (different CPUs and GPUs).
The model training environment manages the life cycle of the model training process. This can include authentication and authorization, infrastructure provisioning, data movement, data preprocessing, ML library deployment, training loop management and monitoring, model persistence and registry, training job management, and lineage tracking. From a security perspective, the training environment needs to provide security capabilities for different isolation requirements, such as network isolation, job isolation, and artifact isolation. To assist with operational support, a model training environment also needs to support training status logging, metrics reporting, and training job monitoring and alerting.
Next, let's learn how the SageMaker training service can be used in a controlled model training environment in an enterprise setting.
The SageMaker training service provides built-in model training capabilities for a range of ML/DL libraries. In addition, you can bring your own Docker containers for custom model training needs. The following is a subset of the options supported by the SageMaker Python SDK:
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    entry_point="<Training script name>",
    role="<AWS IAM role>",
    instance_count=<Number of instances>,
    instance_type="<Instance type>",
    framework_version="<TensorFlow version>",
    py_version="<Python version>",
)
tf_estimator.fit("<Training data location>")
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
    entry_point="<Training script name>",
    role="<AWS IAM role>",
    instance_count=<Number of instances>,
    instance_type="<Instance type>",
    framework_version="<PyTorch version>",
    py_version="<Python version>",
)
pytorch_estimator.fit("<Training data location>")
from sagemaker.xgboost.estimator import XGBoost

xgb_estimator = XGBoost(
    entry_point="<Training script name>",
    hyperparameters=<Dictionary of hyperparameters>,
    role="<AWS IAM role>",
    instance_count=<Number of instances>,
    instance_type="<Instance type>",
    framework_version="<XGBoost version>",
)
xgb_estimator.fit("<Training data location>")
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point="<Training script name>",
    hyperparameters=<Dictionary of hyperparameters>,
    role="<AWS IAM role>",
    instance_count=<Number of instances>,
    instance_type="<Instance type>",
    framework_version="<scikit-learn version>",
)
sklearn_estimator.fit("<Training data location>")
from sagemaker.estimator import Estimator

custom_estimator = Estimator(
    image_uri="<Custom training container image URI>",
    role="<AWS IAM role>",
    instance_count=<Number of instances>,
    instance_type="<Instance type>",
)
custom_estimator.fit("<Training data location>")
In addition to using the SageMaker Python SDK to kick off training, you can also use the boto3 library and SageMaker CLI commands to start training jobs.
The SageMaker training service is exposed through a set of APIs and can be automated by integrating with external applications or workflow tools, such as Airflow and AWS Step Functions. For example, it can be one of the steps in an Airflow-based pipeline for an end-to-end ML workflow. Some workflow tools, such as Airflow and AWS Step Functions, also provide SageMaker-specific connectors to interact with the SageMaker training service more seamlessly. The SageMaker training service also provides Kubernetes operators, so it can be integrated and automated as part of the Kubernetes application flow. The following sample code shows how to kick off a training job using the low-level API via the AWS boto3 SDK:
import boto3

client = boto3.client('sagemaker')
response = client.create_training_job(
    TrainingJobName='<job name>',
    HyperParameters={<dictionary of hyperparameter names and values>},
    AlgorithmSpecification={...},
    RoleArn='<AWS IAM role>',
    InputDataConfig=[...],
    OutputDataConfig={...},
    ResourceConfig={...},
    ...
)
When using Airflow as the workflow tool, the following sample shows how to use the Airflow SageMaker operator as part of a workflow definition. Here, train_config contains the training configuration details, such as the training estimator, training instance type and count, and training data location:
import airflow
from airflow import DAG
from airflow.contrib.operators.sagemaker_training_operator import SageMakerTrainingOperator

default_args = {
    'owner': 'myflow',
    'start_date': '2021-01-01'
}

dag = DAG('tensorflow_training', default_args=default_args,
          schedule_interval='@once')

train_op = SageMakerTrainingOperator(
    task_id='tf_training',
    config=train_config,
    wait_for_completion=True,
    dag=dag)
SageMaker also has a built-in workflow automation tool called SageMaker Pipelines. A training step can be created using the SageMaker TrainingStep API and become part of the larger SageMaker Pipelines workflow.
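As a minimal, hedged sketch of this approach (the pipeline name, S3 locations, and role are placeholders, and tf_estimator is assumed to be an estimator like the one created earlier), a training step and pipeline can be defined as follows:

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.pipeline import Pipeline

# Wrap an existing estimator (for example, the tf_estimator created
# earlier) in a pipeline training step
train_step = TrainingStep(
    name="ModelTraining",
    estimator=tf_estimator,
    inputs={
        "train": TrainingInput(s3_data="s3://<bucket>/<prefix>/train")
    },
)

# A full pipeline would typically chain processing, training, evaluation,
# and model registration steps; here it contains just the training step
pipeline = Pipeline(name="<pipeline name>", steps=[train_step])
pipeline.upsert(role_arn="<AWS IAM role>")  # create or update the pipeline definition
execution = pipeline.start()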
SageMaker training manages the life cycle of the model training process. It uses AWS IAM to authenticate and authorize access to its functions. Once a request is authorized, it provisions the required infrastructure, deploys the software stack for the model training requirements, moves the data from the sources to the training nodes, and kicks off the training job. Once the training job has been completed, the model artifacts are saved to an S3 output bucket and the infrastructure is torn down. For lineage tracking, model training metadata such as the source datasets, model training containers, hyperparameters, and model output locations is captured. Any logging from the training job runs is saved to CloudWatch Logs, and system metrics such as CPU and GPU utilization are captured as CloudWatch metrics.
Depending on the overall end-to-end ML platform architecture, a model training environment can also host services for data preprocessing, model validation, and model training postprocessing, as those are important steps in an end-to-end ML flow. There are multiple technology options available for this, such as the SageMaker Processing service and AWS Lambda.
An enterprise-grade model hosting environment needs to support a broad range of ML frameworks in a secure, performant, and scalable way. It should come with a list of pre-built inference engines that can serve common models out of the box behind a RESTful API or via the gRPC protocol. It also needs to provide flexibility to host custom-built inference engines for unique requirements. Users should also have access to different hardware devices, such as CPU, GPU, and purpose-built chips, for the different inference needs.
Some model inference patterns demand more complex inference graphs, such as traffic splits, request transformations, or model ensemble support. A model hosting environment can provide this capability as an out-of-the-box feature or provide technology options for building custom inference graphs. Other common model hosting capabilities include concept drift detection and model performance drift detection. Concept drift occurs when the statistical characteristics of the production data deviate from those of the data used for model training; for example, the mean and standard deviation of a feature might change significantly in production compared to the training dataset.
Components in a model hosting environment can participate in an automation workflow through its API, scripting, or IaC deployment (such as AWS CloudFormation). For example, a RESTful endpoint can be deployed using a CloudFormation template or by invoking its API as part of an automated workflow.
From a security perspective, the model hosting environment needs to provide authentication and authorization control to manage access to both the control plane (management functions) and data plane (model endpoints). The accesses and operations that are performed against the hosting environments should be logged for auditing purposes. For operations support, a hosting environment needs to enable status logging and system monitoring to support system observability and problem troubleshooting.
The SageMaker hosting service is a fully managed model hosting service. Similar to KFServing and Seldon Core, which we reviewed earlier in this book, the SageMaker hosting service is also a multi-framework model serving service. Next, let's take a closer look at its various capabilities for enterprise-grade model hosting.
SageMaker provides built-in inference engines for multiple ML frameworks, including scikit-learn, XGBoost, TensorFlow, PyTorch, and Spark ML. SageMaker supplies these built-in inference engines as Docker containers. To stand up an API endpoint to serve a model, you just need to provide the model artifacts and infrastructure configuration. The following is a list of model serving options:
from sagemaker.tensorflow.serving import Model

tensorflow_model = Model(
    model_data=<S3 location of the TensorFlow model artifacts>,
    role=<AWS IAM role>,
    framework_version=<TensorFlow version>
)
tensorflow_model.deploy(
    initial_instance_count=<instance count>, instance_type=<instance type>
)
from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=<S3 location of the PyTorch model artifacts>,
    role=<AWS IAM role>,
    framework_version=<PyTorch version>
)
pytorch_model.deploy(
    initial_instance_count=<instance count>, instance_type=<instance type>
)
import sagemaker
from sagemaker.sparkml.model import SparkMLModel

sparkml_model = SparkMLModel(
    model_data=<S3 location of the Spark ML model artifacts>,
    role=<AWS IAM role>,
    sagemaker_session=sagemaker.Session(),
    name=<Model name>,
    env={"SAGEMAKER_SPARKML_SCHEMA": <schema_json>}
)
sparkml_model.deploy(
    initial_instance_count=<instance count>, instance_type=<instance type>
)
from sagemaker.xgboost.model import XGBoostModel

xgboost_model = XGBoostModel(
    model_data=<S3 location of the XGBoost model artifacts>,
    role=<AWS IAM role>,
    entry_point=<entry python script>,
    framework_version=<XGBoost version>
)
xgboost_model.deploy(
    instance_type=<instance type>,
    initial_instance_count=<instance count>
)
from sagemaker.sklearn.model import SKLearnModel

sklearn_model = SKLearnModel(
    model_data=<S3 location of the scikit-learn model artifacts>,
    role=<AWS IAM role>,
    entry_point=<entry python script>,
    framework_version=<scikit-learn version>
)
sklearn_model.deploy(instance_type=<instance type>, initial_instance_count=<instance count>)
from sagemaker.model import Model

custom_model = Model(
    image_uri=<custom model inference container image URI>,
    model_data=<S3 location of the ML model artifacts>,
    role=<AWS IAM role>
)
custom_model.deploy(instance_type=<instance type>, initial_instance_count=<instance count>)
SageMaker hosting provides an inference pipeline feature that allows you to create a linear sequence of containers (up to 15) to perform custom data processing before and after invoking a model for predictions. SageMaker hosting can support traffic splits between multiple versions of a model for A/B testing.
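The inference pipeline feature is exposed in the SageMaker Python SDK through the PipelineModel class. The following is a hedged sketch (the model name, endpoint name, and instance settings are placeholders; sparkml_model and xgboost_model are assumed to be models like those created previously) showing how two containers can be chained behind one endpoint:

from sagemaker import Session
from sagemaker.pipeline import PipelineModel

# Chain a data transformation model (for example, the Spark ML model)
# with a predictor model (for example, the XGBoost model); the containers
# are invoked in sequence for each inference request
pipeline_model = PipelineModel(
    name="<pipeline model name>",
    role="<AWS IAM role>",
    models=[sparkml_model, xgboost_model],
    sagemaker_session=Session(),
)
pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="<endpoint name>",
)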
SageMaker hosting can be provisioned using an AWS CloudFormation template. There is also support for the AWS CLI for scripting automation, and it can be integrated into custom applications via its API. The following are some code samples for different endpoint deployment automation methods:
Description: "Model hosting cloudformation template"
Resources:
Endpoint:
Type: "AWS::SageMaker::Endpoint"
Properties:
EndpointConfigName:
!GetAtt EndpointConfig.EndpointConfigName
EndpointConfig:
Type: "AWS::SageMaker::EndpointConfig"
Properties:
ProductionVariants:
- InitialInstanceCount: 1
InitialVariantWeight: 1.0
InstanceType: ml.t2.large
ModelName: !GetAtt Model.ModelName
VariantName: !GetAtt Model.ModelName
Model:
Type: "AWS::SageMaker::Model"
Properties:
PrimaryContainer:
Image: <container uri>
ExecutionRoleArn: !GetAtt ExecutionRole.Arn
...
aws sagemaker create-model --model-name <value> --execution-role-arn <value>
aws sagemaker create-endpoint-config --endpoint-config-name <value> --production-variants <value>
aws sagemaker create-endpoint --endpoint-name <value> --endpoint-config-name <value>
If the built-in inference engines do not meet your requirements, you can also bring your own Docker container to serve your ML models.
The SageMaker hosting service uses AWS IAM as the mechanism to control access to its control plane APIs (for example, an API for creating an endpoint) and data plane APIs (for example, an API for invoking a hosted model endpoint). If you need to support other authentication methods for the data plane API, such as OpenID Connect (OIDC), you can put a proxy service as the frontend to manage user authentication. A common pattern is to use AWS API Gateway to frontend the SageMaker API for custom authentication management, as well as other API management features such as metering and throttling management.
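As a hedged sketch of this proxy pattern (the endpoint name and content type are placeholder assumptions), the Lambda function behind API Gateway can simply forward the request payload to the SageMaker runtime API:

import boto3

# SageMaker runtime client for data plane (inference) calls
runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # API Gateway handles authentication (for example, via a custom or
    # OIDC authorizer) and passes the request body through to the endpoint
    response = runtime.invoke_endpoint(
        EndpointName="<endpoint name>",
        ContentType="text/csv",
        Body=event["body"],
    )
    return {
        "statusCode": 200,
        "body": response["Body"].read().decode("utf-8"),
    }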
SageMaker provides out-of-the-box monitoring and logging capabilities to assist with support operations. It monitors both system resource metrics (for example, CPU/GPU utilization) and model invocation metrics (for example, the number of invocations, model latencies, and failures). These monitoring metrics and any model processing logs are captured by AWS CloudWatch metrics and CloudWatch Logs.
Similar to the DevOps practice that has been widely adopted for the traditional software development and deployment process, the MLOps practice is intended to streamline the building and deployment of ML pipelines and improve collaboration between data scientists/ML engineers, data engineering, and the operations team. Specifically, an MLOps practice is intended to deliver the following main benefits across the end-to-end ML life cycle:
Now that we are familiar with the intended goals and benefits of the MLOps practice, let's look at the specific operational process and concrete technology architecture of MLOps on AWS.
One of the most important MLOps concepts is the automation pipeline, which executes a sequence of tasks, such as data processing, model training, and model deployment. This pipeline can be a linear sequence of steps or a more complex DAG with parallel execution for multiple tasks. An MLOps architecture also has several repositories for storing different assets and metadata as part of pipeline executions. The following diagram shows the core components and tasks involved in an MLOps operation:
A code repository in an MLOps architecture not only serves as a source code control mechanism for data scientists and engineers, but also as the triggering mechanism for kicking off different pipeline executions. For example, when a data scientist checks an updated training script into the code repository, a model training pipeline execution can be triggered.
A feature repository stores reusable ML features and can be the target of a data processing/feature engineering job. Where applicable, features from the feature repository can become part of the training datasets, and they can also be fetched at inference time to serve model inference requests.
A container repository stores the container images that are used for data processing tasks, model training jobs, and model inference engines. It is usually the target of the container building pipeline.
A model registry keeps an inventory of trained models, along with all the metadata associated with the model, such as its algorithm, hyperparameters, model metrics, and training dataset location. It also maintains the status of the model life cycle, such as its deployment approval status.
A pipeline repository maintains the definition of automation pipelines and the statuses of different pipeline job executions.
In an enterprise setting, a task ticket also needs to be created when different tasks, such as model deployment, are performed, so that these actions can be tracked in a common enterprise ticketing management system. To support audit requirements, the lineage of different pipeline tasks and their associated artifacts need to be tracked.
Another critical component of the MLOps architecture is monitoring. In general, you want to monitor items such as the pipeline's execution status, model training status, and model endpoint status. Model endpoint monitoring can also include system/resource performance monitoring, model statistical metrics monitoring, drift and outlier monitoring, and model explainability monitoring. Alerts can be triggered on certain execution statuses to invoke human or automation actions that are needed.
AWS provides multiple technology options for implementing an MLOps architecture. The following diagram shows where these technology services fit in an enterprise MLOps architecture:
As we mentioned earlier, the shared service environment hosts common tools for pipeline management and execution, as well as common repositories such as code repositories and model registries.
Here, we use AWS CodePipeline to orchestrate the overall CI/CD pipeline. AWS CodePipeline is a continuous delivery service that integrates natively with different code repositories, such as AWS CodeCommit and Bitbucket. It can source files from the code repository and make them available to downstream tasks, such as building containers using the AWS CodeBuild service or training models in the model training environment. You can create different pipelines to meet different needs, and a pipeline can be triggered on demand via an API or the CodePipeline management console, or by code changes in a code repository. In the preceding diagram, we can see four example pipelines:
A code repository is one of the most essential components in an MLOps environment. It is not only used by data scientists/ML engineers and other engineers to persist code artifacts, but it also serves as a triggering mechanism for a CI/CD pipeline. This means that when a data scientist/ML engineer commits a code change, it can automatically kick off a CI/CD pipeline. For example, if a data scientist makes a change to the model training script and wants to test the automated training pipeline in the development environment, they can commit the code to a development branch to kick off a model training pipeline in the dev environment. When the work is ready for production release, the data scientist can merge the code to a release branch to kick off the production release pipelines.
In this MLOps architecture, we use Amazon Elastic Container Registry (ECR) as the central container registry service. ECR is used to store containers for data processing, model training, and model inference. You can tag the container images to indicate different life cycle statuses, such as development or production.
The SageMaker model registry is used as the central model repository. The central model repository can reside in the shared service environment, so it can be accessed by different projects. All the models that go through the formal training and deployment cycles should be managed and tracked in the central model repository.
SageMaker Feature Store provides a common feature repository for reusable features to be used by different projects. It can reside in the shared services environment or be part of the data platform. Features are normally pre-calculated in a data management environment and sent to SageMaker Feature Store for offline model training in the model training environment, as well as online inferences by the different model hosting environments.
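As a small, hedged example of the online inference usage pattern (the feature group name and record identifier are placeholders), an application can fetch a feature record at prediction time through the Feature Store runtime API:

import boto3

# Runtime client for low-latency, online feature retrieval
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

record = featurestore_runtime.get_record(
    FeatureGroupName="<feature group name>",
    RecordIdentifierValueAsString="<record identifier, such as a customer ID>",
)

# Each feature is returned as a {'FeatureName': ..., 'ValueAsString': ...} pair
features = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}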
SageMaker Experiments is used to track experiments and trials. The metadata and artifacts that are generated by the different components in a pipeline execution can be tracked in SageMaker Experiments. For example, the processing step in a pipeline can contain metadata such as the locations of input data and processed data, while the model training step can contain metadata such as the algorithm and hyperparameters for training, model metrics, and the location of the model artifact. This metadata can be used to compare the different runs of model training, and they can also be used to establish model lineage.
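The following is a hedged sketch of this tracking flow using the sagemaker-experiments package (the experiment and trial names are placeholders, and tf_estimator is assumed to be an estimator like the one created earlier):

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Create an experiment to group related training runs
experiment = Experiment.create(
    experiment_name="<experiment name>",
    description="Training runs for the sentiment model",
)

# Each run gets its own trial under the experiment
trial = Trial.create(
    trial_name="<trial name>",
    experiment_name=experiment.experiment_name,
)

# Passing an experiment config to fit() associates the training job and
# its metadata (inputs, hyperparameters, metrics) with the trial
tf_estimator.fit(
    "<Training data location>",
    experiment_config={
        "ExperimentName": experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
)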
The ML platform presents some unique challenges in terms of monitoring. In addition to monitoring common software system-related metrics and statuses, such as infrastructure utilization and processing status, an ML platform also needs to monitor model- and data-specific metrics and performance. Also, unlike traditional system-level monitoring, which is fairly straightforward to interpret, the opaqueness of ML models makes them inherently harder to understand. Now, let's take a closer look at the three main areas of monitoring for an ML platform.
Model training monitoring provides visibility into the training progress and helps identify training bottlenecks and error conditions during the training process. It enables operational processes such as training job progress reporting and response, model training performance evaluation and response, training problem troubleshooting, data and model bias detection, and model interpretability. Specifically, we want to monitor the following key metrics and conditions during model training:
There are multiple native AWS services for building out a model training monitoring architecture on AWS. The following diagram shows an example architecture for building a monitoring solution for a SageMaker-based model training environment:
This architecture lets you monitor training and system metrics, capture and process logs and training events, and report on model training bias and explainability. It helps enable operational processes such as training progress and status reporting, model metric evaluation, system resource utilization reporting and response, training problem troubleshooting, bias detection, and model decision explainability.
During model training, SageMaker can emit model training metrics, such as training loss and accuracy, to AWS CloudWatch to help with model training evaluation. AWS CloudWatch is the AWS monitoring and observability service. It collects metrics and logs from other AWS services and provides dashboards for visualizing and analyzing them. System utilization metrics (such as CPU/GPU/memory utilization) are also reported to CloudWatch to help you understand any infrastructure constraints or under-utilization. CloudWatch alarms can be created on a single metric or composite metrics to automate notifications or responses. For example, you can create an alarm on low CPU/GPU utilization to help proactively identify a sub-optimal hardware configuration for a training job; when the alarm is triggered, it can send automated notifications (such as SMS and email) to the support team for review via Amazon Simple Notification Service (SNS).
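As a concrete, hedged sketch of such an alarm (the /aws/sagemaker/TrainingJobs namespace and Host dimension follow the convention SageMaker uses for training job system metrics; the job name, host suffix, and SNS topic ARN are placeholders):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the average GPU utilization of a training host stays low,
# which may indicate a sub-optimal hardware configuration
cloudwatch.put_metric_alarm(
    AlarmName="low-gpu-utilization-<job name>",
    Namespace="/aws/sagemaker/TrainingJobs",
    MetricName="GPUUtilization",
    Dimensions=[{"Name": "Host", "Value": "<job name>/algo-1"}],
    Statistic="Average",
    Period=300,              # evaluate in 5-minute windows
    EvaluationPeriods=3,     # alarm after 15 minutes below the threshold
    Threshold=30.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["<SNS topic ARN>"],  # notify the support team via SNS
)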
You can use CloudWatch Logs to collect, monitor, and analyze the logs emitted by your training jobs. These captured logs help you understand the progress of your training jobs and identify the errors and patterns needed to troubleshoot model training problems. For example, the logs might contain errors such as insufficient GPU memory for running model training, or permission issues when accessing specific resources. CloudWatch Logs provides a UI tool called CloudWatch Logs Insights for interactively analyzing logs using a purpose-built query language. Alternatively, these logs can be forwarded to an Elasticsearch cluster for analysis and querying, and they can be aggregated in a designated logging and monitoring account to centrally manage log access and analysis.
SageMaker training jobs also emit events, such as a training job status changing from running to completed. You can create automated notification and response mechanisms based on these events. For example, you can notify data scientists when a training job either completes successfully or fails, along with the failure reason. You can also automate responses to the different statuses, such as kicking off model retraining on a particular failure condition.
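A hedged sketch of wiring up such an event-driven notification follows (the rule name and SNS topic ARN are placeholders; the pattern matches the SageMaker Training Job State Change events that SageMaker emits to EventBridge):

import json
import boto3

events = boto3.client("events")

# Match SageMaker training jobs that transition to the Failed status
events.put_rule(
    Name="training-job-failed",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": ["Failed"]},
    }),
    State="ENABLED",
)

# Route matched events to an SNS topic that notifies the data scientists
events.put_targets(
    Rule="training-job-failed",
    Targets=[{"Id": "notify-data-scientists", "Arn": "<SNS topic ARN>"}],
)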
The SageMaker Clarify component can detect data and model bias and provide model explainability reporting on the trained model. You can access bias and model explainability reports inside the SageMaker Studio UI or SageMaker APIs.
The SageMaker Debugger component can detect model training issues such as non-converging conditions, resource utilization bottlenecks, overfitting, and vanishing gradients (where the gradients become too small for efficient parameter updates). Alerts can be sent when training anomalies are found.
Model endpoint monitoring provides visibility into the performance of the modeling serving infrastructure, as well as model-specific metrics such as data drift, model drift, and inference explainability. The following are some of the key metrics for model endpoint monitoring:
The model monitoring architecture relies on many of the same AWS services, including CloudWatch, EventBridge, and SNS. The following diagram shows an architecture pattern for a SageMaker-based model monitoring solution:
This architecture works similarly to the model training monitoring architecture. CloudWatch captures endpoint metrics, such as CPU/GPU utilization, model invocation metrics (the number of invocations and errors), and model latencies. These metrics help with operations such as hardware optimization and endpoint scaling.
CloudWatch Logs captures logs that are emitted by the model serving endpoint to help us understand the status and troubleshoot technical problems.
Similarly, endpoint events, such as the status changing from Creating to InService, can help you build automated notification pipelines to kick off corrective actions or provide status updates.
In addition to system and status-related monitoring, this architecture also supports data and model-specific monitoring through a combination of SageMaker Model Monitor and SageMaker Clarify. Specifically, SageMaker Model Monitor can help you monitor data drift and model quality.
For data drift, SageMaker Model Monitor can use the training dataset to create baseline statistics, such as the standard deviation, mean, max, min, and data distribution of the dataset features. It uses these metrics and other data characteristics, such as data types and completeness, to establish constraints. Then, it captures the input data in the production environment, calculates the same metrics, compares them with the baseline metrics/constraints, and reports drift from the baseline. Model Monitor can also report data quality issues, such as incorrect data types and missing values. Data drift metrics can be sent to CloudWatch metrics for visualization and analysis, and CloudWatch alarms can be configured to trigger a notification or automated response when a metric crosses a predefined threshold.
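The following is a hedged sketch of setting up data drift monitoring with the SageMaker Python SDK (the S3 locations, endpoint name, schedule name, and instance settings are placeholders):

from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="<AWS IAM role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Step 1: compute baseline statistics and constraints from the training data
monitor.suggest_baseline(
    baseline_dataset="s3://<bucket>/<prefix>/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<bucket>/<prefix>/baseline",
)

# Step 2: check the captured endpoint traffic against the baseline every hour
monitor.create_monitoring_schedule(
    monitor_schedule_name="<schedule name>",
    endpoint_input="<endpoint name>",
    output_s3_uri="s3://<bucket>/<prefix>/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)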
For model quality monitoring, Model Monitor creates baseline metrics (such as MAE for regression and accuracy for classification) using a baseline dataset that contains both predictions and true labels. Then, it captures the predictions in production, ingests the ground truth labels, and merges the ground truth with the predictions to calculate various regression and classification metrics, which it compares with the baseline metrics. As with data drift metrics, model quality metrics can be sent to CloudWatch metrics for analysis and visualization, and CloudWatch alarms can be configured for automated notifications and/or responses. The following diagram shows how SageMaker Model Monitor works:
For bias detection, SageMaker Clarify can monitor the bias metrics of deployed models continuously and raise alerts through CloudWatch when a metric crosses a threshold. We will cover bias detection in detail in Chapter 11, ML Governance, Bias, Explainability, and Privacy.
The ML pipeline's execution needs to be monitored for statuses and errors, so corrective actions can be taken as needed. During a pipeline execution, there are pipeline-level statuses/events as well as stage-level and action-level statuses/events. You can use these events and statuses to understand the progress of each pipeline and stage and get alerted when something is wrong. The following diagram shows how AWS CodePipeline, CodeBuild, and CodeCommit can work with CloudWatch, CloudWatch Logs, and EventBridge for general status monitoring and reporting, as well as problem troubleshooting:
CodeBuild can send metrics, such as SucceededBuilds, FailedBuilds, and Duration, to CloudWatch. These CodeBuild metrics can be accessed through both the CodeBuild console and the CloudWatch dashboard.
CodeBuild, CodeCommit, and CodePipeline can all emit events to EventBridge to report detailed status changes and trigger custom event processing, such as notifications, or log the events to another data repository for event archiving. All three services can send detailed logs to CloudWatch Logs to support operations such as troubleshooting or detailed error reporting.
Step Functions also reports a list of monitoring metrics to CloudWatch, such as execution metrics (execution failures, successes, aborts, and timeouts) and activity metrics (activities started, scheduled, and succeeded). You can view these metrics in the management console and set thresholds on them to set up alerts.
Another key component of enterprise-scale ML platform management is service provisioning management. For large-scale service provisioning and deployment, an automated and controlled process should be adopted. Here, we will focus on provisioning the ML platform itself, not provisioning AWS accounts and networking, which should be established as the base environment for ML platform provisioning in advance. For ML platform provisioning, there are the following two main provisioning tasks:
There are multiple technical approaches to automating service provisioning on AWS, such as using provisioning shell scripts, CloudFormation scripts, and AWS Service Catalog. With shell scripts, you can sequentially call the different AWS CLI commands in a script to provision different components, such as creating a SageMaker notebook. CloudFormation is the IaC service for infrastructure deployment on AWS. With CloudFormation, you create templates that describe the desired resources and dependencies that can be launched as a single stack. When the template is executed, all the resources and dependencies specified in the stack will be deployed automatically. The following code shows the template for deploying a SageMaker Studio domain:
Type: AWS::SageMaker::Domain
Properties:
  AppNetworkAccessType: String
  AuthMode: String
  DefaultUserSettings:
    UserSettings
  DomainName: String
  KmsKeyId: String
  SubnetIds:
    - String
  Tags:
    - Tag
  VpcId: String
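As a hedged sketch, a template like this can also be launched programmatically with boto3 (the stack name and template file name are placeholders):

import boto3

cloudformation = boto3.client("cloudformation")

# Read the template that declares the AWS::SageMaker::Domain resource
with open("studio-domain.yaml") as f:
    template_body = f.read()

cloudformation.create_stack(
    StackName="sagemaker-studio-domain",
    TemplateBody=template_body,
    Parameters=[
        # Supply values here if the template declares parameters, for example:
        # {"ParameterKey": "VpcId", "ParameterValue": "<vpc id>"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required if the stack creates IAM roles
)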
AWS Service Catalog allows you to create different IT products to be deployed on AWS. These IT products can include SageMaker notebooks, a CodeCommit repository, and CodePipeline workflow definitions. AWS Service Catalog uses CloudFormation templates to describe IT products. With Service Catalog, administrators create IT products from CloudFormation templates, organize these products into product portfolios, and entitle end users with access. The end users then launch the products from the Service Catalog product portfolio. The following diagram shows the flow of creating a Service Catalog product and launching the product from the Service Catalog service:
For large-scale and governed IT product management, Service Catalog is the recommended approach. Service Catalog supports multiple deployment options, including single AWS account deployments and hub-and-spoke cross-account deployments. A hub-and-spoke deployment allows you to centrally manage all the products and make them available in different accounts. In our enterprise ML reference architecture, we use the hub-and-spoke architecture to support the provisioning of data science environments and ML pipelines, as shown in the following diagram:
In the preceding architecture, we set up the central portfolio in the shared services account. All the products, such as creating new Studio domains, new Studio user profiles, CodePipeline definitions, and training pipeline definitions, are centrally managed in the central hub account. Some products are shared with the different data science accounts to create data science environments for data scientists and teams. Some other products are shared with model training accounts for standing up ML training pipelines.
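To illustrate the consumer side of this flow, the following hedged sketch launches a shared product programmatically (the product ID, provisioning artifact ID, and parameters are placeholders you would look up in the Service Catalog console or API):

import boto3

servicecatalog = boto3.client("servicecatalog")

# Launch a Service Catalog product, such as a data science environment,
# in a spoke account that the product has been shared with
response = servicecatalog.provision_product(
    ProductId="<product id>",
    ProvisioningArtifactId="<product version id>",
    ProvisionedProductName="<provisioned product name>",
    ProvisioningParameters=[
        # Parameters defined by the product's CloudFormation template, for example:
        # {"Key": "TeamName", "Value": "<team name>"},
    ],
)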
With that, we have talked about the core components of an enterprise-grade ML platform. Next, let's get hands-on and build a pipeline to automate model training and deployment.
In this hands-on exercise, you will build a simplified version of the enterprise MLOps pipeline. For simplicity, we will not use the multi-account architecture of the enterprise pattern; instead, we will build several core functions in a single AWS account. The following diagram shows what you will be building:
At a high level, you will create two pipelines using CloudFormation: one for model training and one for model deployment.
In this section, we will create two CloudFormation templates that do the following:
Now, let's get started with the CloudFormation template for the Step Functions workflow:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModel",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:CreateEndpoint",
                "sagemaker:StopTrainingJob",
                "sagemaker:CreateTrainingJob",
                "sagemaker:UpdateEndpoint",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:DeleteEndpoint"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:*"
            ]
        },
        ...
    ]
}
AWSTemplateFormatVersion: 2010-09-09
Description: 'AWS Step Functions sample project for training a model and saving the model'
Parameters:
  StepFunctionExecutionRoleArn:
    Type: String
    Description: Enter the role for Step Functions workflow execution
    ConstraintDescription: requires a valid ARN value
    AllowedPattern: 'arn:aws:iam::\d+:role/.*'
Resources:
  TrainingStateMachine2:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn: !Ref StepFunctionExecutionRoleArn
      DefinitionString: !Sub |
        {
          "StartAt": "SageMaker Training Step",
          "States": {
            "SageMaker Training Step": {
              "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
              ...
{
    "TrainingImage": "<aws hosting account>.dkr.ecr.<aws region>.amazonaws.com/pytorch-training:1.3.1-gpu-py3",
    "S3OutputPath": "s3://<your s3 bucket name>/sagemaker/pytorch-bert-financetext",
    "SageMakerRoleArn": "arn:aws:iam::<your aws account>:role/service-role/<your sagemaker execution role>",
    "S3UriTraining": "s3://<your AWS S3 bucket>/sagemaker/pytorch-bert-financetext/train.csv",
    "S3UriTesting": "s3://<your AWS S3 bucket>/sagemaker/pytorch-bert-financetext/test.csv",
    "InferenceImage": "<aws hosting account>.dkr.ecr.<aws region>.amazonaws.com/pytorch-inference:1.3.1-cpu-py3",
    "SAGEMAKER_PROGRAM": "train.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": "s3://<your AWS S3 bucket>/berttraining/source/sourcedir.tar.gz",
    "SAGEMAKER_REGION": "<your aws region>"
}
Now, we are ready to create the CloudFormation template for the CodePipeline training pipeline. This pipeline will listen for changes to a CodeCommit repository and invoke the Step Functions workflow we just created:
Parameters:
  BranchName:
    Description: CodeCommit branch name
    Type: String
    Default: master
  RepositoryName:
    Description: CodeCommit repository name
    Type: String
    Default: MLSA-repo
  ProjectName:
    Description: ML project name
    Type: String
    Default: FinanceSentiment
  MlOpsStepFunctionArn:
    Description: Step Function ARN
    Type: String
    Default: arn:aws:states:ca-central-1:300165273893:stateMachine:TrainingStateMachine2-89fJblFk0h7b
Resources:
  CodePipelineArtifactStoreBucket:
    Type: 'AWS::S3::Bucket'
    DeletionPolicy: Delete
  Pipeline:
    Type: 'AWS::CodePipeline::Pipeline'
    ...
We want to be able to kick off the CodePipeline execution when a change is made (such as a code commit) in the CodeCommit repository. To enable this, we need to create a CloudWatch event that monitors this change and kicks off the pipeline. Let's get started:
AmazonCloudWatchEventRole:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - events.amazonaws.com
          Action: 'sts:AssumeRole'
    Path: /
    Policies:
      - PolicyName: cwe-pipeline-execution
        PolicyDocument:
          ...
Congratulations! You have successfully used CloudFormation to build a CodePipeline-based ML training pipeline that runs automatically when there is a file change in the CodeCommit repository. Next, let's build the ML deployment pipeline for the model.
To start creating a deployment, perform the following steps:
Description: Basic hosting of a registered model
Parameters:
  ModelName:
    Description: Model Name
    Type: String
    Default: <model name>
Resources:
  Endpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointConfigName: !GetAtt EndpointConfig.EndpointConfigName
  EndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - InitialInstanceCount: 1
          InitialVariantWeight: 1.0
          InstanceType: ml.m4.xlarge
          ModelName: !Ref ModelName
          VariantName: !Ref ModelName
Outputs:
  EndpointId:
    Value: !Ref Endpoint
  EndpointName:
    Value: !GetAtt Endpoint.EndpointName
{
    "Parameters" : {
        "ModelName" : "<name of the financial sentiment model you have trained>"
    }
}
This CloudFormation template also creates supporting resources, including an S3 bucket for storing the CodePipeline artifacts, an IAM role for CodePipeline to run with, and another IAM role for CloudFormation to use to create the stack for mldeployment.yaml.
Parameters:
  BranchName:
    Description: CodeCommit branch name
    Type: String
    Default: master
  RepositoryName:
    Description: CodeCommit repository name
    Type: String
    Default: MLSA-repo
  ProjectName:
    Description: ML project name
    Type: String
    Default: FinanceSentiment
  CodePipelineSNSTopic:
    Description: SNS topic for NotificationArn
    Default: arn:aws:sns:ca-central-1:300165273893:CodePipelineSNSTopicApproval
    Type: String
  ProdStackConfig:
    Default: mldeploymentconfig.json
    Description: The configuration file name for the production model hosting stack
    Type: String
  ProdStackName:
    Default: FinanceSentimentMLStack1
    Description: A name for the production model hosting stack
    Type: String
  TemplateFileName:
    Default: mldeployment.yaml
    Description: The file name of the model hosting template
    Type: String
  ChangeSetName:
    Default: FinanceSentimentchangeset
    Description: A name for the production stack change set
    Type: String
Resources:
  CodePipelineArtifactStoreBucket:
    Type: 'AWS::S3::Bucket'
    DeletionPolicy: Delete
  Pipeline:
    ...
Congratulations! You have successfully created and run a CodePipeline deployment pipeline to deploy a model from the SageMaker model registry.
In this chapter, we discussed the key requirements for building an enterprise ML platform to meet needs such as end-to-end ML life cycle support, process automation, and the separation of different environments. We also talked about architecture patterns for building an enterprise ML platform using AWS services, and discussed the core capabilities of the different ML environments, including training, hosting, and shared services. You should now have a good understanding of what an enterprise ML platform can look like, as well as the key considerations for building one with AWS services. You have also developed some hands-on experience in building the components of the MLOps architecture and automating model training and deployment. In the next chapter, we will discuss advanced ML engineering, covering large-scale distributed training and the core concepts for achieving low-latency inference.