In the previous chapter, you learned about anomaly detection problems and some ways to tackle them. You also had an overview of Amazon Lookout for Equipment, a managed AI/ML service designed to solve anomaly detection problems on multivariate, industrial time series data.
The goal of this chapter is to teach you how to create and organize multivariate datasets, how to create a JSON schema to prepare for dataset ingestion, and how to trigger a data ingestion job pointing to the S3 bucket where your raw data is stored.
In addition, you will gain a high-level understanding of all the heavy lifting the service performs on your behalf to save you as much data preparation effort as possible (imputation, time series alignment, and resampling). You will also understand what kinds of errors can be raised by the service and how to work around them.
In this chapter, we're going to cover the following main topics:
No hands-on experience in a language such as Python or R is necessary to follow along with the content of this chapter. However, we highly recommend that you read this chapter while connected to your own AWS account and open the Amazon Lookout for Equipment console to run the different actions on your end.
To create an AWS account and log in to the Amazon Lookout for Equipment console, you can refer to the technical requirements of Chapter 2, An Overview of Amazon Forecast.
In the companion GitHub repository of this book, you will find a notebook detailing the steps to prepare the dataset we are going to use from now on. This preparation is optional for following along with this chapter. On your first reading, I recommend that you download the prepared dataset from the following link:
From there, you can log in to the AWS console and follow along with this chapter without writing a single line of code.
On a later reading, feel free to go through the preparation code to understand how to prepare a dataset ready to be consumed by Amazon Lookout for Equipment. You will find a notebook with all this preparation in the companion GitHub repository of this book:
This notebook will help you understand the format expected to build a successful model.
Before you can train an anomaly detection model, you need to prepare a multivariate time series dataset. In this section, you will learn how to prepare such a dataset and how to allow Amazon Lookout for Equipment to access it.
The dataset we are going to use is a cleaned-up version of the one that can be found on Kaggle here:
https://www.kaggle.com/nphantawee/pump-sensor-data/version/1
This dataset contains known time ranges when a pump is broken and when it is operating under nominal conditions. To adapt this dataset so that it can be fed to Amazon Lookout for Equipment, perform the following steps:
No other specific preprocessing was necessary, and you will now upload this prepared dataset to a location where Amazon Lookout for Equipment can access it.
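As an illustration only, here is a minimal pandas sketch of what this preparation can look like, assuming the original Kaggle file (sensor.csv, with a timestamp column, the sensor columns, and a machine_status column) and the per-sensor layout described later in this chapter; the companion notebook remains the reference for the exact steps:

import pandas as pd
from pathlib import Path

# Load the raw Kaggle file and drop the non-sensor columns:
df = pd.read_csv('sensor.csv', index_col='timestamp', parse_dates=True)
df = df.drop(columns=['Unnamed: 0', 'machine_status'], errors='ignore')

# Write one folder (component) per sensor, each holding a single CSV
# file with a timestamp column and that sensor's values:
for sensor in df.columns:
    folder = Path('train-data') / sensor
    folder.mkdir(parents=True, exist_ok=True)
    df[[sensor]].dropna().to_csv(folder / f'{sensor}.csv', index_label='Timestamp')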
You can download the archive I prepared directly from the following location:
Download this archive and unzip it. You should have the following:
In the next section, you are going to create an Amazon S3 bucket, upload your dataset there, and give Amazon Lookout for Equipment permission to access this data.
Equipped with our prepared datasets, let's create a bucket on Amazon S3 and upload our data there:
Important Note
At the time of writing this book, Amazon Lookout for Equipment is only available in the following regions: US East (N. Virginia), Asia Pacific (Seoul), and Europe (Ireland). Make sure you select one of these regions to create your bucket or you won't be able to ingest your data into Amazon Lookout for Equipment.
Your Amazon S3 bucket is now created and you can start uploading your files to this location.
To upload your dataset, complete the following steps:
This page lists all the objects available in this bucket: it is empty for now.
Once your upload is complete, you should see two folders in the objects list, one for the label data and the other (train-data) containing the time series sensor data you are going to use for training a model.
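If you prefer scripting the upload rather than clicking through the console, a minimal boto3 sketch such as the following would achieve the same result (the bucket name is an example, and the local folders are assumed to match the archive you unzipped earlier):

import boto3
from pathlib import Path

s3 = boto3.client('s3')
bucket = 'timeseries-on-aws-lookout-equipment-<your-suffix>'  # example name

# Upload every CSV file from the two local folders, preserving their layout:
for folder in ('train-data', 'label-data'):
    for local_file in Path(folder).rglob('*.csv'):
        s3.upload_file(str(local_file), bucket, local_file.as_posix())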
s3://BUCKET_NAME/FOLDER_NAME/
For instance, my train-data folder S3 URI is the following:
s3://timeseries-on-aws-lookout-equipment-michaelhoarau/train-data/
We will need the train-data link at ingestion time and the label-data link when training a new model. Let's now give access to this S3 bucket to Amazon Lookout for Equipment.
By default, the security mechanisms enforced between different AWS services prevent any service other than Amazon S3 from accessing your data. From your own account, you can upload, delete, or move data in the bucket you just created. Amazon Lookout for Equipment, however, is a different service and will not be able to access this data until we explicitly state that it can access any data in this bucket.
You can configure this access directly from the Amazon Lookout for Equipment console during the ingestion step. However, if you want to have more control over the roles and the different accesses created within your account, you can read through this section. Otherwise, feel free to skip it and come back here later.
To give the Amazon Lookout for Equipment service access to your S3 bucket, we are going to use the AWS Identity and Access Management (IAM) service to create a dedicated IAM role:
Note
Not all AWS services appear in these ready-to-use use cases, and this is why we are using Amazon SageMaker (another AWS Managed Service). In the next steps, we will adjust the role created to configure it specifically for Amazon Lookout for Equipment.
You are brought back to the list of roles available in your account. In the search bar, search for the role you just created and choose it from the returned result.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<<YOUR-BUCKET>>/*",
                "arn:aws:s3:::<<YOUR-BUCKET>>"
            ]
        }
    ]
}
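Next, you need to update the role's trust relationship so that the Amazon Lookout for Equipment service is allowed to assume this role. The trust policy should look like the following: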
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "lookoutequipment.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
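If you would rather script this role creation than click through the console, here is a minimal boto3 sketch using the two policies shown previously (the role and policy names are examples):

import json
import boto3

iam = boto3.client('iam')

# Trust policy allowing Lookout for Equipment to assume the role:
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lookoutequipment.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
role = iam.create_role(
    RoleName='LookoutEquipmentS3AccessRole',        # example name
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Inline policy granting access to your bucket:
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
        "Resource": [
            "arn:aws:s3:::<<YOUR-BUCKET>>/*",
            "arn:aws:s3:::<<YOUR-BUCKET>>"
        ]
    }]
}
iam.put_role_policy(
    RoleName='LookoutEquipmentS3AccessRole',
    PolicyName='LookoutEquipmentS3Access',          # example name
    PolicyDocument=json.dumps(s3_policy)
)
print(role['Role']['Arn'])  # keep this ARN for the ingestion step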
When Amazon Lookout for Equipment tries to read the datasets you just uploaded in S3, it will request permissions from IAM by using the role we just created:
You are now ready to ingest your dataset into Amazon Lookout for Equipment.
As mentioned in Chapter 8, An Overview of Amazon Lookout for Equipment, a dataset is a convenient way to organize your time series data stored as CSV files. These files are stored in an Amazon S3 bucket and organized in different folders.
Each dataset is a container holding one or several components that group tags together. In S3, each component is materialized by a folder. You can use datasets and components to organize your sensor data depending on how your industrial pieces of equipment are themselves organized.
For instance, you can use the dataset level to store all the tags from a factory and then each component to group all the tags relative to a given production line (across multiple pieces of equipment) or a given piece of equipment.
In this configuration, each component contains several sensors in the same CSV file. Depending on how your sensor data is generated, you may have to align all the timestamps so that you can join all your sensor data in a single CSV file sharing a common timestamp column.
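For instance, here is a minimal pandas sketch of such an alignment, assuming two per-sensor files and a 1-minute sampling grid (the file names and sampling rate are examples):

import pandas as pd

# Load two sensors sampled at possibly different timestamps:
sensor_a = pd.read_csv('sensor_00.csv', index_col='Timestamp', parse_dates=True)
sensor_b = pd.read_csv('sensor_01.csv', index_col='Timestamp', parse_dates=True)

# Resample both onto a common 1-minute grid and join them on the shared index:
aligned = (
    sensor_a.resample('1min').mean()
            .join(sensor_b.resample('1min').mean(), how='outer')
)
aligned.to_csv('component_01.csv', index_label='Timestamp')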
An alternative is to match each piece of equipment with a dataset and then store each sensor in its own component.
In this configuration, the timestamps for each sensor do not need to be aligned. This timestamp alignment will be dealt with on your behalf at ingestion time by Amazon Lookout for Equipment.
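With this per-sensor layout, the pump dataset used in this chapter would be organized in S3 along the following lines (the sensor names are those of the Kaggle dataset):

s3://<<YOUR_BUCKET>>/train-data/
    sensor_00/sensor_00.csv
    sensor_01/sensor_01.csv
    …
    sensor_51/sensor_51.csv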
For the industrial pump dataset, throughout this tutorial, you will use the second format described previously, to remove the need for any preprocessing. To create your first anomaly detection project with Amazon Lookout for Equipment, complete the following steps:
The dataset is created and you are brought to the dataset dashboard. Before you actually ingest your data, let's take a short pause to look at the dataset schema.
To create a dataset in Amazon Lookout for Equipment, you need to describe the list of sensors it contains. You will do this by writing a data schema in JSON format. In this section, you will learn how to write one manually and how to generate one automatically.
The data schema to describe an Amazon Lookout for Equipment dataset is a JSON string that takes the following format:
{
    "Components": [
        {
            "ComponentName": "string",
            "Columns": [
                { "Name": "string", "Type": "DOUBLE" | "DATETIME" },
                …
                { "Name": "string", "Type": "DOUBLE" | "DATETIME" }
            ]
        },
        …
    ]
}
The root item (Components) of this JSON string lists all the components of the dataset. Then, each component is a dictionary containing the following:
Writing such a schema from scratch can be hard to get right, especially for datasets that contain several dozen, if not hundreds, of signals shared between multiple components. In the next section, I will show you how to programmatically generate a data schema file.
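For instance, with the per-sensor layout used in this chapter, the schema of the pump dataset would start as follows (assuming the timestamp column of each CSV file is named Timestamp):

{
    "Components": [
        {
            "ComponentName": "sensor_00",
            "Columns": [
                { "Name": "Timestamp", "Type": "DATETIME" },
                { "Name": "sensor_00", "Type": "DOUBLE" }
            ]
        },
        {
            "ComponentName": "sensor_01",
            "Columns": [
                { "Name": "Timestamp", "Type": "DATETIME" },
                { "Name": "sensor_01", "Type": "DOUBLE" }
            ]
        },
        …
    ]
}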
Building a schema from scratch manually is a highly error-prone activity. In this section, you are going to use the CloudShell service to run a script that will build this schema automatically from an S3 path you provide it. Let's go through the following instructions:
python3 -m pip install --quiet s3fs pandas
wget https://raw.githubusercontent.com/PacktPublishing/Time-Series-Analysis-on-AWS/main/Chapter09/create_schema.py
python3 create_schema.py s3://<<YOUR_BUCKET>>/ train-data/
After a few seconds, this script will output a long JSON string starting with {"Components":…. You can copy this string and use it whenever you need to create a dataset based on these time series.
Note
After running the first line from the preceding code, you might see some error messages about pip dependencies. You can ignore them and run the dataset generation script.
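For reference, here is a minimal sketch of what such a schema-generation script can look like; the actual create_schema.py from the repository may differ in its details. This sketch assumes each component folder contains a single CSV file and that the timestamp column is named Timestamp:

import json
import sys

import pandas as pd
import s3fs

# Usage: python3 create_schema.py s3://<<YOUR_BUCKET>>/ train-data/
root = sys.argv[1].replace('s3://', '').rstrip('/') + '/' + sys.argv[2].strip('/')
fs = s3fs.S3FileSystem()

components = []
for folder in sorted(fs.ls(root)):
    component_name = folder.split('/')[-1]
    csv_file = sorted(fs.ls(folder))[0]
    # Read a few rows only, to discover the column names of this component:
    columns = pd.read_csv(f's3://{csv_file}', nrows=5).columns
    components.append({
        'ComponentName': component_name,
        'Columns': [
            {'Name': c, 'Type': 'DATETIME' if c == 'Timestamp' else 'DOUBLE'}
            for c in columns
        ]
    })

print(json.dumps({'Components': components}))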
You have seen how you can create a Lookout for Equipment dataset and how you can generate a schema with a script. Let's now ingest actual data into our dataset.
At the end of dataset creation, you are automatically brought to the dataset dashboard.
From here, you can click the Ingest data button in the Step 2 column and start configuring the data ingestion job.
When configuring the ingestion, the job details you must define are as follows:
Once you're done, you can click on Ingest to start the ingestion process. The process starts and can last around 5 to 6 minutes, depending on the amount of data to be ingested. During the ingestion process, Amazon Lookout for Equipment checks your data, fills some missing values on your behalf, aligns the timestamps, and prepares the time series sequences so that they are ready to be used at training time by the multiple algorithms the service can run.
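If you prefer to trigger and monitor the ingestion programmatically, here is a minimal boto3 sketch; the dataset name, bucket, role ARN, and client token are examples and assume the resources created earlier in this chapter:

import time
import boto3

l4e = boto3.client('lookoutequipment')

# Start an ingestion job pointing to the train-data folder in S3:
response = l4e.start_data_ingestion_job(
    DatasetName='pump-dataset',
    IngestionInputConfiguration={
        'S3InputConfiguration': {
            'Bucket': 'timeseries-on-aws-lookout-equipment-<your-suffix>',
            'Prefix': 'train-data/'
        }
    },
    RoleArn='arn:aws:iam::<<ACCOUNT_ID>>:role/LookoutEquipmentS3AccessRole',
    ClientToken='pump-ingestion-001'
)

# Poll the job status until it succeeds or fails:
job_id = response['JobId']
while True:
    status = l4e.describe_data_ingestion_job(JobId=job_id)['Status']
    print(status)
    if status in ('SUCCESS', 'FAILED'):
        break
    time.sleep(30)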
When ingesting data, Amazon Lookout for Equipment performs several checks that can result in a failed ingestion. When this happens, you can go back to your dataset dashboard and click on the View data source button. You will then access a list of all the ingestion jobs you performed.
This screen lists all the ingestion jobs you performed. When an ingestion fails, the Success icon is replaced by a Failed one, and you can hover your mouse over it to read about what happened. There are three main sources of errors linked to your dataset. Two of them can happen at ingestion time, while the last one will only be triggered when you train a model:
Let's dive into each of these issues.
The S3 location you used when you configured your ingestion job may not be the right one. Remember that your dataset can contain several components. The S3 URI you should mention is the root folder that contains the folders materializing these components. This is a common mistake when you have only one component.
When you see this error, you just need to relaunch the ingestion process and set the correct S3 location before attempting the ingestion again.
When you ingest your data, Amazon Lookout for Equipment compares the data structure available in S3 against the data schema you used when configuring your dataset. If a component (an entire folder) is missing, if a tag is missing from a given component, or if a tag is misplaced and positioned in the wrong component file, you will get this error.
Double-check the file structure in your S3 bucket and make sure it matches the schema used at dataset creation time: each folder must match a component name (which is case-sensitive), and each CSV file must contain exactly the sensors listed under that component in the dataset schema.
If your file structure is correct, you may have to do the following:
If you did not generate the dataset programmatically, you may have had no opportunity to look into your time series data. Some signals might be missing a significant portion of their data points or be empty altogether (an unused or malfunctioning sensor, for instance). When you train a model with Amazon Lookout for Equipment, you must choose a training start date and end date. Lookout for Equipment will check whether every sensor has at least one data point present in this time period and will forward fill any missing values.
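Before launching a training, you can audit your signals for missing data over the intended training range with a quick pandas check; this sketch uses the original combined Kaggle file, and the dates are examples:

import pandas as pd

# Fraction of missing values per signal over the intended training range:
df = pd.read_csv('sensor.csv', index_col='timestamp', parse_dates=True)
training_slice = df.loc['2018-04-01':'2018-08-31']
missing_ratio = training_slice.isna().mean().sort_values(ascending=False)

print(missing_ratio.head(10))                               # most incomplete signals
print(missing_ratio[missing_ratio == 1.0].index.tolist())   # completely empty signals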
If a time series is completely empty (100% missing values) in the selected training range, the model will fail to train. When this happens, you have the following possibilities:
Now that you understand the key issues that may happen with your dataset, it is time to conclude this chapter and move to the fun part – training and evaluating a model!
In this chapter, you learned how you should organize your dataset before you can train an anomaly detection model with Amazon Lookout for Equipment. You also discovered the dataset we are going to use in the next few chapters and you performed the first tasks necessary to train a model, namely, creating a dataset and ingesting your time series data in it.
With the help of this chapter, you should now have an understanding of what can go wrong at this stage with a few pointers to the main errors that can happen at ingestion time.
In the next chapter, we are going to train and evaluate our first anomaly detection model.