Use case – building and applying an auto-encoder model

For our use case, we are using the smartphone actigraphy data we examined previously. These data include movement measurements on a number of individuals while sitting, standing, lying down, walking, walking downstairs, and walking upstairs. Our goal is to identify anomalous observations, that is, values that are aberrant or otherwise unusual.

To start with, we will load the training and testing data into R and then convert it over to H2O for analysis:

## load h2o and start a local cluster (required before calling as.h2o)
library(h2o)
h2o.init()

## feature matrices for the training and testing sets
use.train.x <- read.table("UCI HAR Dataset/train/X_train.txt")
use.test.x <- read.table("UCI HAR Dataset/test/X_test.txt")

## activity codes for each case
use.train.y <- read.table("UCI HAR Dataset/train/y_train.txt")[[1]]
use.test.y <- read.table("UCI HAR Dataset/test/y_test.txt")[[1]]

## mapping from activity codes to activity names
use.labels <- read.table("UCI HAR Dataset/activity_labels.txt")

## copy the feature data into the H2O cluster
h2oactivity.train <- as.h2o(
  use.train.x,
  destination_frame = "h2oactivitytrain")

h2oactivity.test <- as.h2o(
  use.test.x,
  destination_frame = "h2oactivitytest")

With the data loaded, we are ready to train our model. The setup is fairly similar to the initial models we trained. Here we use two layers of 100 hidden neurons each. For the moment, no explicit regularization is used, although, given that there are significantly fewer hidden neurons than input variables, the model's simplicity may provide adequate regularization:

mu1 <- h2o.deeplearning(
  x = colnames(h2oactivity.train),
  training_frame = h2oactivity.train,
  validation_frame = h2oactivity.test,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = c(100, 100),
  epochs = 30,
  ## no sparsity penalty, dropout, or weight penalties for now
  sparsity_beta = 0,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(0, 0),
  l1 = 0,
  l2 = 0
)
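
Although it is not required for anomaly detection, we can also inspect the learned representation itself. The h2o.deepfeatures() function extracts hidden-layer activations from a trained model; a minimal sketch (choosing the second layer here is purely illustrative):

## extract the 100 deep features (second hidden layer activations)
## for the training data
features.train <- h2o.deepfeatures(mu1, h2oactivity.train, layer = 2)
dim(features.train)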

Examining the model's performance, we see that it has a very low reconstruction error, which suggests the model is sufficiently complex to capture the key features of the data. There is no substantial difference in performance between the training and validation data:

mu1
H2OAutoEncoderMetrics: deeplearning
** Reported on training data. **

Training Set Metrics:
=====================

MSE: (Extract with `h2o.mse`) 0.001

H2OAutoEncoderMetrics: deeplearning
** Reported on validation data. **

Validation Set Metrics:
=====================

MSE: (Extract with `h2o.mse`) 0.0011
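
If we want just the numeric error values rather than the full print-out, the output itself points to the extractor; a small example using h2o.mse():

## pull the training and validation MSE directly from the model object
h2o.mse(mu1, train = TRUE, valid = TRUE)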

We can extract how anomalous each case is and plot the distribution. The results are shown in Figure 4.7. Clearly, a few cases are far more anomalous than the rest, as shown by their much higher reconstruction errors:

library(ggplot2)

## per-case reconstruction error (MSE) on the training data
erroru1 <- as.data.frame(h2o.anomaly(mu1, h2oactivity.train))

## histogram of the errors, with a dashed line at the 99th percentile
pue1 <- ggplot(erroru1, aes(Reconstruction.MSE)) +
  geom_histogram(binwidth = .001, fill = "grey50") +
  geom_vline(xintercept = quantile(erroru1[[1]], probs = .99), linetype = 2) +
  theme_bw()
print(pue1)

Figure 4.7: Distribution of per-case reconstruction error; the dashed line marks the 99th percentile

One way to explore these anomalous cases further is to examine whether any of the activities tend to have more or fewer anomalous values. We can do this by finding which cases are anomalous, here defined as the top 1% of reconstruction errors, and then extracting the activities of those cases and plotting them. The results are shown in Figure 4.8. The vast majority of anomalous cases come from walking downstairs or lying down. With a high error in recreating the inputs, the deep features may be a (relatively) poor representation of the inputs for those cases. In practice, if we were classifying based on these results, we might want to exclude these cases, as they do not seem to fit the features the model has learned:

## flag cases in the top 1% of reconstruction error
i.anomalous <- erroru1$Reconstruction.MSE >= quantile(erroru1[[1]], probs = .99)

## look up the activity names of the anomalous cases, tabulate them,
## and plot the frequencies
pu.anomalous <- ggplot(as.data.frame(table(use.labels$V2[use.train.y[i.anomalous]])),
       aes(Var1, Freq)) +
  geom_bar(stat = "identity") +
  xlab("") + ylab("Frequency") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

print(pu.anomalous)

Figure 4.8: Frequency of activities among the anomalous cases

In this example, we used a deep auto-encoder model to learn the features of actigraphy data from smartphones. Such work can be useful for excluding unknown or unusual activities, rather than incorrectly classifying them. For example, in an app that classifies which activity you engaged in and for how many minutes, it may be better to simply leave out a few minutes where the model is uncertain or the hidden features do not adequately reconstruct the inputs, rather than mislabel an activity as walking or sitting when it was actually walking downstairs.
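
For instance, reusing the anomaly flags computed above, such an exclusion rule might look like the following minimal sketch (the top-1% cutoff is a judgment call, not part of the model):

## keep only the cases the auto-encoder reconstructs well
use.train.x.kept <- use.train.x[!i.anomalous, ]
use.train.y.kept <- use.train.y[!i.anomalous]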

Such work can also help to identify where the model tends to have more issues. Perhaps additional sensors and data are needed to represent walking downstairs, or more could be done to understand why that activity tends to produce relatively high reconstruction errors.

These deep auto-encoders are also useful in other contexts where identifying anomalies is important, such as with financial data or credit card usage patterns. Anomalous spending patterns may indicate fraud or that a credit card has been stolen. Rather than attempt to manually search through millions of credit card transactions, one could train an auto-encoder model and use it to identify anomalies for further investigation.
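
As a hypothetical sketch of that workflow, assuming a numeric H2O frame of transaction features named transactions.hex (the frame name, architecture, and 99.9th-percentile cutoff are all illustrative):

## train an auto-encoder on transaction features
m.fraud <- h2o.deeplearning(
  x = colnames(transactions.hex),
  training_frame = transactions.hex,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = c(50, 50),
  epochs = 10)

## flag the most poorly reconstructed transactions for manual review
err.fraud <- as.data.frame(h2o.anomaly(m.fraud, transactions.hex))
i.flagged <- err.fraud$Reconstruction.MSE >=
  quantile(err.fraud[[1]], probs = .999)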
