Chapter 9. Unsupervised Learning

Take a look at Figure 9-1, and tell me what you see.

Figure 9-1. Two scatterplots

It doesn’t matter what the x- and y-axes are in those two plots; imagine each plot represents a scientific domain. For the chart on the left you likely identified about 10 different concepts, so that domain is likely to have about 10 specialist words; some are blurring into each other, so maybe 9, maybe 11 or 12? But the domain on the right has two concepts, and is only going to have two specialist words. Or they might be describing weather conditions over a year. The chart on the left might be describing a place with lots of distinct weather patterns, so the weather becomes a talking point, and lots of weather phrases enter the vocabulary (“Is it chucking it down outside?” “No, just a light drizzle”). The chart on the right might represent the climate of Southern California, where only two weather phrases are needed (“lovely and sunny,” and “slightly cloudy”).

The point is, you didn’t need to know the subject or the “correct answer” to be able to do something useful with the data you were given, and this is a core strength of human intelligence. In machine learning, it is called unsupervised learning, and this chapter will look at some of the functionality H2O has for it.

This automatic organization of the data can be thought of as a form of data compression. If you have 5,000 input columns that are quite sparse and contain lots of duplication, you might use the techniques in this chapter to reduce them to a more manageable, and information-dense, 12 columns; then you can use one of the supervised learning techniques on those 12 columns.

K-Means Clustering

The idea behind k-means is to divide your data up into k groups (you have to specify k) such that each data item is closer to the center of its cluster than to the center of any other cluster. It is doing what I asked you to do in those scatterplots earlier.

For this section I am going to use a Natural Language Processing (NLP) example. But as space is limited, and I want to keep this chapter focused, I am going to take the tf-idf (more on that in a moment) data that someone else has made, and direct you to their article and GitHub site; see the following sidebar.

The csv file has 564 columns: the first one is the name of the movie, the other 563 are the terms that were extracted. I want to use those 563 values to divide the movies up into 5 clusters; if we get lucky the set of movies in each cluster will be similar to each other. Example 9-1 shows how to do that in R. Most of it should be familiar by now; tapply is an R function to group the values in one column (movie names) by another column (each movie’s k-means group), then apply the given function (print) to each group.

Example 9-1. k-means example, in R
library(h2o)
h2o.init(nthreads = -1)

tfidf <- h2o.importFile("./datasets/movie.tfidf.csv")

m <- h2o.kmeans(tfidf, x = 2:564, k = 5,
  standardize = FALSE, init = "PlusPlus")

p <- h2o.predict(m, tfidf)

tapply(as.vector(tfidf[,1]), as.vector(p$predict), print)

Example 9-2 shows how to do the same in Python; see the inline comments.

Example 9-2. k-means example, in Python
import h2o
h2o.init()

tfidf = h2o.import_file("./datasets/movie.tfidf.csv")

from h2o.estimators.kmeans import H2OKMeansEstimator
m = H2OKMeansEstimator(k=5, standardize=False, init="PlusPlus")
m.train(x=list(range(1, 564)), training_frame=tfidf)

#Get the group that each movie is in
p = m.predict(tfidf)

#Join that to our movie names, then download it
d = tfidf[0].cbind(p).as_data_frame()
d.columns = ["movie","group"]

#Iterate through and print each group
for ix, g in d.groupby("group"):
    print "---",ix,"---"
    print ', '.join(g["movie"])

I set a couple of optional parameters. First init="PlusPlus",2 which I felt gave better results than the default of “Furthest,” or the other alternative, “Random.” (You can also specify your own initialization values.) I also set standardize to false, because the data is already nicely between 0.0 and 1.0. The m object tells you quite a lot of information, including how many items are in each cluster, but if you want to find out which item is in which cluster you have to ask it to predict them!
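
If you just want a quick numeric overview before printing the full lists, a couple of one-liners help; this is a sketch using the m and p objects from Example 9-1:

h2o.centers(m)        #The 5 cluster centers, one column per term
h2o.table(p$predict)  #How many movies fell into each cluster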

It runs quickly, a matter of seconds. Here is the first of the five groups:3

Schindler's List           One Flew Over Cuckoo Nest  Gone with the Wind
The Wizard of Oz           Lawrence of Arabia         Forrest Gump
E.T. the Extra-Terrestrial LOTR: Return of the King   Gladiator
Saving Private Ryan        Raiders of the Lost Ark    Streetcar Named Desire
Best Years of Our Lives    My Fair Lady               Ben-Hur
Doctor Zhivago             Platoon                    The Pianist
The Exorcist               The Deer Hunter            All Quiet on Western Front
Mr. Smith Goes Washington  Terms of Endearment        The Grapes of Wrath
Shane                      The Green Mile             Close Encounters 3rd Kind
The Graduate               Stagecoach                 A Clockwork Orange
Wuthering Heights

Here is the second group:

Raging Bull           Citizen Kane          Singin' in the Rain
12 Angry Men          Amadeus               Gandhi
Rocky                 To Kill a Mockingbird Braveheart
Dances with Wolves    City Lights           Good Will Hunting
Network

And the third group:

It's a Wonderful Life Philadelphia Story    American in Paris
Patton                The King's Speech     A Place in the Sun
Out of Africa         Tootsie               Giant
Nashville             Yankee Doodle Dandy

The fourth group has both Godfather movies in it; most of the Westerns seem to be in here too:

The Godfather            The Shawshank Redemption Casablanca
Titanic                  The Godfather: Part II   Psycho
Sunset Blvd.             Vertigo                  On the Waterfront
West Side Story          The Silence of the Lambs Chinatown
Some Like It Hot         Unforgiven               Good, Bad and Ugly
Butch Cassidy & Sundance Treasure of Sierra Madre The Apartment
High Noon                Goodfellas               The French Connection
It Happened One Night    Midnight Cowboy          Rain Man
Annie Hall               Fargo                    American Graffiti
Pulp Fiction             The Maltese Falcon       Taxi Driver
Double Indemnity         Rebel Without Cause      Rear Window
The Third Man            North by Northwest

And the fifth has some scary ones like Jaws and The Sound of Music:

The Sound of Music       Star Wars                2001: A Space Odyssey
Bridge on the River Kwai Dr. Strangelove          Apocalypse Now
From Here to Eternity    Jaws                     The African Queen
Mutiny on the Bounty

I’m sure you can see how this can form the basis of a “People who enjoyed Saving Private Ryan also enjoyed Platoon” movie recommendation system, but the fact that the five groups change so much from run to run makes me cautious. It might just be that the input data needs more work.4

Deep Learning Auto-Encoder

We previously had a whole chapter (Chapter 8) on using h2o.deeplearning() for supervised learning, but now we will look at it for unsupervised learning. You switch it into this mode by setting autoencoder to true. The other difference is that you do not set the y argument (i.e., the response column you would want to learn in supervised learning).

It is a bit of a trick: it still does supervised learning, but it copies your input layer to be the output layer (aka “the answer”). In other words, it tries to learn the inputs. That might sound a bit pointless, but what is happening is that the hidden layers are being forced to summarize the data, to compress it. All the tuning knowledge we learned in the earlier chapter can be applied.

As an example of its use, I am going to take the same NLP data set that was used in “K-Means Clustering”, where we have 100 movies, and 563 terms, and see if we can reduce those 563 dimensions to just two dimensions. That is quite an ask, but I have chosen two because then I can plot the results.

The nice thing about this data set is there are only 100 rows, so the experiments can be quite quick. Example 9-3 is almost the simplest possible auto-encoder: there will be 563 input neurons, going down to just two neurons in a single hidden layer, then going to 563 output neurons. It will train for the default 10 epochs.

Example 9-3. Minimal auto-encoder example, in R
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = T, activation = "Tanh"
  )
f <- h2o.deepfeatures(m, tfidf, layer = 1)

I said almost: I’ve added activation = "Tanh" instead of using the default Rectifier. You can use the default Rectifier, but:

  • It gives poor results: x or y will be zero for many of them, giving a clustering along the bottom and left.

  • In more complex auto-encoders you will get complaints of numerical instability.

So, with auto-encoders I recommend always using Tanh.5

The code in Python (Example 9-4) is quite similar, though note that layers (and column indices) count from zero, whereas in R they counted from one.

Example 9-4. Minimal auto-encoder example, in Python
m = h2o.estimators.deeplearning.H2OAutoEncoderEstimator(
  hidden=[2],
  activation="Tanh"
  )
m.train(x=list(range(1, 564)), training_frame=tfidf)
f = m.deepfeatures(tfidf, layer=0)

Despite the simplicity, it gives acceptable results, with an MSE of 0.035. Figure 9-2 shows the movies plotted by the value of the two reduced dimensions.

Figure 9-2. Movies in two dimensions, made by the simplest auto-encoder.
Note

In this series of plots only the first 30 movies are plotted each time, to stop them looking too cluttered. You will see overlapping names. These are not printing errors! It is where the algorithm has not managed to separate those two movies in the meager two dimensions it has been given.

By the way, I am plotting the first 30 with R code like this:

d <- as.matrix(f[1:30,])
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17)  #Triangle
text(d, labels = labels, pos = 3) #pos=3 means above; labels passed by name so it is not taken as y

Next I made 19 models, experimenting with the amount of reduction: between 2 and 20 dimensions. For each model, hidden was set to c(128, 64, nodes, 64, 128), where nodes ranged from 2 to 20. So 563 input nodes link to the first hidden layer of 128 nodes, then to 64 nodes in the second hidden layer, then the 2 to 20 nodes of interest form the third hidden layer, then back up to 64 nodes, then up to 128 in the fifth and final hidden layer, and finally out to 563 output nodes. I first tried this with 5 epochs (the upper line in Figure 9-3), and then again with 400 epochs (the mostly lower line); a sketch of the loop follows.
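
Here is a hedged sketch, in R, of how such a loop might look, reusing tfidf from Example 9-3; using h2o.mse() to pull out the reconstruction error is my assumption about how the quality in Figure 9-3 was measured:

#For each middle-layer size, build a 5-hidden-layer auto-encoder and record its error
results <- sapply(2:20, function(nodes){
  m <- h2o.deeplearning(
    2:564, training_frame = tfidf,
    hidden = c(128, 64, nodes, 64, 128),
    autoencoder = TRUE, activation = "Tanh",
    epochs = 400  #Repeat with epochs = 5 for the upper line
    )
  h2o.mse(m)
  })
plot(2:20, results, type = "b",
  xlab = "Middle-layer nodes", ylab = "MSE")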

Figure 9-3. Model quality (lower is better) by number of dimensions

The more dimensions in the middle hidden layer, the easier it is to learn; but the steeper slope of the lower curve shows that it needs plenty of epochs to take advantage of those extra dimensions. It is curious that the first four values did worse with more epochs.

Stacked Auto-Encoder

A neural net model can be built up in stages, which is a useful technique, even with supervised learning, when you find you want more layers but learning is too slow or just not working.

Note

Terminology alert: You might also see stacking models used to mean ensembles (described in Chapter 10). But with auto-encoding neural networks it means learning one layer at a time.

Staying with the movie NLP data set:

m1 = h2o.estimators.deeplearning.H2OAutoEncoderEstimator(
  hidden = [128,64,11,64,128], activation = "Tanh", epochs = 400
  )
m1.train(x = list(range(1, 564)), training_frame = tfidf)
f1 = m1.deepfeatures(tfidf, layer = 2)

m2 = h2o.estimators.deeplearning.H2OAutoEncoderEstimator(
  hidden = [2], activation = "Tanh", epochs = 400
  )
m2.train(x = list(range(0, 11)), training_frame = f1)
f2 = m2.deepfeatures(f1, layer = 0)

What is happening is that a model m1 with 5 hidden layers, and 11 neurons in the middle layer, is trained on the raw data, tfidf.6 Then that third hidden layer is extracted into f1. f1 is a transformation of tfidf, still with 100 rows, but now only 11 columns. m2 then uses f1 as its input, and it builds a much simpler model, reducing 11 input nodes to 2 hidden nodes then back out to 11 output nodes. At the end the results are put in f2. f2 has 2 columns, but still has 100 rows, one for each movie.
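
If you prefer R, here is a hedged sketch of the same two-stage process; remember that h2o.deepfeatures() counts layers from 1 in R, so the 11-node middle layer is layer = 3:

m1 <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(128, 64, 11, 64, 128),
  autoencoder = TRUE, activation = "Tanh", epochs = 400
  )
f1 <- h2o.deepfeatures(m1, tfidf, layer = 3)  #100 rows, 11 columns

m2 <- h2o.deeplearning(
  1:11, training_frame = f1,
  hidden = c(2), autoencoder = TRUE, activation = "Tanh", epochs = 400
  )
f2 <- h2o.deepfeatures(m2, f1, layer = 1)  #100 rows, 2 columns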

I can’t objectively say if Figure 9-4 looks any better: this is unsupervised learning, after all! But the key point here is that f1 could have been used as the training frame into any algorithm, whether another auto-encoder, as here, or a supervised learning algorithm such as random forest.7

Tip

A common example you might see is to auto-encode the MNIST data set, into two columns, which you then plot to show the digits (normally color-coded). If you have got it right you should see each digit clustering together nicely. It is also interesting to see the different clustering you get when using PCA (see “Principal Component Analysis”) to reduce the same data to two dimensions.

You might then train on just those two columns. Or add them to the data, just like the enhanced data that was added before (see “Helping the Models”).

Figure 9-4. Movies in two dimensions, made by a stacked auto-encoder

Principal Component Analysis

Principal component analysis is normally called PCA, though the H2O API and R call it prcomp. It is another way to reduce the dimensionality of numeric data. Wikipedia tells me “PCA can be done by eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix.” Fortunately I don’t need to know my eigenvalue from my eigenvector to be able to use it.

I am going to call PCA with k = 2, meaning I just want to get the first two principal components. The first one will be the x-axis in my plot, and accounts for as much variability in the data as it can. The second one, which will become my y-axis, gets as much of the remaining variability as it can, while being orthogonal to the first principal component. So, with PCA each additional dimension brings along less information. This is in contrast to using an auto-encoder to reduce dimensions—there the dimensions are all equal citizens. (See Figure 11-4 in Chapter 11 for a visual example of the difference.)

The data, and application, is the same as shown in “Deep Learning Auto-Encoder”: take the tf-idf scores for 563 terms used to describe 100 movies, then reduce the 563 dimensions to just two, so that they can be plotted.

Here is the complete code, in R:

library(h2o)
h2o.init(nthreads = -1)

tfidf <- h2o.importFile("./datasets/movie.tfidf.csv")
m <- h2o.prcomp(tfidf, 2:564, k = 2)
p <- h2o.predict(m, tfidf)

After running that code p will have 2 columns and 100 rows, and plotted it looks like Figure 9-5 (again, just the top 30 movies, to avoid it getting overly messy).

Figure 9-5. Movies organized by the first two principal components

Despite a few more overlaps, it looks just as plausible as any of the other plots of this data: the two Godfather movies are quite close, as are The Wizard of Oz and Silence of the Lambs.

GLRM

GLRM stands for Generalized Low Rank Model, and it’s another algorithm for reducing the number of columns, while maintaining as much information as possible.8 The additional thing that GLRM brings is being able to cope with nonnumeric data and missing data.

GLRM is also being suggested as a lossy compression algorithm (to reduce storage requirements), and, related to that, as a way to fill in missing values. I will look at that at the end of this chapter. But here I am going to run it on the same NLP movie data as I did with auto-encoder and PCA:

library(h2o)
h2o.init(nthreads = -1)

tfidf <- h2o.importFile("./datasets/movie.tfidf.csv")

m <- h2o.glrm(tfidf, cols = 2:564, k = 2)
X <- h2o.getFrame(m@model$representation_name)
# Y <- m@model$archetypes

GLRM works by taking the 563-column by 100-row data and creating two smaller matrices: X, which is 2 columns by 100 rows, and Y, which is 563 columns by 2 rows. That is, 56,300 cells have been reduced to 200 + 1,126 = 1,326 cells. Y is commented out here, as it is not being used in this example. To restore your original data (also not needed here) you would multiply X by Y. You can do this with h2o.proj_archetypes(m, tfidf), h2o.predict(m, tfidf), or h2o.reconstruct(m, tfidf).9
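
A quick way to convince yourself of those dimensions, and to see a reconstruction, is sketched below, using the m, X, and tfidf objects from above; note that the reconstructed frame's column names may differ from the originals (e.g., a reconstr_ prefix), which is an assumption to check:

dim(X)                          #100 rows, 2 columns
dim(m@model$archetypes)         #2 rows, 563 columns (this is Y)
r <- h2o.reconstruct(m, tfidf)  #Roughly X multiplied by Y: back to 563 columns
dim(r)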

As in the previous sections we can plot the contents of X, and attach movie names (Figure 9-6).

Note

GLRM contains a number of options, including transform. The default is “NONE,” but if you change this default then you should also set reverse_transform to be true when calling h2o.proj_archetypes. For example, h2o.proj_archetypes(m, tfidf, reverse_transform = TRUE).

Figure 9-6. Movies in two dimensions, made by GLRM

Missing Data

The building energy and MNIST data sets came to us perfectly formed: no missing data at all. The same cannot be said for the football data. It was hard enough just to get it all into a single file, but at the point we left it (at the end of Chapter 3) there were quite a lot of NAs (the early years had no stats, and the set of bookmakers that we get odds from changes year to year). If we run GLM or deep learning on it, with missing data handling set to “Skip,” then any row that has an NA in any column will get ignored completely.

The missing fields in our train, valid, and test data sets are different. You can see the number of missing values by looking at the data on Flow, for instance, but to investigate this issue more deeply, I loaded the data into H2O (see Example 3-6 from Chapter 3) to set up train, test, valid, x, y, and so on, then ran the following lines to download all the data into the R client:

d <- as.data.frame(train)
dv <- as.data.frame(valid)
dt <- as.data.frame(test)

R has a nice couple of idioms to help. First, to find out how many rows have no NAs in any column, use sum(complete.cases(d)): 15,648 in the training data, 1,984 in the validation data, but only 310 in the test data. mean(complete.cases(d)) gives the same thing as a fraction: 38%, 97%, and 15%, respectively. That is a lot of data we could potentially be throwing away, especially in the test data set.
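
For reference, here is that check run against all three client-side data frames:

sum(complete.cases(d));  mean(complete.cases(d))   #15648 rows, about 0.38
sum(complete.cases(dv)); mean(complete.cases(dv))  # 1984 rows, about 0.97
sum(complete.cases(dt)); mean(complete.cases(dt))  #  310 rows, about 0.15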

Second, to see what percentage of each column is a missing value, use colMeans(is.na(d)). Here is a sample:

   Div       Date   HomeTeam   AwayTeam       FTHG       FTAG
 0.000      0.000      0.000      0.000      0.000      0.000
   HTR         HS         AS        HST        AST         HF
 0.262      0.350      0.350      0.350      0.350      0.350
    HY         AY         HR         AR      B365H      B365D
 0.350      0.350      0.350      0.350      0.450      0.450
  ...        ...        ...        ...        ...        ...
 BbAvH      BbMxD      BbAvD      BbMxA      BbAvA       BbOU
 0.600      0.600      0.600      0.600      0.600      0.600
BbAv<2.5     BbAH      BbAHh    BbMxAHH    BbAvAHH    BbMxAHA
 0.600      0.600      0.600      0.600      0.600      0.600
  HST1       AST1        HF1        AF1        HC1        AC1
 0.000      0.000      0.000      0.000      0.000      0.000
   AR1      res1H      res1A      res5H      res5A     res20H
 0.000      0.000      0.000      0.000      0.000      0.000

In the training set, all the columns starting “Bb” are 60% missing. The other betting odds columns vary from 35% to 55% missing. 34% of rows have no match stats (number of corners, etc.), and 26% don’t have the half-time result.

The validation data is completely different: dv[!complete.cases(dv), 'Date'] tells me that 45 matches in mid-August were affected, plus 3 more in April, i.e., just 2.3% of rows, and only in the betting odds columns. The test data is different again: the SJH, SJD, and SJA columns are each 85% missing.

The test data is the easiest to fix: if we remove the SJH, SJD, and SJA columns, we end up with 2032 complete cases. So it jumps from 15% complete to 99.8% complete! The way to remove columns in H2O is by doing a copy, specifying the columns we want to keep.10 Look at this code, but don’t run it just yet:

test <- test[!(colnames(test) %in% c('SJH', 'SJD', 'SJA'))]

Naturally, if you remove some columns from the test data set, you need to remove those same columns from the training and validation data sets. But we need to think what to do about all the missing data. There have been entire books written about missing data, entire conferences on the subject, so brace yourself, because I’m about to reduce it to two techniques:

  • Throw It Out

  • Make It Up

You just saw an example of Throw It Out, when we got rid of the entire SJH/SJD/SJA columns. The “Skip” behavior of GLM and deep learning, which ignores data rows with any NA, is another example.

The very simplest approach to Make It Up is to set it to zero. I did that, kind of inadvertently, when adding the previous-match stats (see “The Other Third”). Figure 9-7 shows the histogram of HS (home shots in each match) on the left, with HS1 (shots by the home side in their previous match), on the right. Have I done a bad thing here? Maybe. But before you shun me, ostracize me, shut me out of your life forever, we should consider the alternatives.

Figure 9-7. Comparison of HS and HS1 distribution (with zero)

One step up the sophistication scale is to take the mean of the column and replace all missing values with that. GLM and deep learning will do this for you, for all NAs, if you specify missing_values_handling = "MeanImputation". There is also the h2o.impute() function, which offers not just mean, but also median and mode options. (Imputation is what statisticians call making things up.) Surely that is going to be better than using a zero? Figure 9-8 is what mean imputation looks like for that same HS1 field.
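
As a concrete example, here is a minimal sketch of both routes, using the train, x, and y variables from the football setup; treat the exact arguments as assumptions to check against your version of H2O:

#In-place mean imputation of one column (h2o.impute modifies the H2O frame
#itself, so consider working on a copy of train)
h2o.impute(train, "HS1", method = "mean")

#Or leave the NAs alone and let the algorithm impute during training
m <- h2o.glm(x, y, train,
  missing_values_handling = "MeanImputation")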

Figure 9-8. Comparison of HS and HS1 distribution (with mean imputation)

I’d argue that in some situations this is just as bad. What I mean is that for algorithms like GLM, which will use HS1 as a numeric field in mathematical equations, yes, the second way is better. But for tree algorithms, that cut the numbers up into ranges, the second way has disguised the difference between a genuine 12 and a “shrug, no idea what actually happened.” Using a zero is better for the tree algorithms, as zero was a rare value.

Both approaches, zero or mean, are poor.11 The ideal would be something that kept the shape of the histogram. One way that might work, with HS1, is to guess that if a team scored zero goals they likely made fewer shots than a team that scored one goal, while a team that scored two goals likely made more shots, and so on. A quick check shows a 0.23 correlation between home-side goals and home-side shots, and a 0.37 correlation between home-side goals and home-side shots on target (the “HST” field).
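
That quick check can be done client-side on the d data frame downloaded earlier, skipping the rows where the stats are missing (FTHG, HS, and HST are the column names seen in the colMeans() output above):

cor(d$FTHG, d$HS,  use = "complete.obs")  #Home goals vs. home shots, about 0.23
cor(d$FTHG, d$HST, use = "complete.obs")  #Home goals vs. shots on target, about 0.37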

With up to 60% values missing, it is the betting odds columns that are causing the most anguish. New bookmaker sources get added, bookmakers go broke, or merge, and generally it all gets horribly messy. But, every cloud has a silver lining, and the Ag layer here is that all those bookmaker odds are highly correlated. “Estimating HS1 based on goals scored” is a level of making stuff up that gives even a politician pause, but if the odds of a home win from our other bookmakers range from 1.35 to 1.39, for a certain match, we are going to be fairly safe going with 1.37 for any missing bookmaker values.12

GLRM

Unhappy with both the "Skip" and "MeanImputation" options, I first tried the GLRM algorithm to fill in the data. My initial attempt was a very naive h2o.glrm(train, k = 9). It took so long I had to abort it, and I found out that it was trying to work with 3,830 columns! Each unique date, and each unique team name, had become a column. So I tried again, using just x, the list of column names we can validly use to learn a model from.

Objective is the measure of error in GLRM. The default 50 iterations gave an objective of 2.76 million and completed in 12.4 seconds; increasing to 200 iterations reduced the objective to 839K and increased the run time to 45 seconds. Here is the code to make the model, and then to make a version of train with no missing values:

m <- h2o.glrm(train, cols = x,
  k = 9, max_iterations = 200
  )
train2 <- h2o.reconstruct(m, train)

train2 just contains the x columns, and the column names are all different, so it would take some data hacking to merge these in to replace just the missing values in train. But the real problem with train2 is the values are outside the range of values in the original data. For instance, betting odds always have to be above 1.0. But some of the restored betting odds values were not just below 1.0 but were even negative. Other fields had a range of –1.0 to +1.0, but were being given values outside that range.
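
You can see the problem with a quick range check on the reconstructed frame; this sketch assumes the reconstructed betting-odds columns still contain "B365" somewhere in their (renamed) column names:

d2 <- as.data.frame(train2)
summary(d2[, grep("B365", names(d2))])  #Odds should never be below 1.0, yet some values go negative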

I made a number of attempts to use GLRM’s various parameters, or to try just using the odds columns, or different values for k, or more iterations, but couldn’t get past this fundamental flaw.

Lose the R!

I got much better results when I switched from GLRM to GLM. The idea is, for any given column, make a linear model to predict it based on the values in all the other columns. There are 39 columns that have at least one missing value, so this requires making 39 linear models.
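
That count of 39 can be reproduced client-side with a one-liner (shown here against d, the training data downloaded earlier; the exact number depends on which frame and rows you include):

sum(colSums(is.na(d)) > 0)           #Columns with at least one NA
names(which(colSums(is.na(d)) > 0))  #...and their names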

As a first step, I decided to drop all data prior to the 2000/2001 season. It only consisted of the final result: no match stats, no betting odds, so there was very little to impute from. That was done with this code, which creates a new data frame on the H2O cluster, and also gives it a friendly name:

train2000 <- h2o.assign(train[14237:nrow(train),], "train2000")

The following R loop shows how the fields storing betting odds were filled in. Because these columns are so highly correlated, I trained each linear model on the other odds columns and nothing else. That is what the setdiff(oddFields, y) expression is doing:

dNew <- as.data.frame(train2000)
colnames(dNew) <- colnames(train2000)

lapply(oddFields, function(y){  #oddFields: the betting-odds column names
  missingCount <- sum(is.na(dNew[,y]))
  if(missingCount == 0)return(NULL)

  m <- h2o.glm(setdiff(oddFields, y), y,
    train2000, model_id = paste0("GLM_", y),
    lambda_search = TRUE
    )
  res <- h2o.predict(m, train2000[is.na(train2000[,y]),])
  v <- as.vector(res)

  dNew <- get('dNew', envir = .GlobalEnv)
  dNew[is.na(dNew[,y]), y] <- v
  assign('dNew', dNew, envir = .GlobalEnv)
  })
train2000x <- as.h2o(dNew, destination_frame = "train2000x")

The first couple of lines prepare a client-side data frame to store the imputed data in, and the last three lines in the loop are some R hackery to insert results, replacing just the missing values.13

The final line uploads the filled-in data to the H2O cluster, which is essential because the next block of code, which fills in gaps in the other fields, needs to reference it: it allows the imputed odds data to be used as predictors when filling in the stats fields. Other than that, the following loop works just like the previous one:

mostFields <- setdiff(colnames(train), c("Date", "HomeTeam", "AwayTeam") )
lapply(statFields, function(y){  #statFields: the match-stats column names
  missingCount <- sum(is.na(dNew[,y]))
  if(missingCount == 0)return(NULL)

  m <- h2o.glm(setdiff(mostFields, y), y,
    train2000x, model_id = paste0("GLM_", y),
    lambda_search = TRUE
    )
  res <- h2o.predict(m, train2000x[is.na(train2000x[,y]),])
  v <- as.vector(res)

  dNew <- get('dNew', envir = .GlobalEnv)
  dNew[is.na(dNew[,y]), y] <- v
  assign('dNew', dNew, envir = .GlobalEnv)
  })
train2000x <- as.h2o(dNew, destination_frame = "train2000x")

To get higher-quality results, valid2000x was made by merging the new training data with the current validation data:

trainValid <- h2o.rbind(train2000x,valid)

The two loops were otherwise the same. The test data was then filled in the same way, using the merger of all three data sets:

trainValidTest <- h2o.rbind(train2000x, valid2000x, test)
Warning

Why not just merge train, valid, and test at the start, and run the loops once, instead of three times? Because that would be infecting the training data with knowledge of the future. When dealing with time-series data you need to keep thinking what data was available at what point in time, and remember that the test data represents unseen data that your model will be used on in production.

The final step was to export those data frames to csv files:

path = "/path/to/datasets/"
h2o.exportFile(train2000x, paste0(path",football.train2.csv"))
h2o.exportFile(valid2000x, paste0(path",football.valid2.csv"))
h2o.exportFile(test2000x, paste0(path",football.test2.csv"))

Unlike when importing data, relative paths are not allowed by h2o.exportFile(), so a full path must be given.

Tip

I used GLM almost out-of-the-box, the only customization being to specify lambda search (and even that was not needed). Of course, it didn’t need to be GLM, and I imagine any of random forest, GBM, or deep learning would have done the job just as well, if not better.

But GLM was good enough, and is quick, and stays quick when scaled across clusters.

To close this section, the comparison of HS and HS1 in football.train2.csv is shown in Figure 9-9. The distributions look almost exactly the same!

Figure 9-9. The final imputed HS1 data

See, I told you it would all work out in the end. (Smug grin.) Where did all the zeros go, do I hear you cry? It turns out that practically all of them were in the pre-2000 data, so that huge imbalance went away when I truncated the first 14,237 rows. I know, it feels like cheating. But all’s well that ends well!

Summary

This chapter showed how to use H2O with Natural Language Processing, but was mainly about dealing with data when the correct answer is either unavailable or does not exist. Sometimes this is a means to an end, for instance when creating better or additional training data for a supervised learning algorithm, and sometimes it is the end in itself, such as clustering or filling in missing values.

This chapter also took a detailed look at dealing with missing data, and you should now know when to specify the missing_values_handling parameter, when you need to do something as part of the data preparation stage, and when you don’t need to do anything. We also saw how H2O can be of use at any stage in your pipeline, not just to build the big model at the end.

We have now looked at four supervised learning algorithms, and a variety of unsupervised ones. The next chapter will take a quick look at everything in H2O that has not already been dealt with.

1 Each movie synopsis counts as a document; the corpus is the set of all 100 movie synopses.

2 See https://en.wikipedia.org/wiki/K-means%2B%2B for a description of this initialization algorithm.

3 They will change each time you run it, unless you set a seed (123 used here).

4 The combination of stopwords and stemming seemed to give some strange terms. Doing some proper grammatical parsing would improve results, though it would also give a huge increase in computation time. But all that is outside the scope of this book.

5 Maxout is not supported for auto-encoding.

6 Stacked auto-encoders usually model a single layer at a time; I wanted to show here that you don’t have to do it that way.

7 If you go back to Brandon Rose’s article, and code, you will see the genre of each movie is available. Could that be used for supervised learning?

8 It builds on top of k-means: in fact if you look on Flow you will see that for every GLRM model a k-means model has also been built.

9 Yes, it is strange that you need the original data tfidf, when the point of X and Y was that they can replace it, and so free up storage. It is also strange that there appear to be three functions to do the same thing.

10 Remember to then do dt <- as.data.frame(test) to get the data again, if you plan on any more client-side analysis with it.

11 I should’ve used a blank value instead of zero; then H2O would have loaded them as missing values, and they would not have appeared in the histogram at all. But, stay with me, it will all work out for the best.

12 That isn’t mean imputation. Mean imputation is the mean over the whole column. This is the average of selected fields over a row.

13 I was able to do all these columns in one go because it would fit in memory. If you are dealing with Bigger Data, this might have to be done one column at a time.

14 The bad choices here are very obvious. They won’t always be.
