Take a look at Figure 9-1, and tell me what you see.
It doesn’t matter what the x- and y-axes are in those two plots; imagine each plot represents a scientific domain. For the chart on the left you likely identified about 10 different concepts, so that domain is likely to have about 10 specialist words; some are blurring into each other, so maybe 9, maybe 11 or 12? But the domain on the right has two concepts, and is only going to have two specialist words. Or they might be describing weather conditions over a year. The chart on the left might be describing a place with lots of distinct weather patterns, so the weather becomes a talking point, and lots of weather phrases enter the vocabulary (“Is it chucking it down outside?” “No, just a light drizzle”). The chart on the right might represent the climate of Southern California, where only two weather phrases are needed (“lovely and sunny,” and “slightly cloudy”).
The point is, you didn’t need to know the subject or the “correct answer” to be able to do something useful with the data you were given, and this is a core strength of human intelligence. In machine learning, it is called unsupervised learning, and this chapter will look at some of the functionality H2O has for it.
This automatic organization of the data can be thought of as a form of data compression. If you have 5000 input columns that are quite sparse and contain lots of duplication, you might use the techniques in this chapter to reduce them to a more manageable, and information-dense, 12 columns; then you can use one of the supervised learning techniques on those 12 columns.
The idea behind k-means is to divide your data up into k groups (you have to specify k) such that each data item is closer to the center of its cluster than to the center of any other cluster. It is doing what I asked you to do in those scatterplots earlier.
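To make the "closer to the center of its own cluster" idea concrete, here is a minimal sketch of the classic k-means loop in plain Python, with no H2O involved. The six points and the two starting centers are made up for illustration, and this is the textbook algorithm, not a description of H2O's internal implementation.

```python
def kmeans(points, centers, iterations=10):
    """Textbook k-means on a list of (x, y) tuples; centers are the starting guesses."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for members, old in zip(clusters, centers):
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        centers = new_centers
    return centers, clusters

# Made-up data: two obvious blobs, like the right-hand scatterplot.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(clusters)
```

The starting centers matter: a different choice can give different clusters, which is why the initialization method is one of the parameters you can set.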
For this section I am going to use a Natural Language Processing (NLP) example. But as space is limited, and I want to keep this chapter focused, I am going to take the tf-idf (more on that in a moment) data that someone else has made, and direct you to their article and GitHub site; see the following sidebar.
The csv file has 564 columns: the first one is the name of the movie, the other 563 are the terms that were extracted. I want to use those 563 values to divide the movies up into 5 clusters; if we get lucky the set of movies in each cluster will be similar to each other. Example 9-1 shows how to do that in R. Most of it should be familiar by now; tapply is an R function to group the values in one column (movie names) by another column (each movie’s k-means group), then apply the given function (print) to each group.
library(h2o)
h2o.init(nthreads = -1)
tfidf <- h2o.importFile("./datasets/movie.tfidf.csv")
m <- h2o.kmeans(tfidf, x = 2:564, k = 5, standardize = FALSE, init = "PlusPlus")
p <- h2o.predict(m, tfidf)
tapply(as.vector(tfidf[,1]), as.vector(p$predict), print)
Example 9-2 is how to do that in Python; see the inline comments.
import h2o
h2o.init()
tfidf = h2o.import_file("./datasets/movie.tfidf.csv")

from h2o.estimators.kmeans import H2OKMeansEstimator
m = H2OKMeansEstimator(k=5, standardize=False, init="PlusPlus")
m.train(x=range(1, 564), training_frame=tfidf)

#Get the group that each movie is in
p = m.predict(tfidf)

#Join that to our movie names, then download it
d = tfidf[0].cbind(p).as_data_frame()
d.columns = ["movie", "group"]

#Iterate through and print each group
for ix, g in d.groupby("group"):
    print("---", ix, "---")
    print(', '.join(g["movie"]))
I set a couple of optional parameters. First, init="PlusPlus",2 which I felt gave better results than the default of “Furthest,” or the other alternative, “Random.” (You can also specify your own initialization values.) I also set standardize to false, because the data is already nicely between 0.0 and 1.0. The m object tells you quite a lot of information, including how many items are in each cluster, but if you want to find out which item is in which cluster you have to ask it to predict them!
It runs quickly, a matter of seconds. Here is the first of the five groups:3
Schindler's List; One Flew Over Cuckoo Nest; Gone with the Wind; The Wizard of Oz; Lawrence of Arabia; Forrest Gump; E.T. the Extra-Terrestrial; LOTR: Return of the King; Gladiator; Saving Private Ryan; Raiders of the Lost Ark; Streetcar Named Desire; Best Years of Our Lives; My Fair Lady; Ben-Hur; Doctor Zhivago; Platoon; The Pianist; The Exorcist; The Deer Hunter; All Quiet on Western Front; Mr. Smith Goes Washington; Terms of Endearment; The Grapes of Wrath; Shane; The Green Mile; Close Encounters 3rd Kind; The Graduate; Stagecoach; A Clockwork Orange; Wuthering Heights
Here is the second group:
Raging Bull; Citizen Kane; Singin' in the Rain; 12 Angry Men; Amadeus; Gandhi; Rocky; To Kill a Mockingbird; Braveheart; Dances with Wolves; City Lights; Good Will Hunting; Network
And the third group:
It's a Wonderful Life; Philadelphia Story; American in Paris; Patton; The King's Speech; A Place in the Sun; Out of Africa; Tootsie; Giant; Nashville; Yankee Doodle Dandy
The fourth group has both Godfather movies in it; most of the Westerns seem to be in here too:
The Godfather; The Shawshank Redemption; Casablanca; Titanic; The Godfather: Part II; Psycho; Sunset Blvd.; Vertigo; On the Waterfront; West Side Story; The Silence of the Lambs; Chinatown; Some Like It Hot; Unforgiven; Good, Bad and Ugly; Butch Cassidy & Sundance; Treasure of Sierra Madre; The Apartment; High Noon; Goodfellas; The French Connection; It Happened One Night; Midnight Cowboy; Rain Man; Annie Hall; Fargo; American Graffiti; Pulp Fiction; The Maltese Falcon; Taxi Driver; Double Indemnity; Rebel Without Cause; Rear Window; The Third Man; North by Northwest
And the fifth has some scary ones like Jaws and The Sound of Music:
The Sound of Music; Star Wars; 2001: A Space Odyssey; Bridge on the River Kwai; Dr. Strangelove; Apocalypse Now; From Here to Eternity; Jaws; The African Queen; Mutiny on the Bounty
I’m sure you can see how this can form the basis of a “People who enjoyed Saving Private Ryan also enjoyed Platoon" movie recommendation system, but the fact that the five groups change so much from run to run makes me cautious. It might just be that the input data needs more work.4
We previously had a whole chapter (Chapter 8) on using h2o.deeplearning() for supervised learning, but now we will look at it for unsupervised learning. You switch it into this mode by setting autoencoder to true. The other difference is to not set the y argument (i.e., the field you want to learn, in supervised learning).
It is a bit of a trick: it still does supervised learning, but it copies your input layer to be the output layer (aka “the answer”). In other words, it tries to learn the inputs. That might sound a bit pointless, but what is happening is that the hidden layers are being forced to summarize the data, to compress it. All the tuning knowledge we learned in the earlier chapter can be applied.
As an example of its use, I am going to take the same NLP data set that was used in “K-Means Clustering”, where we have 100 movies, and 563 terms, and see if we can reduce those 563 dimensions to just two dimensions. That is quite an ask, but I have chosen two because then I can plot the results.
The nice thing about this data set is there are only 100 rows, so the experiments can be quite quick. Example 9-3 is almost the simplest possible auto-encoder: there will be 563 input neurons, going down to just two neurons in a single hidden layer, then going to 563 output neurons. It will train for the default 10 epochs.
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = T, activation = "Tanh"
  )
f <- h2o.deepfeatures(m, tfidf, layer = 1)
I said almost: I’ve added activation = "Tanh" instead of using the default Rectifier. You can use the default Rectifier, but:
It gives poor results: x or y will be zero for many of them, giving a clustering along the bottom and left.
In more complex auto-encoders you will get complaints of numerical instability.
So, with auto-encoders I recommend always using Tanh.5
The code in Python (Example 9-4) is quite similar, though note that layers (and column indices) count from zero, whereas in R they counted from one.
m = h2o.estimators.deeplearning.H2OAutoEncoderEstimator(
    hidden=[2], activation="Tanh"
    )
m.train(x=range(1, 564), training_frame=tfidf)
f = m.deepfeatures(tfidf, layer=0)
Despite the simplicity, it gives acceptable results, with an MSE of 0.035. Figure 9-2 shows the movies plotted by the value of the two reduced dimensions.
In this series of plots only the first 30 movies are plotted each time, to stop them looking too cluttered. You will see overlapping names. These are not printing errors! It is where the algorithm has not managed to separate those two movies in the meager two dimensions it has been given.
By the way, I am plotting the first 30 with R code like this:
d <- as.matrix(f[1:30,])
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17)  #Triangle
text(d, labels, pos = 3)  #pos=3 means above
Next I made 19 models, where I experimented with the amount of reduction: between 2 and 20 dimensions. For each model, hidden was set to 128,64,nodes,64,128, where nodes ranged from 2 to 20. So 563 input nodes linked to the first hidden layer with 128 nodes, then to 64 nodes in the second hidden layer, then the 2+ nodes of interest are the third hidden layer, then back up to 64, then up to 128 in the fifth and final hidden layer, and then finally out to 563 output nodes. I first tried this with 5 epochs (the upper line in Figure 9-3), and then again with 400 epochs (the mostly lower line).
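As a small illustration of that experiment grid, the 19 hidden-layer specifications can be generated with a list comprehension. This is only a sketch of the bookkeeping: the variable names are mine, and each list is what you would pass as the hidden argument when building a model.

```python
# One hidden-layer specification per bottleneck width, 2 through 20 inclusive.
hidden_configs = [[128, 64, nodes, 64, 128] for nodes in range(2, 21)]

# The full shape of each network, including the 563 input and 563 output nodes.
full_shapes = [[563] + hidden + [563] for hidden in hidden_configs]

print(len(hidden_configs))
print(hidden_configs[0])
print(full_shapes[-1])
```

The middle entry of each list is the layer whose values you would later extract as the reduced dimensions.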
The more dimensions in the middle hidden layer the easier it is to learn, but the steeper angle of the lower curve means it needs plenty of epochs to take advantage of this. It is curious that the first four values did worse with more epochs.
A neural net model can be built up in stages, which is a useful technique, even with supervised learning, when you find you want more layers but learning is too slow or just not working.
Terminology alert: You might also see stacking models used to mean ensembles (described in Chapter 10). But with auto-encoding neural networks it means learning one layer at a time.
Staying with the movie NLP data set:
m1 = h2o.estimators.deeplearning.H2OAutoEncoderEstimator(
    hidden=[128, 64, 11, 64, 128], activation="Tanh", epochs=400
    )
m1.train(x=range(1, 564), training_frame=tfidf)
f1 = m1.deepfeatures(tfidf, layer=2)

m2 = h2o.estimators.deeplearning.H2OAutoEncoderEstimator(
    hidden=[2], activation="Tanh", epochs=400
    )
m2.train(x=range(0, 11), training_frame=f1)
f2 = m2.deepfeatures(f1, layer=0)
What is happening is that a model m1 with 5 hidden layers, and 11 neurons in the middle layer, is trained on the raw data, tfidf.6 Then that third hidden layer is extracted into f1. f1 is a transformation of tfidf, still with 100 rows, but now only 11 columns. m2 then uses f1 as its input, and it builds a much simpler model, reducing 11 input nodes to 2 hidden nodes then back out to 11 output nodes. At the end the results are put in f2. f2 has 2 columns, but still has 100 rows, one for each movie.
I can’t objectively say if Figure 9-4 looks any better: this is unsupervised learning, after all! But the key point here is that f1 could have been used as the training frame into any algorithm, whether another auto-encoder, as here, or a supervised learning algorithm such as random forest.7
A common example you might see is to auto-encode the MNIST data set, into two columns, which you then plot to show the digits (normally color-coded). If you have got it right you should see each digit clustering together nicely. It is also interesting to see the different clustering you get when using PCA (see “Principal Component Analysis”) to reduce the same data to two dimensions.
You might then train on just those two columns. Or add them to the data, just like the enhanced data that was added before (see “Helping the Models”).
Principal component analysis is normally called PCA, though the H2O API and R call it prcomp. It is another way to reduce the dimensionality of numeric data. Wikipedia tells me “PCA can be done by eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix.” Fortunately I don’t need to know my eigenvalue from my eigenvector to be able to use it.
I am going to call PCA with k = 2, meaning I just want to get the first two principal components. The first one will be the x-axis in my plot, and accounts for as much variability in the data as it can. The second one, which will become my y-axis, gets as much of the remaining variability as it can, while being orthogonal to the first principal component. So, with PCA each additional dimension brings along less information. This is in contrast to using an auto-encoder to reduce dimensions, where the dimensions are all equal citizens. (See Figure 11-4 in Chapter 11 for a visual example of the difference.)
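To see "as much variability as it can" and "orthogonal" in numbers, here is a minimal sketch of PCA on a made-up two-column data set, in plain Python. It uses the closed-form eigenvalues of a 2x2 covariance matrix, so it only works for two columns, and it assumes the columns are correlated; it is an illustration of the idea, not of how H2O computes it.

```python
import math

def pca_2d(points):
    """Return [(variance, unit_direction), ...], largest variance first."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    if b == 0:
        raise ValueError("this sketch assumes the two columns are correlated")
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    components = []
    for eigenvalue in (half_trace + root, half_trace - root):
        # (b, eigenvalue - a) is an eigenvector of the covariance matrix.
        length = math.hypot(b, eigenvalue - a)
        components.append((eigenvalue, (b / length, (eigenvalue - a) / length)))
    return components

# Made-up data lying close to a diagonal line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
(var1, dir1), (var2, dir2) = pca_2d(points)
share = var1 / (var1 + var2)                  # fraction of variance on the first axis
dot = dir1[0] * dir2[0] + dir1[1] * dir2[1]   # close to zero: the axes are orthogonal
print(share, dot)
```

Because the points sit close to a line, almost all the variance lands on the first component, and the second one brings along very little.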
The data, and application, is the same as shown in “Deep Learning Auto-Encoder”: take the tf-idf scores for 563 terms used to describe 100 movies, then reduce the 563 dimensions to just two, so that they can be plotted.
Here is the complete code, in R:
library(h2o)
h2o.init(nthreads = -1)
tfidf <- h2o.importFile("./datasets/movie.tfidf.csv")
m <- h2o.prcomp(tfidf, 2:564, k = 2)
p <- h2o.predict(m, tfidf)
After running that code p will have 2 columns and 100 rows, and plotted it looks like Figure 9-5 (again, just the first 30 movies, to avoid it getting overly messy).
Despite a few more overlaps, it looks just as plausible as any of the other plots of this data: the two Godfather movies are quite close, as are The Wizard of Oz and Silence of the Lambs.
GLRM stands for Generalized Low Rank Model, and it’s another algorithm for reducing the number of columns, while maintaining as much information as possible.8 The additional thing that GLRM brings is being able to cope with nonnumeric data and missing data.
GLRM is also being suggested as a lossy compression algorithm (to reduce storage requirements), and, related to that, as a way to fill in missing values. I will look at that at the end of this chapter. But here I am going to run it on the same NLP movie data as I did with auto-encoder and PCA:
library(h2o)
h2o.init(nthreads = -1)
tfidf <- h2o.importFile("./datasets/movie.tfidf.csv")
m <- h2o.glrm(tfidf, cols = 2:564, k = 2)
X <- h2o.getFrame(m@model$representation_name)
# Y <- m@model$archetypes
GLRM works by taking the 563 column by 100 row data, and creating two smaller matrices: X, which is 2 columns by 100 rows, and Y, which is 563 columns by 2 rows. That is, 56,300 cells have been reduced to 200 + 1126 = 1326 cells. Y is commented out here, as it is not being used in this example. To restore an approximation of your original data (also not needed here) you would do X * Y, as a matrix multiplication. You can do this with h2o.proj_archetypes(m, tfidf) or h2o.predict(m, tfidf) or h2o.reconstruct(m, tfidf).9
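The shapes and the storage arithmetic are easy to check with a toy example in plain Python. The small X and Y below are made-up numbers, not output from GLRM; the point is only that multiplying a tall, thin matrix by a short, wide one gives back something the shape of the original data.

```python
def matmul(X, Y):
    """Multiply an n-by-k matrix by a k-by-m matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Toy factors: 4 rows, k = 2, and 3 original columns.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
Y = [[0.2, 0.4, 0.6], [1.0, 0.0, 0.5]]
approx = matmul(X, Y)   # 4 rows by 3 columns, the shape of the data being approximated

# Storage arithmetic for the movie data: 100 rows, 563 columns, k = 2.
rows, cols, k = 100, 563, 2
original_cells = rows * cols
compressed_cells = rows * k + k * cols
print(original_cells, compressed_cells)
```

The saving grows with the size of the data: the compressed size depends on rows plus columns, while the original depends on rows times columns.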
As in the previous sections we can plot the contents of X, and attach movie names (Figure 9-6).
The building energy and MNIST data sets came to us perfectly formed: no missing data at all. The same cannot be said for the football data. It was hard enough just to get it all into a single file, but at the point we left it (at the end of Chapter 3) there were quite a lot of NAs (the early years had no stats, and the set of bookmakers that we get odds from changes year to year). If we run GLM or deep learning on it, with missing data handling set to “Skip,” then any row that has an NA in any column will get ignored completely.
The missing fields in our train, valid, and test data sets are different. You can see the number of missing values by looking at the data on Flow, for instance, but to investigate this issue more deeply, I loaded the data into H2O (see Example 3-6 from Chapter 3) to set up train, test, valid, x, y, and so on, then I ran the following lines to download all the data into the R client:
d <- as.data.frame(train)
dv <- as.data.frame(valid)
dt <- as.data.frame(test)
R has a nice couple of idioms to help. First, to find out how many rows we have with no NAs in any column, use sum(complete.cases(d)). That gives 15,648 in the training data, 1,984 in the validation data set, and only 310 in the test data. mean(complete.cases(d)) gets that as a percentage: 38%, 97%, and 15%. That is a lot of data we could potentially be throwing away, especially in the test data set.
Second, to see what percentage of each column is a missing value, use colMeans(is.na(d)). Here is a sample:
       Div      Date  HomeTeam  AwayTeam      FTHG      FTAG
     0.000     0.000     0.000     0.000     0.000     0.000
       HTR        HS        AS       HST       AST        HF
     0.262     0.350     0.350     0.350     0.350     0.350
        HY        AY        HR        AR     B365H     B365D
     0.350     0.350     0.350     0.350     0.450     0.450
       ...       ...       ...       ...       ...       ...
     BbAvH     BbMxD     BbAvD     BbMxA     BbAvA      BbOU
     0.600     0.600     0.600     0.600     0.600     0.600
  BbAv<2.5      BbAH     BbAHh   BbMxAHH   BbAvAHH   BbMxAHA
     0.600     0.600     0.600     0.600     0.600     0.600
      HST1      AST1       HF1       AF1       HC1       AC1
     0.000     0.000     0.000     0.000     0.000     0.000
       AR1     res1H     res1A     res5H     res5A    res20H
     0.000     0.000     0.000     0.000     0.000     0.000
In the training set, all the columns starting “Bb” are 60% missing. The other betting odds columns vary from 35% to 55% missing. 34% of rows have no match stats (number of corners, etc.), and 26% don’t have the half-time result.
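If you prefer to think in Python, the same two checks can be sketched without any libraries. The four rows below are made up, with None standing in for NA, and the column names are just borrowed from the football data for flavor.

```python
# Made-up rows; None plays the role of R's NA.
rows = [
    {"FTHG": 2, "HS": 11,   "B365H": 1.40},
    {"FTHG": 0, "HS": None, "B365H": None},
    {"FTHG": 1, "HS": 9,    "B365H": None},
    {"FTHG": 3, "HS": None, "B365H": 2.10},
]
columns = ["FTHG", "HS", "B365H"]

# Like sum(complete.cases(d)): rows with no missing value in any column.
complete = sum(all(r[c] is not None for c in columns) for r in rows)

# Like mean(complete.cases(d)): the same thing as a fraction of all rows.
complete_share = complete / len(rows)

# Like colMeans(is.na(d)): the fraction of missing values in each column.
missing_share = {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

print(complete, complete_share, missing_share)
```

Notice how one complete row out of four can coexist with columns that are each only half missing: the gaps do not line up, which is exactly why skipping incomplete rows throws away so much.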
The validation data is completely different: dv[!complete.cases(dv),'Date'] tells me that there were 45 matches affected in mid-August, and that 3 more matches were affected in April, i.e., just 2.3%, and just betting odds columns. The test data is different again: 85% of the values in columns SJH, SJD, and SJA are missing.
The test data is the easiest to fix: if we remove the SJH, SJD, and SJA columns, we end up with 2032 complete cases. So it jumps from 15% complete to 99.8% complete! The way to remove columns in H2O is by doing a copy, specifying the columns we want to keep.10 Look at this code, but don’t run it just yet:
test <- test[!(colnames(test) %in% c('SJH', 'SJD', 'SJA'))]
Naturally, if you remove some columns from the test data set, you need to remove those same columns from the training and validation data sets. But we need to think what to do about all the missing data. There have been entire books written about missing data, entire conferences on the subject, so brace yourself, because I’m about to reduce it to two techniques:
Throw It Out
Make It Up
You just saw an example of Throw It Out, when we got rid of the entire SJH/SJD/SJA columns. The “Skip” behavior of GLM and deep learning, which ignores data rows with any NA, is another example.
The very simplest approach to Make It Up is to set it to zero. I did that, kind of inadvertently, when adding the previous-match stats (see “The Other Third”). Figure 9-7 shows the histogram of HS (home shots in each match) on the left, with HS1 (shots by the home side in their previous match), on the right. Have I done a bad thing here? Maybe. But before you shun me, ostracize me, shut me out of your life forever, we should consider the alternatives.
One step up the sophistication scale is to take the mean of the column and replace all missing values with that. And GLM and deep learning will do this for you, for all NAs, if you specify missing_values_handling = "MeanImputation". There is also the h2o.impute() function, which offers not just mean, but also median and mode options. (Imputation is what statisticians call making things up.) Surely that is going to be better than using a zero? Figure 9-8 is what mean imputation looks like for that same HS1 field.
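Here is what mean, median, and mode imputation amount to, as a minimal plain-Python sketch. The impute() function and the shot counts are made up for illustration; this shows the idea only and says nothing about how H2O implements it.

```python
import statistics

def impute(values, method="mean"):
    """Return a copy of values with each None replaced by one summary statistic."""
    known = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[method](known)
    return [fill if v is None else v for v in values]

hs1 = [12, None, 8, 15, None, 9, 12]   # made-up shot counts with two gaps

by_mean = impute(hs1, "mean")
by_median = impute(hs1, "median")
by_mode = impute(hs1, "mode")
print(by_mean)
print(by_median)
print(by_mode)
```

Whichever statistic you pick, every gap gets the same value, so a column with many gaps ends up with one artificial spike in its histogram.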
I’d argue that in some situations this is just as bad. What I mean is that for algorithms like GLM, which will use HS1 as a numeric field in mathematical equations, yes, the second way is better. But for tree algorithms, that cut the numbers up into ranges, the second way has disguised the difference between a genuine 12 and a “shrug, no idea what actually happened.” Using a zero is better for the tree algorithms, as zero was a rare value.
Both approaches, zero or mean, are poor.11 The ideal would be something that kept the shape of the histogram. One way that might work, with HS1, is to guess that if a team scored zero goals they likely made fewer shots than a team that scored one goal, while a team that scored two goals likely made more shots, and so on. A quick check shows a 0.23 correlation between home-side goals and home-side shots, and a 0.37 correlation between home-side goals and home-side shots on target (the “HST” field).
With up to 60% values missing, it is the betting odds columns that are causing the most anguish. New bookmaker sources get added, bookmakers go broke, or merge, and generally it all gets horribly messy. But, every cloud has a silver lining, and the Ag layer here is that all those bookmaker odds are highly correlated. “Estimating HS1 based on goals scored” is a level of making stuff up that gives even a politician pause, but if the odds of a home win from our other bookmakers range from 1.35 to 1.39, for a certain match, we are going to be fairly safe going with 1.37 for any missing bookmaker values.12
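That "go with 1.37" reasoning is a row-wise average, and can be sketched in a few lines of plain Python. The function name and the odds are made up; the three known values are the 1.35 to 1.39 range from the example above.

```python
def fill_from_row(odds):
    """Replace each None with the mean of the known odds in the same row."""
    known = [o for o in odds if o is not None]
    if not known:
        return list(odds)   # nothing to go on, so leave the row as it is
    row_mean = sum(known) / len(known)
    return [row_mean if o is None else o for o in odds]

# Home-win odds for one match from five bookmakers, two of them missing.
match_odds = [1.35, None, 1.39, 1.37, None]
filled = fill_from_row(match_odds)
print(filled)
```

This only works because the columns measure the same thing on the same scale; averaging across, say, shots and corners would be meaningless.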
Unhappy with both “Skip” and “MeanImputation” options, I first tried the GLRM algorithm to fill in the data. My first try was a very naive h2o.glrm(train, k = 9). It took so long I had to abort it, and found out that it was trying to work with 3830 columns! Each unique date, and each unique team name, had become a column. So I tried again, using just x, the list of column names we can validly use to learn a model from.
Objective is the measure of error in GLRM, and the default 50 iterations gave an objective of 2.76 million, and completed in 12.4 seconds; increasing to 200 iterations reduced the objective to 839K, and increased the run time to 45 seconds. Here is the code to make the model, and then to make a version of train with no missing values:
m <- h2o.glrm(train, cols = x, k = 9, max_iterations = 200)
train2 <- h2o.reconstruct(m, train)
train2 just contains the x columns, and the column names are all different, so it would take some data hacking to merge these in to replace just the missing values in train. But the real problem with train2 is the values are outside the range of values in the original data. For instance, betting odds always have to be above 1.0. But some of the restored betting odds values were not just below 1.0 but were even negative. Other fields had a range of –1.0 to +1.0, but were being given values outside that range.
I made a number of attempts to use GLRM’s various parameters, or to try just using the odds columns, or different values for k, or more iterations, but couldn’t get past this fundamental flaw.
I got much better results when I switched from using GLRM to using GLM. The idea is, for any given column, make a linear model to predict it based on the value in all the other columns. There are 39 columns that have at least one missing value, so this requires making 39 linear models.
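To show the idea at its very smallest, here is a plain-Python sketch that fills the gaps in one column using a straight line fitted from a single other column. That is a deliberate simplification of mine: the real thing uses GLM with all the other columns as predictors. The function names and the two bookmaker columns are made up, and the predictor column is assumed to have no gaps of its own.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def impute_from(predictor, target):
    """Fill None entries of target using a line fitted on the complete rows."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    intercept, slope = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    return [intercept + slope * x if y is None else y
            for x, y in zip(predictor, target)]

# Made-up, highly correlated odds from two bookmakers; the second has gaps.
bookmaker_a = [1.30, 1.80, 2.50, 3.10, 2.00]
bookmaker_b = [1.32, 1.82, None, 3.12, None]
filled = impute_from(bookmaker_a, bookmaker_b)
print(filled)
```

Unlike mean imputation, each gap gets its own value, driven by what the other column says about that particular row; the known values are left untouched.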
As a first step, I decided to drop all data prior to the 2000/2001 season. It only consisted of the final result, no match stats, no betting odds, so very little to impute off of. That was done with this code, which creates a new data frame on the H2O cluster, and also gives it a friendly name:
train2000 <- h2o.assign(train[14237:nrow(train),], "train2000")
The following R loop shows how the fields storing betting odds were filled in. Because these columns are so highly correlated, I just trained each linear model from the other odds columns and nothing else. That is what the setdiff(oddFields, y) statement is doing:
dNew <- as.data.frame(train2000)
colnames(dNew) <- colnames(train2000)

lapply(oddFields, function(y){
  missingCount = sum(is.na(dNew[,y]))
  if(missingCount == 0) return(NULL)

  m <- h2o.glm(setdiff(oddFields, y), y, train2000,
    model_id = paste0("GLM_", y), lambda_search = TRUE)
  res <- h2o.predict(m, train2000[is.na(train2000[,y]),])
  v <- as.vector(res)

  dNew <- get('dNew', envir = .GlobalEnv)
  dNew[is.na(dNew[,y]), y] <- as.vector(res)
  assign('dNew', dNew, envir = .GlobalEnv)
  })

train2000x <- as.h2o(dNew, destination_frame = "train2000x")
The first couple of lines prepare a client-side data frame to store the imputed data in, and the last three lines in the loop are some R hackery to insert results, replacing just the missing values.13
The final line uploads the filled-in data to the H2O cluster, which is essential so we can reference it in the next block of code, which will fill in gaps in the other fields. It allows using the imputed odds data as a field to learn from when filling in the stats fields. Other than that, the following loop works just like the previous one:
mostFields <- setdiff(
  colnames(train),
  c("Date", "HomeTeam", "AwayTeam")
  )

lapply(statFields, function(y){
  missingCount <- sum(is.na(dNew[,y]))
  if(missingCount == 0) return(NULL)

  m <- h2o.glm(setdiff(mostFields, y), y, train2000x,
    model_id = paste0("GLM_", y), lambda_search = TRUE)
  res <- h2o.predict(m, train2000x[is.na(train2000x[,y]),])
  v <- as.vector(res)

  dNew <- get('dNew', envir = .GlobalEnv)
  dNew[is.na(dNew[,y]), y] <- as.vector(res)
  assign('dNew', dNew, envir = .GlobalEnv)
  })

train2000x <- as.h2o(dNew, destination_frame = "train2000x")
To get higher-quality results, valid2000x was made by merging the new training data with the current validation data:
trainValid <- h2o.rbind(train2000x, valid)
The two loops were the same. And the test data was filled in in just the same way, using the merger of three data sets:
trainValidTest <- h2o.rbind(train2000x, valid2000x, test)
Why not just merge train, valid, and test at the start, and run the loops once, instead of three times? Because that would be infecting the training data with knowledge of the future. When dealing with time-series data you need to keep thinking what data was available at what point in time, and remember that the test data represents unseen data that your model will be used on in production.
The final step was to export those data frames to csv files:
path = "/path/to/datasets/"
h2o.exportFile(train2000x, paste0(path, "football.train2.csv"))
h2o.exportFile(valid2000x, paste0(path, "football.valid2.csv"))
h2o.exportFile(test2000x, paste0(path, "football.test2.csv"))
Unlike importing data, relative paths are not allowed by exportFile(), so a full path must be given.
I used GLM almost out-of-the-box, the only customization being to specify lambda search (and even that was not needed). Of course, it didn’t need to be GLM, and I imagine any of random forest, GBM, or deep learning would have done the job just as well, if not better.
But GLM was good enough, and is quick, and stays quick when scaled across clusters.
To close this section, the comparison of HS and HS1 in football.train2.csv is shown in Figure 9-9. The distributions look almost exactly the same!
See, I told you it would all work out in the end. (Smug grin.) Where did all the zeros go, do I hear you cry? It turns out that practically all of them were in the pre-2000 data. So, that huge imbalance went away when I truncated away the first 14,237 rows. I know, it feels like cheating. But, all’s well that ends well!
This chapter showed how to use H2O with Natural Language Processing, but was mainly about dealing with data when the correct answer is either unavailable or does not exist. Sometimes this is a means to an end, for instance when creating better or additional training data for a supervised learning algorithm, and sometimes it is the end in itself, such as clustering or filling in missing values.
This chapter also took a detailed look at dealing with missing data, and you should now know when to specify the missing_values_handling parameter, when you need to do something as part of the data preparation stage, and when you don’t need to do anything. We also saw how H2O can be of use at any stage in your pipeline, not just to build the big model at the end.
We have now looked at four supervised learning algorithms, and a variety of unsupervised ones. The next chapter will take a quick look at everything in H2O that has not already been dealt with.
1 Each movie synopsis counts as a document; the corpus is the set of all 100 movie synopses.
2 See https://en.wikipedia.org/wiki/K-means%2B%2B for a description of this initialization algorithm.
3 They will change each time you run it, unless you set a seed (123 used here).
4 The combination of stopwords and stemming seemed to give some strange terms. Doing some proper grammatical parsing would improve results, though it would also give a huge increase in computation time. But all that is outside the scope of this book.
5 Maxout is not supported for auto-encoding.
6 Stacked auto-encoders usually model a single layer at a time; I wanted to show here that you don’t have to do it that way.
7 If you go back to Brandon Rose’s article, and code, you will see the genre of each movie is available. Could that be used for supervised learning?
8 It builds on top of k-means: in fact if you look on Flow you will see that for every GLRM model a k-means model has also been built.
9 Yes, it is strange you need the original data tfidf, when the point of X and Y was that they can replace it, and so free up storage. Also strange that there appear to be three functions to do the same thing.
10 Remember to then do dt <- as.data.frame(test) to get the data again, if you plan on any more client-side analysis with it.
11 I should’ve used a blank value instead of zero; then H2O would have loaded them as missing values, and they would not have appeared in the histogram at all. But, stay with me, it will all work out for the best.
12 That isn’t mean imputation. Mean imputation is the mean over the whole column. This is the average of selected fields over a row.
13 I was able to do all these columns in one go because it would fit in memory. If you are dealing with Bigger Data, this might have to be done one column at a time.
14 The bad choices here are very obvious. They won’t always be.