Chapter 10. Everything Else

There is lots still to say: enough to fill another book! This chapter will introduce quite a few topics, without going into too much detail, but pointing you toward where you can find out more.

This chapter starts with a look at where to get the latest documentation, and how to Use The Source, Luke! Then we look at how to upgrade H2O, and how to install it from source. Which leads on to how to set up clusters, which leads on to Spark and Sparkling Water. Finally, a look at other algorithms: naive bayes and ensembles.

Staying on Top of and Poking into Things

H2O is well-documented, so http://docs.h2o.ai should be your first port of call. The latest user guide is at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html.

If you want to see if, say, any new parameters have been added to GBM, you could go to the REST API Reference, then find GBMParametersV3. Alternatively, regularly check Changes.md over on the GitHub site!

If you wanted to see how GBM is implemented in Java you would start at the Javadocs, find "hex.tree.gbm", then "GBM". The corresponding code is over on GitHub.

Random forest is called DRF in the REST endpoints, and hex.tree.drf in the Java source; everything else is pretty much named how you would expect.

Installing the Latest Version

Installing from packages, as shown back in the first chapter (“Install H2O with R (CRAN)” and “Install H2O with Python (pip)”) is going to be good enough for most people. However, installing the latest stable version (or even the latest bleeding-edge version) is not that much harder.

Unfortunately the latest versions do not have fixed aliases, like "stable" and "nightly," so I cannot just give you some instructions to copy and paste; they change with each release. Instead, go to the H2O download page, click the link for the latest stable release, and follow the instructions given there for either R or Python.

Building from Source

Some people are hopeless control freaks, but not you—you have a genuine reason to need to compile everything from source. The "How to build H2O from source" instructions are what you are looking for. There are a couple of dependencies beyond what you need to just install and run H2O, but it is not too bad. There are subsections for various platforms (including Hadoop).

Running from the Command Line

Throughout the book we have let our R or Python client start H2O for us. This is the easiest way, and also makes sure the versions are compatible.

But if the client finds H2O is already running, it will happily connect to that running instance. Starting H2O separately like this has some advantages. The main one is stability: when your client started H2O, closing the client closes H2O too. If you only ever have one client, and you never need H2O running when that client is closed, that is just what you want; if not, it will cause problems, and a separately started H2O avoids them.

The following example shows how to start H2O with 3GB of memory, on the local IP address and default port:

java -Xmx3g -jar /path/to/h2o.jar

The other reasons for using the command line usually have to do with starting clusters on a remote server. You can get help on all the current command-line options with:

java -jar /path/to/h2o.jar --help
Note

The H2O you start must still be the same version as the version of the client you have installed.

If you need to pass authentication details to S3 or HDFS, you can supply them in a core-site.xml file, specified like this:

java -jar h2o.jar -hdfs_config core-site.xml

See the documentation on what should go inside that XML file.

I mentioned in “Privacy” that H2O calls Google Analytics to record which versions are being used. If you need to opt out, the command-line argument -ga_opt_out is an alternative to creating the .h2o_no_collect file in your home directory:

java -Xmx3g -jar /path/to/h2o.jar -ga_opt_out

Clusters

H2O data structures, and its algorithms, have been built with clusters in mind, rather than as an afterthought. Even so, there are a few restrictions with H2O clusters:

  • Each H2O node on the cluster must be the same size. So if the smallest machine in your cluster has 3GB free, every node must be given 3GB, even if some of them have 64GB.1

  • Machines cannot be added once the cluster starts up.

  • If any machine dies, the whole cluster must be rebuilt.2 No high availability support.

  • Each node must be running the same version of h2o.jar. (And if you are planning to control the cluster from, say, an R client, then that also must be running the same API version as your h2o.jar.)

  • Nodes should be physically close, to minimize network latency.

For more than one of those reasons, you should favor a few big nodes over lots of small nodes. And for models that take many hours to create, use regular checkpoints (see “Checkpoints” in Chapter 4) and model-saving (see “Model Files” in Chapter 2).

An H2O cluster can be created in one of two ways:

Flatfile

A simple textfile giving the IP address and port of each node in the cluster. You must prepare this file identically on each machine. You specify it when starting h2o.jar with -flatfile myfile.txt.

Auto-discovery

If you specify -network 192.168.1.0/24 when starting h2o.jar, then it will hunt through that subnet to find all nodes and join them together in a cluster. This is slower than the flatfile, but saves you having to create that file and upload it to every node beforehand.

You can also give your cluster a name, and specify this name when starting on each node. I recommend you do this (and name your flatfile with the same name). It becomes essential if you want to run two or more clusters on the same subnet.
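To make the flatfile format concrete: it is just one ip:port entry per line. Here is a small sketch (in Python, with hypothetical node addresses) that writes such a file and prints the launch command each node would then run; substitute your own addresses and cluster name:

```python
# Sketch: generate an H2O flatfile and the matching launch command.
# The node addresses here are hypothetical; substitute your own.
nodes = ["192.168.1.101", "192.168.1.102", "192.168.1.103"]
port = 54321  # H2O's default port

# One "ip:port" entry per line; the identical file goes on every node.
with open("flatfile.txt", "w") as f:
    for ip in nodes:
        f.write(f"{ip}:{port}\n")

# The command each node would then run (shown, not executed here):
print(f"java -Xmx3g -jar h2o.jar -name my_cluster "
      f"-flatfile flatfile.txt -port {port}")
```

Giving every node the same -name and the same flatfile is what ties them together into one cluster.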

You have a few minutes to get all the nodes started before the first one will complain and shut down. However, as soon as a client (R, Python, Flow) connects to any node in the cluster, the cluster will lock down, and no further nodes can be added.

To connect to your cluster, just use h2o.init() giving the IP address of any node. There is no master node; they are all equal peers. However, the first node listed in the cluster status (get it from Flow, h2o.clusterStatus() in R, or h2o.cluster_status() in Python) sometimes has special status when it comes to importing or exporting data or models. And sometimes it is the node your client is connected to. I recommend you connect to the first node when you plan to be using the local filesystem for import or export, to avoid any confusion.3 For importing data, the data has to be visible to all nodes in the cluster, which generally means you will need a copy on each node, in exactly the same place. I find it easier to use S3 or HDFS. Alternatively, run your R or Python client on the node that has the data, and use upload file instead of import file.

EC2

Setting up an H2O cluster on Amazon EC2 machines is nice and easy because of a set of pre-made scripts. Those scripts have a Python/boto dependency, but this is solely for running the AWS commands. So if you already have scripts to set up AWS servers in some other language, it is quite possible to learn what you need from these scripts and then integrate it into the code you already have.

You will find the Flow UI listening on port 54321 of each machine in your cluster. Extract the global DNS name from either the script output or the EC2 Management Console. A nice touch is that each machine in the cluster will also come with RStudio installed; you will find it on port 8787.

Those scripts implement the following steps:

  1. Start a set of EC2 instances, recording their IP addresses.

  2. Once they are all running, get that information, and h2o.jar, on each node.

  3. Start h2o.jar on each node.

SSH is used for steps 2 and 3.

The script default is to have no security group. Instead, I recommend that you create a security group4 called “for_h2o” that allows inbound connections on TCP ports 22, 8787, 54321, and 54322, and UDP ports 54321 and 54322.5 Then find the “securityGroupName” line in h2o-cluster-launch-instances.py and set it to read:

securityGroupName = 'for_h2o'

Other Cloud Providers

There are no ready-made scripts for Azure, Google, Rackspace, Digital Ocean, and the other cloud providers. But there is no reason H2O cannot run there. Assuming you are familiar with scripting their APIs, you should be able to borrow and adapt the EC2 scripts quite easily.

Hadoop

Running H2O on Hadoop is like the normal setup, but you should fetch and install a version built for your specific Hadoop distribution: go to the H2O download page, choose the latest stable release, click the "Install on Hadoop" tab, and follow the instructions there. Then, instead of java -jar h2o.jar, you will run hadoop jar h2odriver.jar. (There are additional parameters; see that same page for the latest ones.) After that you can use Flow, or connect to H2O using your R or Python client, just as with the local version. Data can be loaded from HDFS, as we’ve already seen in earlier chapters.

Spark / Sparkling Water

In some ways Spark and H2O are competing products: they are both about analyzing Big Data, in-memory. However, each has its strengths,6 and “Sparkling Water” is a way to get them to work together closely so you can get the best of both worlds. There are versions for Spark 2.0, 1.6, 1.5, 1.4, and 1.3.

Sparkling Water is run as a regular Spark application. You run the H2O functions using Scala.7 Once the data is inside an H2O data frame, it can be shared with Spark functions, without involving a memory copy (using the asRDD() or asDataFrame() functions).

To learn more about Sparkling Water, the best resources are the booklet, which can be found at http://docs.h2o.ai or directly at http://bit.ly/2g2zfJg, and then the directory of examples on GitHub.

Naive Bayes

There is another supervised machine-learning algorithm, naive bayes. I didn’t look at it in the main section of the book, as it is a bit more limited, with binomial and multinomial classifications only, and not so many parameters, and well, basically, there wasn’t enough space to include everything.

Naive bayes is often associated with NLP (Natural Language Processing) applications, such as spam recognition or sentiment analysis. But, to get you started, Example 10-1 is the iris example from the first chapter (“Our First Learning”), done with naive bayes instead of deep learning.

Example 10-1. Naive bayes on the Iris data set, in R
library(h2o)
h2o.init(nthreads = -1)

datasets <- "https://raw.githubusercontent.com/DarrenCook/h2o/bk/datasets/"
data <- h2o.importFile(paste0(datasets, "iris_wheader.csv"))
y <- "class"
x <- setdiff(names(data), y)
parts <- h2o.splitFrame(data, 0.8)
train <- parts[[1]]
test <- parts[[2]]

m <- h2o.naiveBayes(x, y, train)
p <- h2o.predict(m, test)

That manages the same 28 out of 30 score I got in the first chapter, so naive bayes is certainly not an algorithm just for NLP applications. Your next stop should be the API documentation (it is supported from R, Python, and also Flow).
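Incidentally, if you are curious what naive bayes does under the hood, the core of the algorithm fits in a few lines. The following is a minimal, illustrative Python sketch (categorical features, Laplace smoothing, toy made-up data; this is not H2O's implementation):

```python
# Sketch: a tiny categorical naive bayes with Laplace smoothing.
from collections import Counter, defaultdict
from math import log

def train_nb(rows, labels):
    """rows: list of feature tuples; labels: list of class names."""
    class_counts = Counter(labels)
    # feature_counts[class][feature_index][value] = count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def predict_nb(model, row, alpha=1.0):
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for cls, n in class_counts.items():
        # log P(class) + sum over features of log P(value | class),
        # smoothed so unseen values do not zero out the probability.
        score = log(n / total)
        for i, value in enumerate(row):
            counts = feature_counts[cls][i]
            score += log((counts[value] + alpha) /
                         (n + alpha * (len(counts) + 1)))
        if score > best_score:
            best, best_score = cls, score
    return best

# Toy weather data: does this person play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rainy", "mild")))  # the toy model says "yes"
```

The "naive" part is the assumption that features are independent given the class, which is what lets the per-feature probabilities simply be multiplied (summed, in log space).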

Ensembles

Imagine you are putting together a team of three for a quiz night. Further, imagine that you are a god when it comes to Python, are rather good at math and physics, think sports are for losers, and don’t even own a TV, preferring to play online video games. For some of you that will require a lot of imagination, I’m sure. One option, what with you being so clever, is to just enter the quiz night solo. Alternatively, you could invite Bob (an R guy, but on the plus side he is better at statistics and League of Legends than you) and Betsy (she studied chemistry, but she is not so bad as she uses Python and you like the same video games) to be in your team, the Brainy Boffins. Or should you invite your brother-in-law, Viv, who is a lawyer and always watching the TV news and shouting at politicians to clean up their act; and old school friend, Valerie, who knows everything about hockey, football, and half a dozen other sports, and can also tell you who performed at the Super Bowl halftime show, every year, for the past 20 years? Valerie can be very boring.

I hope you chose wisely. This is the basic idea behind ensemble algorithms: a group of algorithms, conferring over their final answer, is better than a single algorithm, because it can balance out the weaknesses and oversights. And the greater the differences between the algorithms, the better.

Having decided to use an ensemble you have three main decisions to make:

  • Whether to use a library, or roll your own

  • What member models

  • How to combine their results

Taking them in order: H2O has an ensemble package (h2oEnsemble); it is not bundled with H2O, so you need to install it separately. It currently only supports R.

You could make your member models by training the best model for each algorithm. An alternative approach is to run a random grid search over a wide range of parameters, and choose the best few models that have a good diversity of model parameters. For example, if your top two deep learning models had different network layouts, one with (200,200) hidden layers and the other with (50,40,40,40), then they are likely to be useful ensemble members, as they will be seeing the world in different ways.

To combine results, you could take a page out of random forest’s book: use averaging for regression, and the most common result for categorization. Or you could use any supervised learner introduced in this book to take the output of each member model, and decide the ensemble’s output.
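Those two simple combining rules can be sketched in a few lines of Python (toy numbers, nothing H2O-specific):

```python
# Sketch: the two simplest ways to combine member-model outputs.

def combine_regression(predictions):
    """Average the members' numeric predictions (random forest style)."""
    return sum(predictions) / len(predictions)

def combine_classification(predictions):
    """Take the most common category; ties go to the earliest seen."""
    counts = {}
    for p in predictions:
        counts[p] = counts.get(p, 0) + 1
    return max(counts, key=counts.get)

print(combine_regression([2.4, 2.6, 3.0]))            # mean of three members
print(combine_classification(["cat", "dog", "cat"]))  # majority vote
```

A metalearner, as used by the stacking approach described next, replaces these fixed rules with a trained model.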

Stacking: h2o.ensemble

I mentioned H2O’s ensemble package, h2o.ensemble(). At the time of writing it only supports regression and binomial classification, so I’m going to show a quick example applied to the building energy data set, which was a regression problem. It can be used in one of two ways:

h2o.ensemble()

You specify wrapper functions for each element of the ensemble. It then builds those models, then trains a metalearner to combine their results.

h2o.stack()

You pre-build the constituent models, and pass them in as a list. It trains a metalearner to combine their results.

So, use h2o.stack() if you already have the models, and h2o.ensemble() if you don’t. I personally find specifying the wrapper functions as much work as just making the model myself, so for that reason I will show h2o.stack(). But you need to prepare the models in a certain way:

  • All models must have been built with cross-validation

  • All the same value for nfolds

  • fold_assignment = "Modulo"

  • keep_cross_validation_predictions = TRUE

If you use h2o.ensemble(), these details are taken care of for you; with h2o.stack() we must set them explicitly:

library(h2oEnsemble)

source("load.building_energy.R")

RFd <- h2o.randomForest(x, y, train, model_id="RF_defaults", nfolds=10,
  fold_assignment = "Modulo", keep_cross_validation_predictions = T)
GBMd <- h2o.gbm(x, y, train, model_id="GBM_defaults", nfolds=10,
  fold_assignment = "Modulo", keep_cross_validation_predictions = T)
GLMd <- h2o.glm(x, y, train, model_id="GLM_defaults", nfolds=10,
  fold_assignment = "Modulo", keep_cross_validation_predictions = T)
DLd <- h2o.deeplearning(x, y, train, model_id="DL_defaults", nfolds=10,
  fold_assignment = "Modulo", keep_cross_validation_predictions = T)

models <- c(RFd, GBMd, GLMd, DLd)

The preceding listing prepares the models (default settings, except for the cross-validation settings required for the ensemble), and puts them in a list object:

m_stack <- h2o.stack(models, response_frame = train[,y])
h2o.ensemble_performance(m_stack, test)

This code then calls h2o.stack(), giving that list, and requires you to specify a one-column H2O frame with the correct answers; this is different from the other H2O algorithms, which take train and y as two separate parameters.8 I then call h2o.ensemble_performance() to evaluate how it does on the test data. It will output how each individual model did, and then how the ensemble did. The results look like the following:

Base learner performance, sorted by specified metric:
       learner      MSE
3 GLM_defaults 9.013281
4  DL_defaults 7.651445
1  RF_defaults 3.626491
2 GBM_defaults 2.547896

H2O Ensemble Performance on <newdata>:
Family: gaussian

Ensemble performance (MSE): 2.55156175744006

Oh. The ensemble did slightly worse than the best model by itself. If I instead use an ensemble of just the best model and those of close strength9 (i.e., the random forest and the GBM), I get a slight improvement:

Base learner performance, sorted by specified metric:
       learner      MSE
1  RF_defaults 3.626491
2 GBM_defaults 2.547896

H2O Ensemble Performance on <newdata>:
Family: gaussian

Ensemble performance (MSE): 2.51780794131118

You might want to go ahead and use this on the tuned models… but, because the deep learning model (with an MSE of 0.388) is so much stronger than any of the other models, it works best as a team of one.

Categorical Ensembles

For categorization models, I use a different approach, taking advantage of the fact that H2O doesn’t just return a prediction, it returns a confidence in each category being the correct answer. You may remember back, way back, in the first chapter of this book, that we looked at those confidence numbers; here it is again as a reminder:

predict         Iris-setosa  Iris-versicolor  Iris-virginica
-----------     -----------  ---------------  --------------
Iris-setosa     0.999016     0.0009839        1.90283e-19
Iris-setosa     0.998988     0.0010118        1.40209e-20
...
Iris-virginica  1.5678e-08   0.3198963        0.680104
Iris-versicolor 2.3895e-08   0.9863869        0.013613

It was 99.9% sure about those two setosa, 98% sure about the versicolor, but only 68% sure about the virginica. What you tend to see, with a good model, is that almost all the answers it felt confident about were correct, but that most of the wrong answers had lower confidence. How do ensembles come into this? Think back to our quiz team, on a four-choice question: Viv is fairly sure (he says 60% sure) the answer is benzene, while you are guessing and think 40% benzene and 20% each of the other answers, but Valerie tells you she is certain that C6H12O6 is glucose, because she has just finished reading a book on diabetes. You go with Valerie’s confidence… and win the contest.

That idea can be wrapped up in just a few lines of code:

predictTeam <- function(models, data){
  # Get each model's class probabilities as a 3-D array:
  # rows x categories x models
  probabilities <- sapply(models, function(m){
    p <- h2o.predict(m, data)
    as.matrix(p[, setdiff(colnames(p), "predict")])
  }, simplify = "array")
  # For each row, pick the category with the highest total confidence
  apply(probabilities, 1, function(m) which.max(apply(m, 1, sum)))
}

It works by getting each model’s predictions, grabbing the probabilities as a three-dimensional array, then working through that array to choose the answer that has the highest average confidence across all models.10 It returns the index of the category it thinks is best. You will see it in action in the next chapter, and for more details, and more code, see my blog post on the subject.
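If you work in Python, the same highest-average-confidence logic can be sketched with plain lists (hypothetical confidence numbers; note this version returns 0-based category indices, whereas R's which.max is 1-based):

```python
# Sketch: for each row, pick the category with the highest total
# confidence summed across all member models. Summing and averaging
# give the same winner, as in the R version.

def predict_team(per_model_probs):
    """per_model_probs: [model][row][category] confidence values.
    Returns the winning category index (0-based) for each row."""
    n_rows = len(per_model_probs[0])
    n_cats = len(per_model_probs[0][0])
    winners = []
    for row in range(n_rows):
        totals = [sum(m[row][c] for m in per_model_probs)
                  for c in range(n_cats)]
        winners.append(totals.index(max(totals)))
    return winners

# Two hypothetical models, two rows, three categories:
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]
print(predict_team([model_a, model_b]))  # [2, 1]
```

In the first row model_b's strong confidence in category 2 overrules model_a's weaker preference for category 0: exactly the Valerie effect from the quiz-team story.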

Summary

Well, that was a whirlwind tour of some things I didn’t have space in the book to go into in more detail. The next chapter, the final one in this book, will bring together all the results of the supervised learning experiments on each of the three data sets, and also give some ideas for what to do when they are not good enough.

1 You could try running 20 3GB nodes on your 64GB machine, but chances are you’d be better off just running a single 64GB node, and not bothering with the little 3GB machine.

2 And the whole cluster becomes unusable if a single node gets removed: you cannot even export data or models from the rest of the cluster.

3 This behavior is in flux at the time of writing.

4 It needs to be created in each region you want to use those scripts.

5 If you will never use RStudio on the cluster you can drop 8787; similarly, if you will use RStudio or other R/Python clients only running on the cluster, you do not need to open the 54321/54322 ports.

6 H2O is considered to be considerably faster than the algorithms in MLLib, though the latter offers a lot more algorithms. Spark is currently better at the data preparation and data munging steps than H2O.

7 From the documentation site you can also find links to PySparkling and RSparkling, which are Python and R interfaces to Sparkling Water.

8 I left the metalearner parameters as the default, so it will use h2o.glm to decide how to combine the four models. This could have been any of the other H2O supervised learning algorithms.

9 To avoid biasing the results, we need to decide to do this before evaluating on test data, so base it on the results on train.

10 Implemented with sum, rather than mean, as they are equivalent inside a which.max().
