Because H2O is Java-based software with an R wrapper, using it from R takes two steps: we must start an instance of H2O and then connect R to it, so that R can pass data and model commands to that instance. In this section, we will show how to get everything set up to train a model using H2O.
To initialize an H2O cluster, we use the h2o.init() function. Initializing a cluster also sets up a lightweight web server that allows interaction with the software via a local webpage. The h2o.init() function has sensible default values, but many aspects of it can be customized. Two that are particularly worth setting are the number of cores/threads to use and the maximum amount of memory the cluster may use, which are controlled by the nthreads and max_mem_size arguments, respectively. In the code that follows, we initialize an H2O cluster to use two threads and up to three gigabytes of memory. After the code, R will indicate the location of log files, the Java version, and details about the cluster:
cl <- h2o.init(
  max_mem_size = "3G",
  nthreads = 2)

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\jwile\AppData\Local\Temp\RtmpuelhZm/h2o_jwile_started_from_r.out
    C:\Users\jwile\AppData\Local\Temp\RtmpuelhZm/h2o_jwile_started_from_r.err

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b18)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b18, mixed mode)

.Successfully connected to http://127.0.0.1:54321/

R is connected to the H2O cluster:
    H2O cluster uptime:         1 seconds 735 milliseconds
    H2O cluster version:        3.6.0.8
    H2O cluster name:           H2O_started_from_R_jwile_ndx127
    H2O cluster total nodes:    1
    H2O cluster total memory:   2.67 GB
    H2O cluster total cores:    4
    H2O cluster allowed cores:  2
    H2O cluster healthy:        TRUE
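When choosing a value for nthreads, it can help to check how many cores the machine reports (the output above shows four total cores, of which two are allowed). The following is a minimal sketch using detectCores() from the parallel package that ships with R. The policy of leaving one core free and the variable names are illustrative assumptions, not an H2O recommendation.

```r
# Ask R how many logical cores the machine reports (may be NA on some platforms)
n_cores <- parallel::detectCores()

# Fall back to a single thread if the count is unavailable; otherwise leave one
# core free for the rest of the system (an illustrative policy, not a rule)
n_threads <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

n_threads
```

The resulting value could then be supplied as the nthreads argument of h2o.init().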
Once the cluster is initialized, we can interface with it either using R or using the web interface available at the local host (127.0.0.1:54321); it is shown in Figure 1.6:
There are a couple of ways to get data into an H2O cluster. If the dataset is already loaded into R, you can simply use the as.h2o() function, as shown in the following code:
h2oiris <- as.h2o(
  droplevels(iris[1:100, ]))
We can check the results by typing the name of the R object, h2oiris, which is simply an object that holds a reference to the H2O data. The R API queries H2O when we try to print it:
h2oiris
This returns the following output:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

[100 rows x 5 columns]
We can also check the levels of factor variables, such as the Species variable, as shown in the following code:
h2o.levels(h2oiris, 5)

[1] setosa versicolor
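Only two levels appear here because of the droplevels() call used when h2oiris was created. A short base R sketch (no H2O required) illustrates why that call matters: subsetting the rows of a data frame keeps every original factor level, including those that no longer occur. The variable names below are illustrative.

```r
# The first 100 rows of iris contain only setosa and versicolor
iris_subset <- iris[1:100, ]

# Subsetting alone keeps the unused "virginica" level
levels(iris_subset$Species)

# droplevels() removes levels that no longer appear in the data
iris_clean <- droplevels(iris_subset)
levels(iris_clean$Species)
```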
In real-world use, it is more likely that the data already exists somewhere. Rather than loading the data into R only to export it to H2O (a costly operation, as it creates an unnecessary copy of the data in R), we can load the data directly into H2O. First we will create a CSV file based on the built-in mtcars dataset, then we will use R to tell the H2O instance to read that file. Printing again shows the data:
write.csv(mtcars, file = "mtcars.csv")

h2omtcars <- h2o.importFile(
  path = "mtcars.csv")

h2omtcars

                 C1  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

[32 rows x 12 columns]
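The C1 column and the count of 12 columns (for a dataset with 11 variables) come from the CSV file itself: by default, write.csv() writes the row names as a first column with an empty header, which appears under the name C1 in the H2O output above. The base R sketch below inspects the header line to show this. It writes to a temporary file, an assumption made here to avoid overwriting mtcars.csv, and the variable names are illustrative.

```r
# Write mtcars to a temporary CSV, keeping the default row.names = TRUE
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, file = csv_path)

# Read only the header line and split it into its comma-separated fields
header_fields <- strsplit(readLines(csv_path, n = 1), ",")[[1]]

# One more field than mtcars has variables: the first is an empty quoted string
length(header_fields)
ncol(mtcars)
header_fields[1:3]
```

Passing row.names = FALSE to write.csv() would omit that column, at the cost of dropping the car names from the file.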
Finally, the data need not be located on the local disk. We can also ask H2O to read in data from a URL as shown in this last example, which uses a dataset made available from the UCLA Statistical Consulting Group:
h2obin <- h2o.importFile(
  path = "http://www.ats.ucla.edu/stat/data/binary.csv")

h2obin

  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2

[400 rows x 4 columns]