Connecting R and H2O

H2O is Java-based software with an R wrapper, so using it from R takes two steps: we initialize an instance of H2O, and we connect R to that instance so that R can pass data and model commands to it. In this section, we will show how to get everything set up to train a model using H2O.

Initializing H2O

To initialize an H2O cluster, we use the h2o.init() function. Initializing a cluster also starts a lightweight web server that allows interaction with the software via a local webpage. The h2o.init() function generally has sensible default values, but many aspects of it can be customized. Two that are particularly worth setting are the number of cores/threads H2O may use and the maximum amount of memory it may use, which are controlled by the nthreads and max_mem_size arguments, respectively. In the code that follows, we initialize an H2O cluster to use two threads and up to three gigabytes of memory. After the call, R prints the location of the log files, the Java version, and details about the cluster:

# Load the h2o package (skip this line if it is already attached)
library(h2o)

cl <- h2o.init(
  max_mem_size = "3G",
  nthreads = 2)

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\jwile\AppData\Local\Temp\RtmpuelhZm/h2o_jwile_started_from_r.out
    C:\Users\jwile\AppData\Local\Temp\RtmpuelhZm/h2o_jwile_started_from_r.err

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b18)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b18, mixed mode)

.Successfully connected to http://127.0.0.1:54321/ 

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 735 milliseconds 
    H2O cluster version:        3.6.0.8 
    H2O cluster name:           H2O_started_from_R_jwile_ndx127 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   2.67 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 

Once the cluster is initialized, we can interact with it either from R or through the web interface served on the local host (127.0.0.1:54321), which is shown in Figure 1.6:

Figure 1.6: Initializing H2O
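Beyond the web interface, the state of the cluster can also be checked from R. The following is a minimal sketch rather than a required step; it assumes the h2o package is installed and uses h2o.clusterIsUp(), h2o.clusterInfo(), and h2o.shutdown(). The variable name up is only illustrative.

```r
library(h2o)

# Connect to a running cluster, or start one with the limits used above.
# If a cluster is already running, h2o.init() simply connects to it.
h2o.init(max_mem_size = "3G", nthreads = 2)

# TRUE if R can still reach the cluster
up <- h2o.clusterIsUp()
print(up)

# Reprint the version, memory, and core information
h2o.clusterInfo()

# Stop the cluster when finished; prompt = FALSE skips the confirmation
h2o.shutdown(prompt = FALSE)
```

Shutting down discards everything stored on the cluster, so the last call belongs at the end of a session, not between the steps that follow.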

Linking datasets to an H2O cluster

There are a couple of ways to get data into an H2O cluster. If the dataset is already loaded into R, you can simply use the as.h2o() function as shown in the following code:

h2oiris <- as.h2o(
  droplevels(iris[1:100, ]))

We can check the result by typing the name of the R object, h2oiris. This object holds only a reference to the data stored in H2O; when we print it, the R API queries H2O for the rows to display:

h2oiris

This returns the following output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

[100 rows x 5 columns]

We can also check the levels of factor variables, such as the Species variable, as shown in the following:

h2o.levels(h2oiris, 5)
[1] setosa     versicolor
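Because the R object is only a reference, it can help to see what lives on the cluster and how to bring data back into R. The sketch below assumes a running cluster; the key name "iris_subset" and the variable iris_back are illustrative and do not appear elsewhere in the text.

```r
library(h2o)
h2o.init()

# destination_frame sets the key under which H2O stores the data;
# "iris_subset" is an illustrative name
h2oiris <- as.h2o(
  droplevels(iris[1:100, ]),
  destination_frame = "iris_subset")

# List the keys of the objects currently held by the cluster
h2o.ls()

# Dimensions and column names are looked up through the reference
dim(h2oiris)
colnames(h2oiris)

# Copy the data back into an ordinary R data frame
iris_back <- as.data.frame(h2oiris)
str(iris_back)
```

Pulling data back with as.data.frame() creates a full copy in R, so it is best kept for small frames or final results.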

In real-world use, it is more likely that the data already exists somewhere. Rather than loading the data into R only to export it to H2O (a costly operation, as it creates an unnecessary copy of the data in R), we can load the data directly into H2O. First, we create a CSV file based on the built-in mtcars dataset; then, from R, we tell the H2O instance to read the file. Printing the result again shows the data. Because write.csv() writes the row names as an unnamed first column, that column appears as C1:

write.csv(mtcars, file = "mtcars.csv")

h2omtcars <- h2o.importFile(
  path = "mtcars.csv")

h2omtcars
                 C1  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
[32 rows x 12 columns]
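If the generic C1 column name is unwanted, one option is to store the car names in a named column before writing the file. The following sketch assumes a running cluster; the file name mtcars_named.csv, the column name model, and the choice to treat cyl as categorical are illustrative assumptions, not part of the original example.

```r
library(h2o)
h2o.init()

# Store the car names in a named column instead of as row names
cars <- data.frame(model = rownames(mtcars), mtcars)
write.csv(cars, file = "mtcars_named.csv", row.names = FALSE)

h2omtcars <- h2o.importFile(
  path = "mtcars_named.csv")

# H2O guesses column types while parsing, and cyl is read as numeric.
# Convert it to a factor if it should be treated as categorical.
h2omtcars$cyl <- as.factor(h2omtcars$cyl)

# cyl is now the third column (model, mpg, cyl, ...)
h2o.levels(h2omtcars, 3)
```

Note that h2o.importFile() asks the H2O server to read the path, which works here because the cluster runs on the same machine as R. When the file exists only on the machine running R, h2o.uploadFile() is the usual alternative.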

Finally, the data need not be located on the local disk. We can also ask H2O to read data from a URL, as shown in this last example, which uses a dataset made available by the UCLA Statistical Consulting Group:

h2obin <- h2o.importFile(
  path = "http://www.ats.ucla.edu/stat/data/binary.csv")

h2obin
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2

[400 rows x 4 columns]
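A remote read can fail if the server is unreachable or the file has moved, so it may be worth guarding the import. The sketch below assumes a running cluster; import_or_null() is a hypothetical helper written for this illustration, not a function from the h2o package, and converting admit to a factor is an assumption about how the column will be used later.

```r
library(h2o)
h2o.init()

# Hypothetical helper: returns an H2OFrame, or NULL if the import fails
import_or_null <- function(path) {
  tryCatch(
    h2o.importFile(path = path),
    error = function(e) {
      message("Import failed: ", conditionMessage(e))
      NULL
    })
}

h2obin <- import_or_null(
  "http://www.ats.ucla.edu/stat/data/binary.csv")

if (!is.null(h2obin)) {
  # admit is read as numeric (0/1); convert it to a factor
  # if it will serve as a classification outcome
  h2obin$admit <- as.factor(h2obin$admit)
  print(h2o.levels(h2obin, 1))
}
```

Returning NULL keeps the script running so that the caller can decide what to do next, for example falling back to a local copy of the file.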