PCA with H2O

We can also use the PCA implementation provided by H2O. (We saw H2O in the previous chapter and have mentioned it throughout the book.)

With H2O, we first need to start the server with the init method. Then, we dump the dataset to a file (precisely, a CSV file) and finally run the PCA analysis. As the last step, we shut down the server.

We're trying this implementation on some of the biggest datasets seen so far—the one with 100K observations and 100 features and the one with 10K observations and 2,500 features:

In: import os
import tempfile
import time

import numpy as np
from sklearn.datasets import make_blobs

import h2o
from h2o.transforms.decomposition import H2OPCA

h2o.init(max_mem_size_GB=4)

def testH2O_pca(nrows, ncols, k=20):
    # Write a synthetic dataset to a temporary CSV file for H2O to import
    temp_file = tempfile.NamedTemporaryFile().name
    X, _ = make_blobs(nrows, n_features=ncols, random_state=101)
    np.savetxt(temp_file, X, delimiter=",")
    del X

    pca = H2OPCA(k=k, transform="NONE", pca_method="Power")
    tik = time.time()
    # Use all ncols columns; the timing includes reading the CSV file
    pca.train(x=list(range(ncols)),
              training_frame=h2o.import_file(temp_file))

    print("H2OPCA on matrix %s done in %s seconds" %
          ((nrows, ncols), time.time() - tik))
    os.remove(temp_file)

testH2O_pca(100000, 100)
testH2O_pca(10000, 2500)
h2o.shutdown(prompt=False)

Out:[...]
H2OPCA on matrix  (100000, 100) done in  12.9560530186 seconds
[...]
H2OPCA on matrix  (10000, 2500) done in  10.1429388523 seconds

As you can see, in both cases H2O runs fast, with timings comparable to (if not better than) those of Scikit-learn.
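To check the comparison on your own machine, the following is a minimal sketch that times Scikit-learn's PCA on synthetic data of the same shape. The helper name time_sklearn_pca is hypothetical, and the sketch assumes the default PCA solver; it is not necessarily the configuration used for the Scikit-learn timings elsewhere in the book.

```python
import time

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA


def time_sklearn_pca(nrows, ncols, k=20):
    """Fit a k-component PCA on synthetic blobs.

    Returns (elapsed seconds, shape of the reduced matrix).
    time_sklearn_pca is a hypothetical helper, not part of any library.
    """
    # Same generator and seed as the H2O test, so the input is comparable
    X, _ = make_blobs(n_samples=nrows, n_features=ncols, random_state=101)
    pca = PCA(n_components=k)
    tik = time.time()
    X_reduced = pca.fit_transform(X)
    elapsed = time.time() - tik
    return elapsed, X_reduced.shape
```

Calling time_sklearn_pca(100000, 100) and time_sklearn_pca(10000, 2500) reproduces the two dataset shapes used for H2O. Keep in mind that the H2O timing also includes reading the CSV file, because h2o.import_file runs inside the timed call, whereas this sketch times only the in-memory fit.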
