We can also use the PCA implementation provided by H2O, the framework we introduced in the previous chapter and have mentioned throughout the book.
With H2O, we first need to start the server with the init method. Then, we dump the dataset to a file (specifically, a CSV file) and run the PCA analysis. As the last step, we shut down the server.
We try this implementation on two of the biggest datasets seen so far: the one with 100K observations and 100 features, and the one with 10K observations and 2,500 features:
In:
import os
import time
import tempfile

import h2o
import numpy as np
from h2o.transforms.decomposition import H2OPCA
from sklearn.datasets import make_blobs

h2o.init(max_mem_size_GB=4)

def testH2O_pca(nrows, ncols, k=20):
    # Generate the dataset and dump it to a temporary CSV file
    temp_file = tempfile.NamedTemporaryFile().name
    X, _ = make_blobs(nrows, n_features=ncols, random_state=101)
    np.savetxt(temp_file, np.c_[X], delimiter=",")
    del X

    pca = H2OPCA(k=k, transform="NONE", pca_method="Power")
    tik = time.time()
    # Train on all ncols feature columns (a hard-coded range(100) here
    # would silently ignore most columns of the 2,500-feature dataset)
    pca.train(x=list(range(ncols)),
              training_frame=h2o.import_file(temp_file))
    print("H2OPCA on matrix", (nrows, ncols), "done in",
          time.time() - tik, "seconds")
    os.remove(temp_file)

testH2O_pca(100000, 100)
testH2O_pca(10000, 2500)
h2o.shutdown(prompt=False)

Out:
[...] H2OPCA on matrix (100000, 100) done in 12.9560530186 seconds
[...] H2OPCA on matrix (10000, 2500) done in 10.1429388523 seconds
As you can see, H2O is very fast in both cases, with timings comparable to (if not better than) Scikit-learn's.