Here, we compare the K-means implementation of H2O with that of Scikit-learn. More specifically, we run the mini-batch experiment using H2OKMeansEstimator, the K-means object available in H2O. The setup is similar to the one shown in the PCA with H2O section, and the experiment is the same as in the preceding section:
In:
import os
import tempfile
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h2o
from h2o.estimators.kmeans import H2OKMeansEstimator

h2o.init(max_mem_size_GB=4)

def testH2O_kmeans(X, k):
    # Dump the matrix to a temporary CSV file so that H2O can import it
    temp_file = tempfile.NamedTemporaryFile().name
    np.savetxt(temp_file, np.c_[X], delimiter=",")
    cls = H2OKMeansEstimator(k=k, standardize=True)
    blobdata = h2o.import_file(temp_file)
    # Time only the training step, not the export/import of the data
    tik = time.time()
    cls.train(x=range(blobdata.ncol), training_frame=blobdata)
    fit_time = time.time() - tik
    os.remove(temp_file)
    return fit_time

# census_csv_file is the path defined in the preceding section
piece_of_dataset = pd.read_csv(census_csv_file, iterator=True).get_chunk(500000).drop('caseid', axis=1).as_matrix()

time_results = {4: [], 8: [], 12: []}
dataset_sizes = [20000, 200000, 500000]

for dataset_size in dataset_sizes:
    print "Dataset size:", dataset_size
    X = piece_of_dataset[:dataset_size, :]
    for K in [4, 8, 12]:
        print "K:", K
        fit_time = testH2O_kmeans(X, K)
        time_results[K].append(fit_time)

plt.plot(dataset_sizes, time_results[4], 'r', label='K=4')
plt.plot(dataset_sizes, time_results[8], 'g', label='K=8')
plt.plot(dataset_sizes, time_results[12], 'b', label='K=12')
plt.xlabel("Training set size")
plt.ylabel("Training time")
plt.legend(loc=0)
plt.show()

h2o.shutdown(prompt=False)

Out:
Thanks to the H2O architecture, its implementation of K-means is fast and scalable: it clusters the 500K-point dataset in less than 30 seconds for every selected value of K.
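The timing loop used in this experiment does not depend on H2O itself, so it can be factored into a small helper and reused with any clustering backend. The following is a minimal sketch under stated assumptions: the names time_kmeans_fits and placeholder_fit are hypothetical and do not appear in the original listing, and the placeholder merely stands in for a real fit function such as testH2O_kmeans. Note also that this helper times the whole callable, whereas testH2O_kmeans times only the train call, so any data-loading cost inside the callable would be included.

```python
import time


def time_kmeans_fits(fit, rows, dataset_sizes, ks):
    """Time fit(subset, k) for every (dataset size, k) pair.

    Returns a dict mapping each k to a list of wall-clock times in seconds,
    ordered like dataset_sizes (the same layout as time_results above).
    """
    results = {k: [] for k in ks}
    for size in dataset_sizes:
        subset = rows[:size]  # keep only the first `size` observations
        for k in ks:
            tik = time.time()
            fit(subset, k)
            results[k].append(time.time() - tik)
    return results


def placeholder_fit(subset, k):
    """Hypothetical stand-in for a real K-means fit; it only touches the data."""
    return sum(len(row) for row in subset) * k


# Tiny synthetic input so the sketch runs anywhere: 1,000 rows of 3 features.
rows = [[float(i), float(i % 7), float(i % 13)] for i in range(1000)]
timings = time_kmeans_fits(placeholder_fit, rows, [100, 500, 1000], [4, 8, 12])
```

With a real backend, the placeholder would be replaced by a function that fits the model on the subset; the returned dictionary can then be passed to the plotting code unchanged.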