H2O is an open source machine learning system, developed in Java by H2O.ai (http://h2o.ai/). It offers a rich set of machine learning algorithms and a web-based user interface for data processing. It supports development in a range of languages: Java, Scala, Python, and R. It can also interface with Spark, HDFS, Amazon S3, SQL, and NoSQL databases. This chapter concentrates on H2O's integration with Apache Spark through the Sparkling Water component of H2O. A simple example, developed in Scala and based on real data, will be used to create a deep-learning model.
The next step will be to provide an overview of the H2O functionality and of the Sparkling Water architecture that will be used in this chapter.
Since this chapter can only examine, and use, a small part of H2O's functionality, I thought that it would be useful to list all of the functional areas that it covers. The list is taken from the H2O website at http://h2o.ai/product/algorithms/ and is organized around munging/wrangling data, modeling using the data, and scoring the resulting models:
Process | Model | Score tool
---|---|---
Data profiling | Generalized Linear Models (GLM) | Predict
Summary statistics | Decision trees | Confusion Matrix
Aggregate, filter, bin, and derive columns | Gradient Boosting (GBM) | AUC
Slice, log transform, and anonymize | K-Means | Hit Ratio
Variable creation | Anomaly detection | PCA Score
PCA | | Multi Model Scoring
Training and validation sampling plan | Naïve Bayes |
 | Grid search |
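To make two of the scoring entries in the list concrete, the following is a minimal sketch in plain Scala of what a binary confusion matrix and an AUC value measure. It does not use H2O or Sparkling Water; the object `ScoringSketch`, the case class `Confusion`, and the functions `confusion` and `auc` are hypothetical names introduced here for illustration only, and H2O's own implementations may differ in detail.

```scala
// Illustrative only: plain Scala, not H2O's API. All names are hypothetical.
object ScoringSketch {

  // Counts for a binary confusion matrix.
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int)

  // Build the confusion matrix by thresholding predicted probabilities.
  // Labels are expected to be 0 (negative) or 1 (positive).
  def confusion(labels: Seq[Int], scores: Seq[Double], threshold: Double = 0.5): Confusion = {
    require(labels.length == scores.length, "labels and scores must align")
    labels.zip(scores).foldLeft(Confusion(0, 0, 0, 0)) { case (c, (y, s)) =>
      val predicted = if (s >= threshold) 1 else 0
      (y, predicted) match {
        case (1, 1) => c.copy(tp = c.tp + 1)
        case (0, 1) => c.copy(fp = c.fp + 1)
        case (0, 0) => c.copy(tn = c.tn + 1)
        case (1, 0) => c.copy(fn = c.fn + 1)
        case _      => throw new IllegalArgumentException(s"label must be 0 or 1, got $y")
      }
    }
  }

  // AUC as the fraction of (positive, negative) pairs in which the positive
  // example receives the higher score; ties count as one half.
  // This pairwise form is O(n^2), which is acceptable for a small sketch.
  def auc(labels: Seq[Int], scores: Seq[Double]): Double = {
    require(labels.length == scores.length, "labels and scores must align")
    val pairs = labels.zip(scores)
    val pos = pairs.collect { case (1, s) => s }
    val neg = pairs.collect { case (0, s) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for { p <- pos; n <- neg } yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.length.toLong * neg.length)
  }
}
```

For example, with labels `Seq(1, 0, 1, 0)` and scores `Seq(0.9, 0.4, 0.35, 0.8)`, `confusion` at the default threshold returns `Confusion(1, 1, 1, 1)`, and `auc` returns `0.5`, because only two of the four positive/negative pairs are ranked correctly.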
The following section will explain the environment used for the Spark and H2O examples in this chapter, along with some of the problems encountered.