Chapter 7. Extending Spark with H2O

H2O is an open source machine learning system, developed in Java by H2O.ai (http://h2o.ai/). It offers a rich set of machine learning algorithms and a web-based user interface for data processing. It supports development in a range of languages: Java, Scala, Python, and R. It can also interface with Spark, HDFS, Amazon S3, SQL, and NoSQL databases. This chapter concentrates on H2O's integration with Apache Spark through the Sparkling Water component of H2O. A simple example, developed in Scala and based on real data, will be used to create a deep learning model. This chapter will:

  • Examine the H2O functionality
  • Consider the necessary Spark H2O environment
  • Examine the Sparkling Water architecture
  • Introduce and use the H2O Flow interface
  • Introduce deep learning with an example
  • Consider performance tuning
  • Examine data quality

The next section provides an overview of the H2O functionality and the Sparkling Water architecture used in this chapter.

Overview

Since it is only possible to examine and use a small part of H2O's functionality in this chapter, I thought that it would be useful to list all of the functional areas that it covers. This list is taken from the H2O website at http://h2o.ai/product/algorithms/ and is organized around three activities: munging/wrangling data, modeling using the data, and scoring the resulting models:

Process                                     Model                            The score tool
------------------------------------------  -------------------------------  -------------------
Data profiling                              Generalized Linear Models (GLM)  Predict
Summary statistics                          Decision trees                   Confusion Matrix
Aggregate, filter, bin, and derive columns  Gradient Boosting (GBM)          AUC
Slice, log transform, and anonymize         K-Means                          Hit Ratio
Variable creation                           Anomaly detection                PCA Score
PCA                                         Deep learning                    Multi Model Scoring
Training and validation sampling plan       Naïve Bayes
                                            Grid search

The following section explains the environment used for the Spark and H2O examples in this chapter, along with some of the problems encountered.
