Chapter 7. Extending Spark with H2O

H2O is an open source machine learning system, developed in Java by H2O.ai (http://h2o.ai/). It offers a rich set of machine learning algorithms and a web-based user interface for data processing. It supports development in a range of languages: Java, Scala, Python, and R. It can also interface with Spark, HDFS, Amazon S3, SQL, and NoSQL databases. This chapter concentrates on H2O's integration with Apache Spark through the Sparkling Water component of H2O. A simple example, developed in Scala and based on real data, will be used to create a deep-learning model. This chapter will:

Examine the H2O functionality
Consider the necessary Spark H2O environment
Examine the Sparkling Water architecture
Introduce and use the H2O Flow interface
Introduce deep learning with an example
Consider performance tuning
Examine data quality

The next section provides an overview of the H2O functionality and of the Sparkling Water architecture used in this chapter.

Overview

Since it is only possible to examine, and use, a small part of H2O's functionality in this chapter, I thought that it would be useful to list all of the functional areas that it covers. This list is taken from the http://h2o.ai/ website at http://h2o.ai/product/algorithms/ and is organized around three activities: munging/wrangling data, modeling using the data, and scoring the resulting models:

Process	Model	Score tool
Data profiling	Generalized Linear Models (GLM)	Predict
Summary statistics	Decision trees	Confusion Matrix
Aggregate, filter, bin, and derive columns	Gradient Boosting (GBM)	AUC
Slice, log transform, and anonymize	K-Means	Hit Ratio
Variable creation	Anomaly detection	PCA Score
PCA	Deep learning	Multi Model Scoring
Training and validation sampling plan	Naïve Bayes
	Grid search

The following section explains the environment used for the Spark and H2O examples in this chapter, along with some of the problems encountered.
