Architecture

The diagrams in this section are sourced from the http://h2o.ai/ website, at http://h2o.ai/blog/2014/09/how-sparkling-water-brings-h2o-to-spark/, to provide a clear description of the way in which H2O Sparkling Water extends the functionality of Apache Spark. Both H2O and Spark are open source systems. Spark MLlib contains a great deal of functionality, while H2O extends it with a wide range of extra functionality, including deep learning. H2O offers tools to munge (transform), model, and score data, and it also offers a web-based user interface for interaction.

The next diagram, borrowed from http://h2o.ai/, shows how H2O integrates with Spark. As we already know, Spark has master and worker servers; the workers create executors to do the actual work. The following steps occur when running a Sparkling Water-based application:

  1. Spark's submit command sends the Sparkling Water JAR to the Spark master.
  2. The Spark master starts the workers and distributes the JAR file.
  3. The Spark workers start executor JVMs to carry out the work.
  4. Each Spark executor starts an H2O instance.

The H2O instance is embedded within the executor JVM, and so it shares JVM heap space with Spark. When all of the H2O instances have started, they form an H2O cluster, and the H2O Flow web interface is made available.
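In code terms, the cluster start-up described above is triggered from a Spark application by starting an H2O context. The following sketch shows the general idiom; it assumes the Sparkling Water 1.x Scala API and an existing SparkContext named sc, so treat the exact names as illustrative rather than definitive:

  import org.apache.spark.h2o._

  // Starting the H2OContext launches an H2O instance inside each
  // Spark executor JVM; together these instances form the H2O cluster.
  val h2oContext = new H2OContext(sc).start()

  // Once the cluster has formed, the H2O Flow web interface is
  // reachable at the address reported by the context.

This is a sketch under the stated assumptions, not the chapter's working code; the full example appears later.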

(Diagram: H2O start-up within the Spark architecture)

The preceding diagram explains how H2O fits into the Apache Spark architecture, and how it starts, but what about data sharing? How does data pass between Spark and H2O? The following diagram explains this:

(Diagram: data sharing between Spark and H2O)

A new H2O RDD data structure has been created for H2O and Sparkling Water. It is a layer on top of an H2O frame, each column of which represents a data item and is independently compressed to provide the best compression ratio.
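As a sketch of this layering, the implicit conversions supplied by the H2O context move data between the two systems. The names below follow the Sparkling Water 1.x API and are illustrative assumptions, not code from this chapter's example:

  import h2oContext._   // assumes a started H2OContext; brings the
                        // implicit RDD/frame conversions into scope

  // Spark schema RDD -> H2O data frame: each column becomes an
  // independently compressed H2O vector
  val h2oFrame: DataFrame = schemaRdd

  // H2O data frame -> Spark schema RDD, for further Spark processing
  val backToSpark = asSchemaRDD(h2oFrame)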

In the deep learning Scala code presented later in this chapter, you will see that a data frame is created implicitly from a Spark schema RDD, and that a columnar data item, income, is enumerated. I won't dwell on this now, as it will be explained later, but it is a practical example of the preceding architecture:

  val testFrame: DataFrame = schemaRddTest
  testFrame.replace( testFrame.find("income"), testFrame.vec("income").toEnum )

In the Scala-based example that will be tackled in this chapter, the following actions will take place:

  1. Data is sourced from HDFS, and is stored in a Spark RDD.
  2. Spark SQL is used to filter data.
  3. The Spark schema RDD is converted into an H2O RDD.
  4. The H2O-based processing and modeling occurs.
  5. The results are passed back to Spark for accuracy checking.
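These five steps can be sketched in outline as follows. All of the names here (paths, table, and column names) are placeholders for illustration; the full, working code appears later in the chapter:

  // 1. Source the data from HDFS into a Spark RDD
  val rawData = sc.textFile("hdfs://hdfs-server:8020/data/income.csv")

  // 2. Filter the data with Spark SQL (after registering a schema RDD as a table)
  val filtered = sqlContext.sql("SELECT * FROM income WHERE age > 0")

  // 3. Convert the Spark schema RDD into an H2O data frame (implicit conversion)
  val trainFrame: DataFrame = filtered

  // 4. Run H2O-based processing and modelling, for example deep learning,
  //    against trainFrame

  // 5. Convert the predictions back to a Spark RDD for accuracy checking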

To this point, the general architecture of H2O has been examined, and the product has been sourced for use. The development environment has been explained, and the process by which H2O and Spark integrate has been considered. Now, it is time to delve into a practical example of the use of H2O. First though, some real-world data must be sourced for modeling purposes.
