Summary

My continuing theme, when examining both Apache Hadoop and Spark, is that neither of these systems stands alone. They need to be integrated with other tools to form ETL-based processing systems: data is sourced and processed in Spark, and then passed to the next link in the ETL chain, or stored. I hope that this chapter has shown you that Spark functionality can be extended with extra libraries and systems such as H2O.

Although Apache Spark MLlib (the machine learning library) offers a lot of functionality, the combination of H2O Sparkling Water and the Flow web interface provides an extra wealth of data analysis and modeling options. Using Flow, you can also process your data visually and interactively. Even though this chapter cannot cover everything that H2O offers, I hope it has shown that combining Spark and H2O widens your data processing possibilities.
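As a reminder of how little glue code the integration needs, the following is a minimal Scala sketch of starting Sparkling Water from an existing Spark session and handing a DataFrame to H2O. The application name and file path are placeholders, and the exact H2OContext creation call varies between Sparkling Water releases, so treat this as an illustration rather than the chapter's exact code.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.h2o.H2OContext

    object SparklingWaterSketch {
      def main(args: Array[String]): Unit = {
        // Reuse (or create) a Spark session; the application name is arbitrary
        val spark = SparkSession.builder()
          .appName("sparkling-water-sketch")
          .getOrCreate()

        // Start H2O on the Spark cluster; this also exposes the Flow web UI
        val h2oContext = H2OContext.getOrCreate(spark)

        // Any Spark DataFrame can be converted to an H2OFrame for modeling
        val df = spark.read.option("header", "true").csv("data.csv") // placeholder path
        val h2oFrame = h2oContext.asH2OFrame(df)

        // Printing the context shows, among other details, the Flow UI URL
        println(h2oContext)
      }
    }

Once the H2OContext is running, the Flow interface is reachable in a browser at the URL it reports, and the converted frame is available there for interactive inspection and modeling.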

I hope that you have found this chapter useful. As a next step, you might consider visiting the http://h2o.ai/ website or joining the H2O Google group at https://groups.google.com/forum/#!forum/h2ostream.

The next chapter will examine the Spark-based service https://databricks.com/, which uses Amazon AWS storage to create Spark clusters in the cloud.
