Summary

A continuing theme in our examination of both Apache Hadoop and Spark is that none of these systems stands alone. They need to be integrated to form ETL-based processing systems. Data needs to be sourced and processed in Spark and then either passed to the next link in the ETL chain or stored. We hope that this chapter showed you that Spark functionality can be extended with extra libraries and systems such as H2O and DeepLearning4j. Even Apache SystemML now supports deep learning, and TensorFlow can be run within Apache Spark using TensorFrames and TensorSpark.

Although Apache Spark MLlib and SparkML have a lot of functionality, the combination of H2O Sparkling Water and the Flow web interface provides a wealth of additional data analysis and modeling options. Using Flow, you can also process your data visually and interactively. Although this chapter cannot cover everything these libraries offer, we hope it shows you that combining Spark with third-party libraries widens your data processing possibilities.

We hope that you found this chapter useful. In the next two chapters, we will cover graph processing, so stay tuned.
