Summary

You've learned that, as in many other areas of Apache Spark, the introduction of DataFrames has led to complementary frameworks that no longer use RDDs directly. Machine learning is no exception, but there is more to it than that. Pipelines take machine learning in Apache Spark to the next level because they dramatically improve the productivity of the data scientist.

All intermediate objects are compatible with one another, and the underlying concepts are well thought out. This makes it easy to build your own stacked and bagged models with the full support of the underlying performance optimizations provided by Tungsten and Catalyst.

Finally, we applied the concepts we discussed to a real dataset from a Kaggle competition, which is a good starting point for your own machine learning project with Apache SparkML. The next chapter covers Apache SystemML, a third-party machine learning library for Apache Spark. Let's see why it is useful and how it differs from SparkML.
