Spark ML

MLlib is a distributed machine learning framework built on top of Spark core. It handles machine learning models used for transforming datasets in the form of RDDs. Spark MLlib provides a library of machine learning algorithms such as logistic regression, Naive Bayes classification, Support Vector Machines (SVMs), decision trees, random forests, linear regression, Alternating Least Squares (ALS), and k-means clustering. Spark ML integrates well with Spark core, Spark Streaming, Spark SQL, and GraphX to provide an integrated platform where data can be processed in real time or in batch.

We cover Spark ML in detail in Chapter 11, Learning Machine Learning - Spark MLlib and ML.

In addition, PySpark and SparkR are available as means to interact with Spark clusters through the Python and R APIs. The Python and R integrations open Spark up to a wide population of data scientists and machine learning modelers, because Python and R are the languages data scientists most commonly use. Supporting both spares them the costly process of learning a new language, Scala. Another reason is that a lot of existing code may already be written in Python and R; leveraging some of that code improves team productivity compared with building everything again from scratch.

Notebook technologies such as Jupyter and Zeppelin are increasingly popular and widely used. They make it significantly easier to interact with Spark in general, and they are particularly useful in Spark ML, where a lot of hypotheses and analyses are expected.
