Introduction to machine learning

Spark MLlib is a general-purpose machine learning library that brings all the benefits of Spark (distributed computing, scalability, and fault tolerance) along with easy interoperability among the different Spark modules and other libraries. Machine learning is not a new concept, and it was certainly not developed by Spark alone. What makes Spark MLlib stand out is its ease of use and the uniform way in which any ML algorithm can be developed using pipelines. The pipeline concept itself is used by the scikit-learn library, and Apache Spark has done a brilliant job of applying the same concept in a distributed mode. Generally, Spark's machine learning module ships with:

  1. Common machine learning algorithms.
  2. Tools to load, extract, transform, and select features.
  3. The ability to chain multiple operations using pipeline.
  4. The ability to save and load algorithms, models, and pipelines.
  5. The capability of performing linear algebra and statistical operations.
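
To make the idea of chaining operations concrete, here is a minimal sketch of the pipeline pattern in plain Python. It is not Spark's API: the classes `Tokenizer`, `VocabularyIndexer`, `VocabularyModel`, `Pipeline`, and `PipelineModel` are hypothetical stand-ins written for illustration, and rows are plain dictionaries rather than a distributed dataset. The sketch assumes the usual split between stages that only transform data and stages that must first learn something from the data.

```python
from typing import Dict, List

Row = Dict[str, object]  # assumption: one row is a plain dict of column -> value


class Tokenizer:
    """Transformer-style stage: needs no training, only maps rows to rows."""

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "words": str(r["text"]).lower().split()} for r in rows]


class VocabularyModel:
    """Fitted stage produced by VocabularyIndexer.fit()."""

    def __init__(self, index: Dict[str, int]) -> None:
        self.index = index

    def transform(self, rows: List[Row]) -> List[Row]:
        # Words that were not seen during fitting are silently dropped.
        return [
            {**r, "ids": [self.index[w] for w in r["words"] if w in self.index]}
            for r in rows
        ]


class VocabularyIndexer:
    """Estimator-style stage: learns a word -> index mapping from the data."""

    def fit(self, rows: List[Row]) -> VocabularyModel:
        vocab = sorted({w for r in rows for w in r["words"]})
        return VocabularyModel({w: i for i, w in enumerate(vocab)})


class PipelineModel:
    """A fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fit() trains each estimator on the output of the previous stage."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted, current = [], rows
        for stage in self.stages:
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            current = model.transform(current)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"text": "Spark ML"}, {"text": "spark pipeline"}]
model = Pipeline([Tokenizer(), VocabularyIndexer()]).fit(train)
result = model.transform([{"text": "ML pipeline unseen"}])
print(result[0]["ids"])
```

The point of the pattern is that the whole chain is fitted once and then reused as a single object on new data, which is also what makes saving and loading an entire pipeline practical.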

Over the years, just like machine learning itself, Spark's implementation of it has gone through a huge transformation. MLlib originally exposed RDD-based APIs in the spark.mllib package. The DataFrame-based spark.ml package arrived in Spark 1.2 and, as of Spark 2.0, became the primary API, with the RDD-based API placed in maintenance mode. This shift from RDD to DataFrame has many advantages, such as:

  • The benefits of the Tungsten execution engine and the Catalyst optimizer apply to MLlib operations
  • Various data sources that are accessible via DataFrame can now be used directly in MLlib
  • Feature extraction and transformation can use Spark SQL operations
  • The DataFrame-based API provides a uniform, user-friendly interface both within MLlib operations and across Spark modules

At an abstract level, Spark's ML package implementation can be thought of as a set of UDF (user-defined function) operations over a DataFrame.
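
One way to picture this is the following plain-Python sketch, in which each step is just a function applied to every row that appends a new column. It is only a mental model under simplifying assumptions: `with_column` is a hypothetical helper, rows are ordinary dictionaries, and nothing here is distributed or part of Spark's API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # assumption: one row is a plain dict of column -> value


def with_column(rows: List[Row], name: str, udf: Callable[[Row], object]) -> List[Row]:
    """Return new rows with column `name` set to udf(row); input rows are not mutated."""
    return [{**row, name: udf(row)} for row in rows]


rows = [{"text": "spark ml pipelines"}, {"text": "rdd"}]

# Each step reads existing columns and appends one output column.
tokenized = with_column(rows, "words", lambda r: r["text"].split())
featurized = with_column(tokenized, "n_words", lambda r: len(r["words"]))

print(featurized)
```

Every step leaves the earlier columns in place, so later steps can build on the output of earlier ones.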

Before moving further with Spark's implementation of machine learning, let's first understand machine learning as a field of study and its related concepts.
