Spark ML

Spark ML provides a set of machine learning APIs that let users quickly assemble and configure practical machine learning pipelines on top of datasets. It offers a uniform set of high-level APIs built on DataFrames rather than RDDs, which helps users create and tune those pipelines. By standardizing the APIs of machine learning algorithms, Spark ML makes it easier for data scientists to combine multiple algorithms into a single pipeline or data workflow. Spark ML relies on the DataFrame and Dataset abstractions; the Dataset API is the newer of the two, introduced as experimental in Spark 1.6 and used throughout Spark 2.0+.

In Scala and Java, DataFrame and Dataset have been unified: in Scala, DataFrame is just a type alias for Dataset[Row], and in Java the same type is written Dataset<Row>. Python and R have no compile-time type safety, so DataFrame is the main programming interface in those languages.

The datasets hold diverse data types: for example, a dataset can have columns storing text, feature vectors, and true labels. Spark ML uses a transformer to transform one DataFrame into another, while an estimator is fit on a DataFrame to produce a new transformer. The pipeline API chains multiple transformers and estimators together to specify an ML data workflow. Finally, the concept of a parameter gives all transformers and estimators a common API for specifying their settings during the development of an ML application.
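The relationship between these concepts can be sketched without Spark at all. The following is a toy, dependency-free model of the pattern, not the Spark ML API: a "DataFrame" is just a list of dictionaries, and the class names Tokenizer, MeanLength, and MeanLengthModel, along with their input_col and output_col arguments, are hypothetical stand-ins chosen for illustration. It shows an estimator producing a transformer when fit, and a pipeline chaining both kinds of stage.

```python
# Toy model of the transformer / estimator / pipeline pattern.
# This is NOT the Spark API: a "DataFrame" here is a plain list of dicts.


class Transformer:
    """Maps one dataset to a new dataset."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer."""

    def fit(self, rows):
        raise NotImplementedError


class Tokenizer(Transformer):
    """Hypothetical stage: adds a column of lowercase whitespace tokens."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        return [{**r, self.output_col: r[self.input_col].lower().split()}
                for r in rows]


class MeanLengthModel(Transformer):
    """Fitted model: flags rows with more tokens than the learned mean."""

    def __init__(self, input_col, output_col, mean_length):
        self.input_col = input_col
        self.output_col = output_col
        self.mean_length = mean_length

    def transform(self, rows):
        return [{**r, self.output_col: len(r[self.input_col]) > self.mean_length}
                for r in rows]


class MeanLength(Estimator):
    """Hypothetical estimator: learns the mean token count of a column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(len(r[self.input_col]) for r in rows) / len(rows)
        return MeanLengthModel(self.input_col, self.output_col, mean)


class PipelineModel(Transformer):
    """A fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Chains stages; fitting replaces each estimator with its fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # estimator -> transformer
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "spark ml"}, {"text": "a b c d"}]
pipeline = Pipeline([Tokenizer("text", "words"), MeanLength("words", "is_long")])
model = pipeline.fit(training)
result = model.transform([{"text": "one two three four five"}])
```

Note that the pipeline itself is an estimator and the fitted pipeline is a transformer, so a whole workflow can be fit and applied through the same two methods as any single stage. In this sketch the settings are plain constructor arguments; Spark ML instead exposes them through its shared parameter API.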
