MLlib

MLlib is one of the flagship components of the Spark ecosystem. It provides a scalable, high-performance interface for running resource-intensive machine learning tasks in Spark. MLlib can also connect natively to HDFS, HBase, and the other underlying storage systems that Spark supports. Because of this versatility, users do not need a pre-existing Hadoop environment to start using the algorithms built into MLlib. Some of the algorithms supported in MLlib include:

Classification: logistic regression
Regression: generalized linear regression, survival regression and others
Tree-based methods: decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures and others
Topic modeling: latent Dirichlet allocation (LDA)
Frequent pattern mining: frequent itemsets and association rules (implemented in MLlib with FP-growth rather than Apriori)

ML workflow utilities include:

Feature transformations: standardization, normalization, and others
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and Pipelines
