MLlib is one of the flagship components of the Spark ecosystem. It provides a scalable, high-performance interface for resource-intensive machine learning tasks in Spark. MLlib can also read data natively from HDFS, HBase, and the other storage systems that Spark supports. Because of this versatility, users do not need a pre-existing Hadoop environment to start using the algorithms built into MLlib. Some of the supported algorithms in MLlib include:
- Classification: logistic regression
- Regression: generalized linear regression, survival regression and others
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixtures and others
- Topic modeling: latent Dirichlet allocation (LDA)
- Frequent pattern mining: frequent itemsets and association rules (FP-growth)
ML workflow utilities include:
- Feature transformations: standardization, normalization and others
- ML Pipeline construction
- Model evaluation and hyper-parameter tuning
- ML persistence: saving and loading models and Pipelines