Chapter 4. Machine Learning Algorithms

In this chapter, we continue with the following topics:

Hands on with Spark/MLlib

In the first chapter, we set up Apache Spark and also discussed an example using Spark/MLlib. Now is the time to explore Spark/MLlib in detail.

The following are the different features provided by Spark/MLlib:

Data types: Vector, LabeledPoint, Matrix, and distributed matrices (BlockMatrix, RowMatrix, IndexedRowMatrix, and CoordinateMatrix)
Statistics: Summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
Feature extraction: TF-IDF (HashingTF and IDF), scaling, normalization, and Word2Vec
Classification/regression: Linear regression (LinearRegressionWithSGD, LassoWithSGD, and RidgeRegressionWithSGD), logistic regression (SGD, LBFGS), SVM, Naive Bayes, decision tree, ensembles (RandomForest and GradientBoostedTrees)
Clustering: K-Means, Gaussian mixture model/expectation-maximization, power iteration clustering, LDA (EMLDAOptimizer/OnlineLDAOptimizer), streaming K-Means
Association analysis: Frequent pattern mining (FPGrowth)
Dimensionality reduction: SVD and PCA
Recommendation: Collaborative filtering
Pipeline API
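
To make the feature extraction entry above more concrete, here is a minimal pure-Python sketch of the idea behind HashingTF and IDF. It is not Spark code and not MLlib's implementation: the function names (hashing_tf, fit_idf, transform), the bucket count, and the sample documents are all hypothetical, and CRC32 is swapped in as the hash function for the sake of a deterministic, dependency-free example. The IDF weight uses the smoothed form log((m + 1) / (df + 1)) described in the MLlib documentation, where m is the number of documents and df is the number of documents containing the term.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Map tokens to a sparse term-frequency dict {bucket_index: count}.

    Assumption: CRC32 stands in for the hash function; any stable hash
    would illustrate the hashing trick equally well.
    """
    counts = Counter()
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)


def fit_idf(tf_vectors):
    """Compute a smoothed IDF weight for every bucket seen in the corpus."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())  # each bucket counted once per document
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}


def transform(tf_vec, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in tf_vec.items()}


# Hypothetical toy corpus: the token "spark" occurs in every document.
docs = [
    "spark mllib clustering".split(),
    "spark streaming kmeans".split(),
    "spark pipeline api".split(),
]

tf_vectors = [hashing_tf(doc) for doc in docs]
idf = fit_idf(tf_vectors)
tfidf_vectors = [transform(vec, idf) for vec in tf_vectors]
```

Because "spark" occurs in every document, its bucket receives an IDF weight of log(4/4) = 0, so it contributes nothing to any TF-IDF vector. Terms that occur in fewer documents receive larger weights, subject to hash collisions when the number of buckets is small.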
