Chapter 4. Machine Learning Algorithms

In this chapter, we continue with the following topics:

  • Hands on with Spark/MLlib
  • Implementing classification and regression algorithms using Spark/MLlib
  • Implementing clustering algorithms using Spark/MLlib
  • Using Python to plot our results

Hands on with Spark/MLlib

In the first chapter, we set up Apache Spark and walked through an example using Spark/MLlib. Now it is time to explore Spark/MLlib in detail.

The following are the different features provided by Spark/MLlib:

  • Data types: vector, LabeledPoint, matrix, DistributedMatrix (BlockMatrix, RowMatrix, IndexedRowMatrix, and CoordinateMatrix)
  • Statistics: Summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
  • Feature extraction: TF-IDF (HashingTF and IDF), scaling, normalization, and Word2Vec
  • Classification/regression: Linear regression (LinearRegressionWithSGD, LassoWithSGD, and RidgeRegressionWithSGD), logistic regression (SGD, LBFGS), SVM, Naive Bayes, decision tree, ensembles (RandomForest and GradientBoostedTrees)
  • Clustering: K-Means, Gaussian mixture model/expectation-maximization, power iteration clustering, LDA (EMLDAOptimizer/OnlineLDAOptimizer), streaming K-Means
  • Association analysis: Frequent pattern mining (FPGrowth)
  • Dimensionality reduction: SVD and PCA
  • Recommendation: Collaborative filtering
  • Pipeline API
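
To make one of these features concrete, the following is a minimal sketch of the idea behind the TF-IDF entry (HashingTF and IDF) in plain Python. It is not Spark/MLlib code and does not reproduce MLlib's internals: the names hashing_tf, fit_idf, transform, and NUM_FEATURES are hypothetical, the CRC32 hash is a stand-in chosen only because it is deterministic, and the sample documents are made up. The IDF weight uses the smoothed form log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # hypothetical vector size, kept tiny for illustration


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to a sparse {index: count} term-frequency vector."""
    counts = Counter()
    for token in tokens:
        # Assumption: CRC32 stands in for the hash function; it is stable
        # across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


def fit_idf(tf_vectors):
    """Compute smoothed IDF weights, log((m + 1) / (df + 1)), per index."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())  # each document counts once per index
    return {i: math.log((m + 1) / (df + 1)) for i, df in doc_freq.items()}


def transform(tf_vector, idf):
    """Scale term frequencies by IDF; indices unseen during fitting get 0
    (a simplification made for this sketch)."""
    return {i: tf * idf.get(i, 0.0) for i, tf in tf_vector.items()}


# Made-up sample documents, already tokenized on whitespace.
docs = [
    "spark makes cluster computing simple".split(),
    "mllib runs on spark".split(),
    "clustering groups similar points".split(),
]
tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors)
tfidf_vectors = [transform(v, idf) for v in tf_vectors]

for vec in tfidf_vectors:
    print(sorted(vec.items()))
```

Because tokens are hashed into a fixed number of buckets, no vocabulary needs to be built or shipped between machines, which is the main appeal of the hashing approach on a cluster. The trade-off is that two different tokens can collide in the same bucket; a larger NUM_FEATURES makes that less likely.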