Summary

In this chapter, we delved deeper into machine learning and saw how to use it to cluster the records of a dataset of unsupervised (unlabeled) observations. Along the way, you learned the practical know-how needed to apply supervised and unsupervised techniques to new problems, through widely used examples that build on the previous chapters. These examples were demonstrated from the Spark perspective. Note that none of the K-means, bisecting K-means, and Gaussian mixture algorithms is guaranteed to produce the same clusters when run multiple times, because their initialization involves randomness. For example, we observed that running the K-means algorithm several times with the same parameters generated slightly different results at each run.
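To illustrate why repeated runs can differ, here is a minimal sketch in plain Python (not Spark code) of a Lloyd-style K-means whose initial centers are drawn at random. The function name `kmeans`, its `seed` parameter, and the synthetic sample points are assumptions made for this illustration only. Fixing the seed makes a run repeatable; changing it changes the initial centers and may change the final clusters.

```python
import random


def kmeans(points, k, seed, iterations=20):
    """Plain K-means on 2-D points; `seed` controls the random initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization: the source of run-to-run variation
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(centers[i])
        centers = new_centers
    return sorted(centers)


# Hypothetical sample data: three noisy blobs of 30 points each.
data_rng = random.Random(0)
points = [
    (data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(30)
]

same_seed_1 = kmeans(points, k=3, seed=42)
same_seed_2 = kmeans(points, k=3, seed=42)
other_seed = kmeans(points, k=3, seed=7)

print("same seed, identical centers:", same_seed_1 == same_seed_2)
print("centers with seed 42:", same_seed_1)
print("centers with seed 7: ", other_seed)
```

In Spark ML, the clustering estimators expose a seed parameter for the same purpose (for example, `setSeed` on `KMeans`), so fixing it is the usual way to make experiments repeatable.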

For a performance comparison between K-means and Gaussian mixture, see Jung et al. and the cluster analysis lecture notes. In addition to K-means, bisecting K-means, and Gaussian mixture, MLlib provides implementations of three other clustering algorithms: power iteration clustering (PIC), latent Dirichlet allocation (LDA), and streaming K-means. It is also worth mentioning that, to fine-tune a clustering analysis, we often need to remove unwanted data objects called outliers or anomalies. With distance-based clustering, however, such data points are difficult to identify, so distance metrics other than Euclidean can be used. The following links are a good place to start:

  1. https://mapr.com/ebooks/spark/08-unsupervised-anomaly-detection-apache-spark.html
  2. https://github.com/keiraqz/anomaly-detection
  3. http://www.dcc.fc.up.pt/~ltorgo/Papers/ODCM.pdf
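As a rough illustration of the outlier removal mentioned above, one simple heuristic is to flag points that lie unusually far from their nearest cluster center. The sketch below is plain Python, not Spark code; the helper names, the centers, the points, and the threshold value are all hypothetical and chosen only for this example. It uses Euclidean distance, which could be swapped for another metric.

```python
import math


def nearest_center_distance(point, centers):
    """Euclidean distance from `point` to the closest cluster center."""
    return min(math.dist(point, c) for c in centers)


def flag_outliers(points, centers, threshold):
    """Return the points whose distance to the nearest center exceeds `threshold`."""
    return [p for p in points if nearest_center_distance(p, centers) > threshold]


# Hypothetical cluster centers, as if produced by an earlier clustering run.
centers = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical observations: three sit close to a center, one is far from both.
points = [(0.5, -0.2), (9.8, 10.1), (0.1, 0.3), (5.0, 30.0)]

# The threshold is an assumption; in practice it is tuned per dataset.
outliers = flag_outliers(points, centers, threshold=3.0)
print(outliers)
```

Points flagged this way can be inspected or removed before re-running the clustering.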

In the next chapter, we will dig deeper into tuning Spark applications for better performance and look at some best practices for optimizing them.
