Summary

Both supervised and unsupervised learning methods share common concerns: noisy data, high dimensionality, and growing demands on memory and time as datasets scale. Other issues are peculiar to unsupervised learning because of the lack of ground truth: subjectivity in evaluating and interpreting models, the effect of cluster boundaries, and so on.

Feature reduction is an important preprocessing step that mitigates the scalability problem, in addition to offering other advantages. Linear methods such as PCA, Random Projection, and MDS each have specific benefits and limitations, and we must be aware of the assumptions inherent in each. Nonlinear feature reduction methods include Kernel PCA (KPCA) and manifold learning.
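
The core of Random Projection is small enough to show directly. The sketch below is plain Java rather than a library call; the class name, seed, and toy data are our own illustrative choices. It projects points onto k dimensions through a Gaussian random matrix scaled by 1/sqrt(k), in the spirit of the Johnson-Lindenstrauss lemma.

    import java.util.Random;

    public class RandomProjectionSketch {
        // Project points of dimension d down to k dimensions using a
        // Gaussian random matrix scaled by 1/sqrt(k).
        static double[][] project(double[][] x, int k, long seed) {
            int d = x[0].length;
            Random rng = new Random(seed);
            double[][] r = new double[d][k];
            for (int i = 0; i < d; i++)
                for (int j = 0; j < k; j++)
                    r[i][j] = rng.nextGaussian() / Math.sqrt(k);
            double[][] z = new double[x.length][k];
            for (int p = 0; p < x.length; p++)
                for (int j = 0; j < k; j++)
                    for (int i = 0; i < d; i++)
                        z[p][j] += x[p][i] * r[i][j];
            return z;
        }

        public static void main(String[] args) {
            double[][] x = {{1, 0, 2, 5}, {0, 3, 1, 4}, {2, 2, 2, 2}};
            for (double[] row : project(x, 2, 42))
                System.out.println(row[0] + ", " + row[1]);
        }
    }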

Among clustering algorithms, k-Means is a centroid-based technique that must be initialized with the number of clusters and is sensitive to the initial choice of centroids. DBSCAN is a density-based algorithm that does not need the number of clusters up front and is robust to noise and outliers. Probabilistic techniques include Mean Shift, which is deterministic and robust to noise, and EM/GMM, which performs well with all types of features. Both Mean Shift and EM/GMM tend to have scalability problems.
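
To see why the initial centroids matter, consider a bare-bones version of Lloyd's iteration. The following plain-Java sketch is illustrative only (the names and toy data are ours); it seeds the centroids from the first k points, a naive initialization that in general can trap k-Means in a poor local optimum.

    public class KMeansSketch {
        // One run of Lloyd's algorithm: assign each point to its nearest
        // centroid, recompute centroids, repeat until assignments stop changing.
        static int[] cluster(double[][] x, int k, int maxIter) {
            int n = x.length, d = x[0].length;
            double[][] c = new double[k][];
            for (int j = 0; j < k; j++) c[j] = x[j].clone(); // naive seeding
            int[] label = new int[n];
            for (int iter = 0; iter < maxIter; iter++) {
                boolean changed = false;
                for (int i = 0; i < n; i++) {
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int j = 0; j < k; j++) {
                        double dist = 0;
                        for (int t = 0; t < d; t++)
                            dist += (x[i][t] - c[j][t]) * (x[i][t] - c[j][t]);
                        if (dist < bestDist) { bestDist = dist; best = j; }
                    }
                    if (label[i] != best) { label[i] = best; changed = true; }
                }
                if (!changed) break;                  // converged
                double[][] sum = new double[k][d];
                int[] count = new int[k];
                for (int i = 0; i < n; i++) {
                    count[label[i]]++;
                    for (int t = 0; t < d; t++) sum[label[i]][t] += x[i][t];
                }
                for (int j = 0; j < k; j++)
                    if (count[j] > 0)
                        for (int t = 0; t < d; t++) c[j][t] = sum[j][t] / count[j];
            }
            return label;
        }

        public static void main(String[] args) {
            double[][] x = {{1, 1}, {1.2, 0.8}, {5, 5}, {5.2, 4.9}};
            System.out.println(java.util.Arrays.toString(cluster(x, 2, 100)));
        }
    }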

Hierarchical clustering is a powerful method that builds a binary tree by iteratively grouping data points until a similarity threshold is reached; its tolerance to noise depends on the similarity metric used. SOM is a two-layer neural network that allows clusters to be visualized on a 2-D grid. Spectral clustering treats the dataset as a connected graph and identifies clusters by partitioning the graph. Affinity propagation, another graph-based technique, detects clusters by passing affinity messages between data points.
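
The iterative grouping in hierarchical clustering can be made concrete with a deliberately naive single-linkage sketch: repeatedly merge the two closest clusters until no pair is closer than a distance threshold. This plain-Java version (the names, threshold, and O(n^3) search are illustrative choices, not a production implementation):

    import java.util.ArrayList;
    import java.util.List;

    public class SingleLinkageSketch {
        // Merge the two closest clusters (single linkage = distance between
        // their closest members) until all remaining pairs exceed the threshold.
        static List<List<Integer>> cluster(double[][] x, double threshold) {
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < x.length; i++) {
                List<Integer> c = new ArrayList<>();
                c.add(i);
                clusters.add(c);
            }
            while (clusters.size() > 1) {
                int a = -1, b = -1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = linkage(x, clusters.get(i), clusters.get(j));
                        if (d < best) { best = d; a = i; b = j; }
                    }
                if (best > threshold) break;   // similarity threshold reached
                clusters.get(a).addAll(clusters.remove(b));
            }
            return clusters;
        }

        static double linkage(double[][] x, List<Integer> p, List<Integer> q) {
            double min = Double.MAX_VALUE;
            for (int i : p)
                for (int j : q) {
                    double d = 0;
                    for (int t = 0; t < x[i].length; t++)
                        d += (x[i][t] - x[j][t]) * (x[i][t] - x[j][t]);
                    min = Math.min(min, Math.sqrt(d));
                }
            return min;
        }

        public static void main(String[] args) {
            double[][] x = {{0, 0}, {0, 1}, {5, 5}, {5, 6}, {20, 20}};
            System.out.println(cluster(x, 2.0)); // [[0, 1], [2, 3], [4]]
        }
    }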

The validity and usefulness of clustering algorithms are assessed using various validation and evaluation measures. Internal measures, such as the Silhouette index and the Davies-Bouldin index, have no access to ground truth; external measures, such as the Rand index and the F-measure, can be used when labels are available.
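
As an example of an external measure, the Rand index follows directly from its definition: the fraction of point pairs on which the clustering and the ground-truth labels agree, either placing both points together or both apart. A plain-Java sketch with made-up labels:

    public class RandIndexSketch {
        // Fraction of point pairs on which clustering and ground truth agree.
        static double randIndex(int[] truth, int[] cluster) {
            int n = truth.length;
            long agree = 0, pairs = 0;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++) {
                    boolean sameTruth = truth[i] == truth[j];
                    boolean sameCluster = cluster[i] == cluster[j];
                    if (sameTruth == sameCluster) agree++;
                    pairs++;
                }
            return (double) agree / pairs;
        }

        public static void main(String[] args) {
            int[] truth   = {0, 0, 1, 1, 2, 2};
            int[] cluster = {1, 1, 0, 0, 0, 2};
            System.out.println(randIndex(truth, cluster)); // 0.8
        }
    }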

Outlier and anomaly detection is an important area of unsupervised learning. Techniques can be categorized as statistical, distance-based, density-based, clustering-based, high-dimensional, and one-class SVM approaches. Outlier evaluation techniques include supervised evaluation, where ground truth is known, and unsupervised evaluation, where it is not.
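
A minimal distance-based detector illustrates the idea: score each point by the distance to its k-th nearest neighbor, and treat the largest scores as outlier candidates. In this plain-Java sketch the class name, the choice of k, and the toy data are ours:

    import java.util.Arrays;

    public class KnnOutlierSketch {
        // Score each point by the distance to its k-th nearest neighbor;
        // points with the largest scores are the outlier candidates.
        static double[] scores(double[][] x, int k) {
            int n = x.length;
            double[] score = new double[n];
            for (int i = 0; i < n; i++) {
                double[] dist = new double[n];
                for (int j = 0; j < n; j++) {
                    double d = 0;
                    for (int t = 0; t < x[i].length; t++)
                        d += (x[i][t] - x[j][t]) * (x[i][t] - x[j][t]);
                    dist[j] = Math.sqrt(d);
                }
                Arrays.sort(dist);        // dist[0] == 0 (the point itself)
                score[i] = dist[k];       // k-th nearest neighbor distance
            }
            return score;
        }

        public static void main(String[] args) {
            double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {8, 8}};
            System.out.println(Arrays.toString(scores(x, 2)));
            // the isolated point (8, 8) gets by far the largest score
        }
    }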

Experiments using the SMILE Java API and the ELKI toolkit illustrate the various clustering and outlier detection techniques on the MNIST6000 handwritten digits dataset, and results from the different evaluation techniques are presented and compared.
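
For flavor, a minimal clustering call in SMILE might look like the sketch below. This assumes the SMILE 2.x API (KMeans.fit and the public label field y); the toy matrix stands in for the MNIST6000 features, whose loading step is omitted.

    import smile.clustering.KMeans;

    public class SmileKMeansSketch {
        public static void main(String[] args) {
            // Toy stand-in for the MNIST6000 feature matrix.
            double[][] x = {
                {1.0, 2.0}, {1.5, 1.8}, {5.0, 8.0},
                {8.0, 8.0}, {1.0, 0.6}, {9.0, 11.0}
            };
            // Partition into 2 clusters; KMeans.fit handles its own seeding.
            KMeans model = KMeans.fit(x, 2);
            for (int label : model.y)       // y holds the cluster assignments
                System.out.println(label);
        }
    }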
