Challenges in the DC algorithm

A distribution-based clustering (DC) algorithm such as GMM is fitted with the expectation-maximization (EM) algorithm. To avoid the overfitting problem, GMM usually models the dataset with a fixed number of Gaussian distributions. The distributions are initialized randomly, and their parameters are then iteratively optimized to fit the model better to the training dataset. This iterative refinement is one of GMM's most robust features and drives the model toward a local optimum. However, because the initialization is random, multiple runs of the algorithm may produce different results.
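
The following is a minimal sketch of that behavior, assuming a Python environment with scikit-learn and its GaussianMixture estimator (the specific dataset and seeds are illustrative, not from the original text). With random initialization and a single EM run per fit, different seeds can converge to different local optima:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy dataset: three Gaussian blobs in 2-D.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=0)

# Fit the same model several times with different random seeds.
# init_params="random" and n_init=1 make the sensitivity to the
# starting point visible; each run may reach a different local optimum.
for seed in range(3):
    gmm = GaussianMixture(n_components=3, init_params="random",
                          n_init=1, random_state=seed)
    gmm.fit(X)
    # score() returns the average per-sample log-likelihood; differing
    # values across seeds indicate different local optima.
    print(f"seed={seed}  avg log-likelihood={gmm.score(X):.4f}")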

In other words, unlike the bisecting K-means algorithm, GMM performs soft clustering: each object belongs to every Gaussian distribution with some posterior probability. When hard clustering is required, each object is simply assigned to the Gaussian distribution under which its posterior probability is highest. Another advantageous feature of GMM is that it produces complex models of clusters by capturing the correlations and dependencies among the attributes of the data points.
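
As a minimal sketch of this distinction (again assuming scikit-learn), the soft memberships come from predict_proba, and the derived hard labels are just the argmax of those posteriors:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=2, random_state=42)
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

probs = gmm.predict_proba(X)  # soft: one posterior probability per Gaussian
hard = gmm.predict(X)         # hard: most probable Gaussian per object

# The hard assignment is exactly the argmax of the soft memberships.
assert np.array_equal(hard, probs.argmax(axis=1))
print("first object memberships:", np.round(probs[0], 3), "-> cluster", hard[0])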

On the downside, GMM makes some assumptions about the format and shape of the data, and this puts an extra burden on us (that is, users). More specifically, clustering performance decreases drastically if the following two criteria are not met:

  • Non-Gaussian dataset: The GMM algorithm assumes that the dataset was generated by an underlying mixture of Gaussian distributions. Many practical datasets do not satisfy this assumption, which tends to result in low clustering performance (see the sketch after this list).
  • Uneven cluster sizes: If the clusters do not have even sizes, there is a high chance that small clusters will be dominated by larger ones.
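
The first criterion can be illustrated with a minimal sketch, assuming scikit-learn's dataset generators and its adjusted Rand index (ARI) metric: the same GMM that recovers Gaussian blobs almost perfectly typically scores noticeably worse on a non-Gaussian, two-moons dataset:

from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# One Gaussian and one non-Gaussian two-cluster dataset.
X_blobs, y_blobs = make_blobs(n_samples=500, centers=2, random_state=0)
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

for name, X, y in [("blobs", X_blobs, y_blobs), ("moons", X_moons, y_moons)]:
    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
    # ARI = 1.0 means the clustering matches the true grouping perfectly.
    print(f"{name}: ARI = {adjusted_rand_score(y, labels):.2f}")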