5
General Clustering Techniques

5.1 Brief Overview of Clustering

As an exploratory tool, clustering techniques are frequently used as a means of identifying features in a data set, such as sets or subsets of observations with certain common (known or unknown) characteristics. The aim is to group observations into clusters (also called classes, or groups, or categories, etc.) that are internally as homogeneous as possible and as heterogeneous as possible across clusters. In this sense, the methods can be powerful tools in an initial data analysis, as a precursor to more detailed in‐depth analyses.

However, the plethora of available methods can be both a strength and a weakness, since methods vary in which aspects of the data they emphasize, and since not all methods give the same answers on the same data set. Even a given method can produce varying results depending on the underlying metrics used in its application. For example, if a clustering algorithm is based on distance/dissimilarity measures, we have seen from Chapters 3 and 4 that different measures can have different values for the same data set. While not always so, often the distances used are Euclidean distances (see Eq. (3.1.8)) or other Minkowski distances such as the city block distance. Just as no one distance measure is universally preferred, so it is that no one clustering method is generally preferred. Hence, all give differing results; all have their strengths and weaknesses, depending on features of the data themselves and on the circumstances.
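To make the dependence on the distance measure concrete, the following minimal sketch compares the city block and Euclidean members of the Minkowski family on three made-up points. The function name `minkowski` and the data are assumptions chosen for illustration; they do not come from the text.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Three hypothetical observations in two dimensions.
a, b, c = (0.0, 0.0), (3.0, 3.0), (0.0, 5.0)

city_ab, city_ac = minkowski(a, b, 1), minkowski(a, c, 1)  # city block (q = 1)
eucl_ab, eucl_ac = minkowski(a, b, 2), minkowski(a, c, 2)  # Euclidean (q = 2)

# Which of b and c is nearer to a depends on the measure used.
nearest_city = "b" if city_ab < city_ac else "c"
nearest_eucl = "b" if eucl_ab < eucl_ac else "c"
print(nearest_city, nearest_eucl)
```

Here the nearest neighbor of `a` changes with the order `q`, so any clustering step that relies on "closest" can produce a different grouping.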

In this chapter, we give a very brief description of some fundamental clustering methods and principles, primarily as these relate to classical data sets. How these translate into detailed algorithms, descriptions, and applications methodology for symbolic data is deferred to relevant later chapters. In particular, we discuss clustering as partitioning methods in section 5.2 (with the treatment for symbolic data in Chapter 6) and hierarchical methods in section 5.3 (and in Chapter 7 for divisive clustering and Chapter 8 for agglomerative hierarchies for symbolic data). Illustrative examples are provided in section 5.4, where some of these differing methods are applied to the same data set, along with an example using simulated data to highlight how classically‐based methods ignore the internal variations inherent to symbolic observations. The chapter ends with a quick look at other issues in section 5.5. Although their focus is specifically the marketing world, Punj and Stewart (1983) provide a useful overview of the clustering methods available at that time, including strengths and weaknesses, when dealing with classical realizations; Kuo et al. (2002) add updates to that review. Jain et al. (1999) and Gordon (1999) provide broad overall perspectives, mostly in connection with classical data but including some aspects of symbolic data.

We have $m$ observations $\Omega$ with realizations described by $\mathbf{Y}_u = (Y_{u1}, \dots, Y_{up})$, $u = 1, \dots, m$, in $p$‐dimensional space $\mathbb{R}^p$. In this chapter, realizations can be classical or symbolic‐valued.

5.2 Partitioning

Partitioning methods divide the data set $\Omega$ of observations into non‐hierarchical, i.e., not nested, clusters. There are many methods available, and they cover the gamut of unsupervised or supervised, agglomerative or divisive, and monothetic or polythetic methodologies. Lance and Williams (1967a) provide a good general introduction to the principles of partitions. Partitioning algorithms typically use the data as units (in $k$‐means type methods) or distances (in $k$‐medoids type methods) for clustering purposes.

The basic partitioning method has essentially the following steps:

Establish an initial partition of a pre‐specified number $K$ of clusters.
Determine classification criteria.
Determine a reallocation process by which each observation in the data set $\Omega$ can be reassigned iteratively to a new cluster according to how the classification criteria are satisfied; this includes reassignment to its current cluster.
Determine a stopping criterion.

The convergence of the basic partitioning method can be formalized as follows. Let $W(P)$ be a quality criterion defined on any partition $P = (C_1, \dots, C_K)$ of the initial population, and let $w(C_k)$ be defined as the quality criterion on the cluster $C_k$, $k = 1, \dots, K$, with

$$W(P) = \sum_{k=1}^{K} w(C_k). \tag{5.2.1}$$

Both $W(P)$ and $w(C_k)$ have positive values, with the internal variation decreasing as the homogeneity increases so that the quality of the clusters increases. At the $i$th iteration, we start with the partition $P^{(i)} = (C_1^{(i)}, \dots, C_K^{(i)})$. One observation from cluster $C_k^{(i)}$, say, is reallocated to another cluster $C_{k'}^{(i)}$, say, so that the new partition $P^{(i+1)}$ decreases the overall criterion $W(P)$. Hence, in this manner, at each step $W(P^{(i)})$ decreases and, being bounded below since $W(P^{(i)}) > 0$, converges.

There are many suggestions in the literature as to how the initial clusters might be formed. Some start with seeds, others with clusters. Of those that use seeds, representative methods, summarized by Anderberg (1973) and Cormack (1971), include:

Select any $K$ observations from the data set $\Omega$ subjectively.
Randomly select $K$ observations from $\Omega$.
Calculate the centroid (defined, e.g., as the mean of the observation values by variable) of $\Omega$ as the first seed and then select further seeds as those observations that are at least some pre‐specified distance from all previously selected seeds.
Choose the first $K$ observations in $\Omega$.
Select every $j$th observation in $\Omega$, for a step $j$ chosen to give $K$ seeds.
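Some of the seeding rules listed above can be sketched in a few lines. This is an illustrative sketch only: the function names (`seeds_first`, `seeds_random`, `seeds_spaced`), the threshold `min_dist`, and the toy data are assumptions, and the spaced rule is one plausible reading of "at least some pre‐specified distance from all previously selected seeds".

```python
import random

def euclid(x, y):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def centroid(data):
    """Variable-by-variable mean of the observations."""
    n, p = len(data), len(data[0])
    return tuple(sum(row[j] for row in data) / n for j in range(p))

def seeds_first(data, k):
    """Use the first k observations as seeds."""
    return list(data[:k])

def seeds_random(data, k, seed=0):
    """Use k observations drawn at random without replacement."""
    return random.Random(seed).sample(list(data), k)

def seeds_spaced(data, k, min_dist):
    """Start from the overall centroid, then add observations that lie at
    least min_dist from every seed chosen so far."""
    seeds = [centroid(data)]
    for row in data:
        if len(seeds) == k:
            break
        if all(euclid(row, s) >= min_dist for s in seeds):
            seeds.append(row)
    return seeds  # may hold fewer than k seeds if min_dist is too large

data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
spaced = seeds_spaced(data, k=3, min_dist=5.0)
print(spaced)
```

Different rules generally return different seeds for the same data, which is one reason the final clusters can differ from run to run.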

For those methods that start with initial partitions, Anderberg's (1973) summary includes:

Randomly allocate the $m$ observations in $\Omega$ into $K$ clusters.
Use the results of a hierarchy clustering.
Seek the advice of experts in the field.

Other methods introduce “representative/supplementary” data as seeds. We observe that different initial clusters can lead to different final clusters, especially when, as is often the case, the algorithm used is based on minimizing an error sum of squares, as these methods can converge to local minima instead of global minima. Indeed, it is not possible to guarantee that a global minimum is attained unless all possible partitions are explored, which is essentially impossible given the very large number of possible partitions. It can be shown that the number of possible partitions of $m$ observations into $K$ clusters is the Stirling number of the second kind,

$$S(m, K) = \frac{1}{K!} \sum_{k=0}^{K} (-1)^{K-k} \binom{K}{k} k^{m}.$$

Thus, for example, suppose . Then, when , , and when , .
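The growth of the Stirling number of the second kind can be checked directly with the standard explicit formula. In the sketch below, the function name `stirling2` and the values of the number of observations and clusters are chosen purely for illustration and are not taken from the text.

```python
from math import comb, factorial

def stirling2(m, k):
    """Number of ways to partition m labeled observations into k non-empty
    clusters, via the standard inclusion-exclusion formula."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** m for j in range(k + 1))
    return total // factorial(k)

# Illustrative values: three clusters, with 10 and then 20 observations.
print(stirling2(10, 3))  # 9330
print(stirling2(20, 3))  # 580606446
```

Doubling the number of observations from 10 to 20 multiplies the number of candidate partitions by a factor of roughly sixty thousand, which is why exhaustive search is not practical.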

The most frequently used partitioning method is the so‐called $k$‐means algorithm and its variants introduced by Forgy (1965), MacQueen (1967), and Wishart (1969). The method calculates the mean vector, or centroid, value for each of the $K$ clusters. Then, the iterative reallocation step involves moving each observation, one at a time, to that cluster whose centroid value is closest; this includes the possibility that the current cluster is “best” for a given observation. When, as is usually the case, a Euclidean distance (see Eq. (3.1.8)) is used to measure the distance between an observation and the means, this implicitly minimizes the variance within each cluster. Forgy's method is non‐adaptive in that it recalculates the mean vectors after a complete round (or pass) on all observations has been made in the reallocation step. In contrast, MacQueen's (1967) approach is adaptive since it recalculates the centroid values after each individual observation has been reassigned during the reallocation step. Another distinction is that Forgy continues the reiteration of these steps (of calculating the centroids and then passing through all observations at the reallocation step) until convergence, defined as when no further changes in the cluster compositions occur. On the other hand, MacQueen stops after the second iteration.
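The difference between the non‐adaptive and adaptive updates can be sketched as follows. This is a simplified illustration under stated assumptions, not a faithful reproduction of either author's algorithm: the function names are hypothetical, seeds are supplied by the caller, empty clusters simply keep their previous centroid, and the adaptive version shows a single pass in which each seed is treated as if it were one observation already in its cluster.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Mean vector (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(pt[j] for pt in points) / n for j in range(len(points[0])))

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))

def forgy_kmeans(data, seeds, max_passes=100):
    """Non-adaptive update: centroids are recomputed only after a full pass."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_passes):
        new_labels = [nearest(x, centroids) for x in data]
        if new_labels == labels:      # no change in cluster composition
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:               # an empty cluster keeps its old centroid
                centroids[k] = mean(members)
    return labels, centroids

def macqueen_pass(data, seeds):
    """Adaptive update: the winning centroid moves after every assignment."""
    centroids = [tuple(s) for s in seeds]
    counts = [1] * len(seeds)         # assumption: each seed counts as one point
    labels = []
    for x in data:
        k = nearest(x, centroids)
        counts[k] += 1
        centroids[k] = tuple(c + (xi - c) / counts[k]
                             for c, xi in zip(centroids[k], x))
        labels.append(k)
    return labels, centroids

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = [(0, 0), (10, 10)]
forgy_labels, forgy_centroids = forgy_kmeans(data, seeds)
mac_labels, mac_centroids = macqueen_pass(data, seeds)
print(forgy_labels, mac_labels)
```

On well separated data such as this both updates give the same clusters; on less clear-cut data the adaptive version depends on the order in which observations are visited.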

These $k$‐means methods have the desirable feature that they converge rapidly. Both methods have piece‐wise linear boundaries between the final clusters. This is not necessarily so for symbolic data, since, for example, interval data produce observations that are hypercubes in $\mathbb{R}^p$ rather than the points of classically valued observations; however, some equivalent “linearity” concept would prevail. Another variant uses a so‐called forcing pass which tests out every observation in some other cluster in an effort to avoid being trapped in a local minimum, a shortcoming of these methods (see Friedman and Rubin (1967)). This approach requires an accommodation for changing the number of clusters $K$. This is particularly advantageous when outliers exist in the data set. Other variants involve rules as to when the iteration process stops, when, if at all, a reallocation process is implemented on the final set of clusters, and so on. Overall then, no completely satisfactory method exists. A summary of some of the many variants of the $k$‐means approach is in Jain (2010).

While the criteria driving the reallocation step can vary from one approach to another, how this step is implemented also changes from one algorithm to another. Typically though, this is achieved by a hill‐climbing steepest descent method or a gradient method such as is associated with Newton–Raphson techniques. The $k$‐means method is now a standard method widely used and available in many computer packages.

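A more general partitioning method is the “dynamical clustering” method. This method is based on a “representation function” ($g$) of any partition $P = (C_1, \dots, C_K)$ with $g(P) = L = (L_1, \dots, L_K)$. This associates a representation $L_k$ with each cluster $C_k$ of the partition $P$. Then, given $L$, an allocation function $f$ with $f(L) = P'$ associates a cluster $C'_k$ of a new partition $P' = (C'_1, \dots, C'_K)$ with each representation $L_k$ of $L$. For example, for the case of a representation by density functions $(d_1, \dots, d_K)$, each individual $u$ is then allocated to the cluster $C'_k$ if $d_k(u) = \max\{d_1(u), \dots, d_K(u)\}$ for this individual. In the case of representation by regression, each individual is allocated to the cluster $C'_k$ if this individual has the best fit to the regression $L_k$, etc. By alternating the use of these $g$ and $f$ functions, the method converges.

More formally, starting with the partition $P = (C_1, \dots, C_K)$ of the population $\Omega$, the representation function $g$ produces a vector of representation $L = (L_1, \dots, L_K)$ such that $g(P) = L$. Then, the allocation function $f$ produces a new partition $P' = (C'_1, \dots, C'_K)$ such that $f(L) = P'$. Eq. (5.2.1) can be generalized to

$$W(P, L) = \sum_{k=1}^{K} w(C_k, L_k) \tag{5.2.2}$$

where $w(C_k, L_k) \ge 0$ is a measure of the “fit” between each cluster $C_k$ and its representation $L_k$ and where $w(C_k, L_k)$ decreases as the fit increases. For example, if $L_k$ is a distribution function, then $w(C_k, L_k)$ could be the inverse of the likelihood of the cluster $C_k$ for distribution $L_k$.

That this more general dynamical clustering partitioning method converges can be proved as follows. Suppose at the $i$th iteration we have the partition $P^{(i)} = (C_1^{(i)}, \dots, C_K^{(i)})$ and suppose the representation vector is $L^{(i)} = (L_1^{(i)}, \dots, L_K^{(i)})$, and let $W(P^{(i)}, L^{(i)})$ be the quality measure on $P^{(i)}$ at this $i$th iteration. First, by the reallocation of each individual belonging to a cluster $C_k^{(i)}$ in $P^{(i)}$ to a new cluster $C_k^{(i+1)}$ such that $f(L^{(i)}) = P^{(i+1)}$, the new partition $P^{(i+1)}$ obtained by this reallocation improves the fit with $L_k^{(i)}$ in order that we have $w(C_k^{(i+1)}, L_k^{(i)}) \le w(C_k^{(i)}, L_k^{(i)})$, $k = 1, \dots, K$. This implies

$$W(P^{(i+1)}, L^{(i)}) = \sum_{k=1}^{K} w(C_k^{(i+1)}, L_k^{(i)}) \le \sum_{k=1}^{K} w(C_k^{(i)}, L_k^{(i)}) = W(P^{(i)}, L^{(i)}). \tag{5.2.3}$$

After this $i$th iteration, we can now define at the next $(i+1)$th iteration a new representation $L^{(i+1)} = (L_1^{(i+1)}, \dots, L_K^{(i+1)})$ with $g(P^{(i+1)}) = L^{(i+1)}$ and where we now have a better fit than we had before for the $i$th iteration. This means that $w(C_k^{(i+1)}, L_k^{(i+1)}) \le w(C_k^{(i+1)}, L_k^{(i)})$, $k = 1, \dots, K$. Hence,

$$W(P^{(i+1)}, L^{(i+1)}) \le W(P^{(i+1)}, L^{(i)}) \le W(P^{(i)}, L^{(i)}). \tag{5.2.4}$$

Therefore, since $W(P^{(i)}, L^{(i)}) \ge 0$ decreases at each iteration, it must be converging. We note that a simple condition for the sequence $\{W(P^{(i)}, L^{(i)})\}$ to be decreasing is that $W(P, g(P)) \le W(P, L)$ for any partition $P$ and any representation $L$. Thus, in the reallocation process, a reallocated individual which decreases the criterion the most will by definition be that individual which gives the best fit.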

As special cases, when the representation $L_k$ of cluster $C_k$ is the cluster mean, we have the $k$‐means algorithm, and when $L_k$ is the medoid, we have the $k$‐medoid algorithm. Notice then that the two quantities appearing in Eq. (6.1.5) correspond, respectively, to the overall criterion and the per‐cluster criterion of Eq. (5.2.1). Indeed, the representation of any cluster can be a density function (see Diday and Schroeder (1976)), a regression (see Charles (1977) and Liu (2016)), or more generally a canonical analysis (see Diday (1978, 1986a)), a factorial axis (see Diday (1972a,b)), a distance (see, e.g., Diday and Govaert (1977)), a functional curve (Diday and Simon (1976)), points of a population (Diday (1973)), and so on. For an overview, see Diday (1979), Diday and Simon (1976), and Diday (1989).
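The alternation between a representation step and an allocation step can be sketched generically, with the mean and the medoid plugged in as two special cases. This is a minimal one‐dimensional sketch under stated assumptions: the names `dynamical_clustering`, `represent`, and `misfit` are hypothetical, the initial partition is assumed to have no empty cluster, and the recorded criterion is the sum of per‐observation misfits to the representation of the assigned cluster.

```python
def dynamical_clustering(data, labels, n_clusters, represent, misfit, max_iter=100):
    """Alternate a representation step and an allocation step.
    represent(members) returns a cluster representation;
    misfit(x, rep) is a non-negative lack of fit of x to rep."""
    labels = list(labels)
    reps = [None] * n_clusters
    history = []                      # criterion value at each iteration
    for _ in range(max_iter):
        # Representation step: one representation per non-empty cluster.
        for k in range(n_clusters):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:
                reps[k] = represent(members)
        history.append(sum(misfit(x, reps[lab]) for x, lab in zip(data, labels)))
        # Allocation step: each observation goes to its best-fitting representation.
        new_labels = [min(range(n_clusters), key=lambda k: misfit(x, reps[k]))
                      for x in data]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, reps, history

def mean(members):
    return sum(members) / len(members)

def medoid(members):
    """Member with the smallest total absolute distance to the others."""
    return min(members, key=lambda c: sum(abs(c - x) for x in members))

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
start = [0, 1, 0, 1, 0, 1]            # deliberately poor initial partition

mean_labels, mean_reps, mean_hist = dynamical_clustering(
    data, start, 2, mean, lambda x, r: (x - r) ** 2)
med_labels, med_reps, med_hist = dynamical_clustering(
    data, start, 2, medoid, lambda x, r: abs(x - r))
print(mean_hist, med_hist)
```

In both runs the recorded criterion never increases from one iteration to the next, which is the behavior the convergence argument relies on when the representation minimizes the misfit within each cluster.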

Many names have been used for this general technique. For example, Diday (1971a,b, 1972b) refers to this as a “nuées dynamiques” method, while in Diday (1973), Diday et al. (1974), and Diday and Simon (1976), it is called a “dynamic clusters” technique, and Jain and Dubes (1988) call it a “square error distortion” method. Bock (2007) calls it an “iterated minimum partition” method and Anderberg (1973) refers to it as a “nearest centroid sorting” method; these latter two descriptors essentially reflect the actual procedure as described in Chapter 6. This dynamical partitioning method can also be found in, e.g., Diday et al. (1974), Schroeder (1976), Scott and Symons (1971), Diday (1979), Celeux and Diebolt (1985), Celeux et al. (1989), Symons (1981), and Ralambondrainy (1995).

Another variant is the “self‐organizing maps” (SOM) method, or “Kohonen neural networks” method, described by Kohonen (2001). This method maps patterns onto a grid of units or neurons. In a comparison of the two methods, Bacão et al. (2005) conclude that the SOM method is less likely than the $k$‐means method to settle on local optima, but that the SOM method converges to the same result as that obtained by the $k$‐means method. Hubert and Arabie (1985) compare some of these methods.

While the $k$‐means method uses input coordinate values as its data, the $k$‐medoids method uses dissimilarities/distances as its input. This method is also known as the partitioning around medoids (PAM) method developed by Kaufman and Rousseeuw (1987) and Diday (1979). The algorithm works in the same way as does the $k$‐means algorithm but moves observations into clusters based on their dissimilarities/distances from cluster medoids (instead of cluster means). The medoids are representative but actual observations inside the cluster. Some variants use the median, midrange, or mode value as the medoid. Since the basic input elements are dissimilarities/distances, the method easily extends to symbolic data, where now dissimilarities/distances can be calculated using the results of Chapters 3 and 4. The PAM method is computationally intensive so is generally restricted to small data sets. Kaufman and Rousseeuw (1986) developed a version for large data sets called clustering large applications (CLARA). This version takes samples from the full data set, applies PAM to each sample, and then amalgamates the results.
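Because only dissimilarities are needed, a medoid‐based partition can be computed from a dissimilarity matrix alone. The sketch below is a simplified alternating scheme and not the full build/swap procedure of PAM; the function names, the starting medoids, and the toy matrix are assumptions, and the matrix is assumed to have a zero diagonal and positive off‐diagonal entries.

```python
def assign(D, medoids):
    """Label each observation with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda k: D[i][medoids[k]])
            for i in range(len(D))]

def k_medoids(D, medoids, max_iter=100):
    """Alternate assignment and medoid update using only the matrix D."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(D, medoids)
        new_medoids = []
        for k in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            # New medoid: the member with the smallest total dissimilarity
            # to the other members of its cluster.
            new_medoids.append(min(members, key=lambda c: sum(D[c][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(D, medoids), medoids

# Toy dissimilarity matrix built from six points on a line.
x = [1, 2, 3, 10, 11, 12]
D = [[abs(a - b) for b in x] for a in x]
labels, medoids = k_medoids(D, medoids=[0, 1])
print(labels, medoids)
```

The same function would accept a symbolic dissimilarity matrix unchanged, since the coordinates of the observations are never used after `D` is formed.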

As a different approach, Goodall (1966) used a probabilistic similarity matrix to determine the reallocation step, while other researchers based this criterion on multivariate statistical analyses such as discriminant analysis. Another approach is to assume observations arise from a finite mixture of distributions, usually multivariate normal distributions. These model‐based methods require the estimation, often by maximum likelihood methods or Bayesian methods, of model parameters (mixture probabilities, and the distributional parameters such as means and variances) at each stage, before the partitioning process continues, as in, for example, the expectation‐maximization (EM) algorithm of Dempster et al. (1977). See, for example, McLachlan and Peel (2000), McLachlan and Basford (1988), Celeux and Diebolt (1985), Celeux and Govaert (1992, 1995), Celeux et al. (1989), and Banfield and Raftery (1993).

5.3 Hierarchies

In contrast to partitioning methods, hierarchical methods construct hierarchies, which are nested clusters. Usually, these are depicted pictorially as trees or dendrograms, thereby showing the relationships between the clusters, as in Figure 5.1. In this chapter, we make no distinction between monothetic methods (which focus on one variable at a time) and polythetic methods (which use all variables simultaneously) (see Chapter 7, where the distinctions are made).

Note this definition of the hierarchy includes all the nodes of the tree. In contrast, the final clusters only, with none of the intermediate sub‐cluster/nodes, can be viewed as the bi‐partition (although the construction of a hierarchy differs from that for a partition). Thus for a partition, the intersection of two clusters is empty (e.g., clusters $C_1$ and $C_2$ of Figure 5.1) whereas the intersection of two nested clusters in a hierarchy is not necessarily empty (e.g., clusters $C_1$ and $C_3$ of Figure 5.1).

Figure 5.1 Clusters as a partition at the end of a hierarchy. (Illustration of clusters $C_1$ to $C_6$: the intersection of clusters $C_1$ and $C_2$ is empty, whereas the intersection of the nested clusters $C_1$ and $C_3$ is not.)

Figure 5.2 Nodes‐clusters of a hierarchy. (Illustration of nodes‐clusters $C_1$ to $C_{11}$, with cluster $C_1$ divided into $C_2$ and $C_3$, each further divided into four clusters.)

Hierarchical methodologies fall into two broad classes. Divisive clustering starts with the tree root $C_1$ containing all observations and divides this into two sub‐clusters $C_2$ and $C_3$ and so on, down to a final set of clusters each with one observation. Thus, with reference to Figure 5.2, we start with one cluster $C_1$ (containing six observations) which is divided into $C_2$ and $C_3$, here each of three observations. Then, one of these (in this case ) is further divided into, here, and . Then, cluster is divided into and . The process continues until there are six clusters each containing just one observation.

Agglomerative methods, on the other hand, start with $m$ clusters, with each observation being a one‐observation cluster at the bottom of the tree with successive merging of clusters upwards, until the final cluster consisting of all the observations ($\Omega$) is obtained. Thus, with reference to Figure 5.2, each of the six clusters at the bottom of the tree contains one observation only. The two clusters and are merged to form cluster . Then, and merge to form cluster . The next merging is when the clusters and merge to give , or equivalently, at this stage (in two steps) the original single observation clusters , , and form the three‐observation cluster of . This process continues until all six observations in $\Omega$ form the root node $C_1$.

Of course, for either of these methods, the construction can stop at some intermediate stage, such as between the formation of the clusters and in Figure 5.2 (indicated by the dotted line). In this case, the process stops with the four clusters , , , and . Indeed, such intermediate clusters may be more informative than might be a final complete tree with only one‐observation clusters as its bottom layer. Further, these interim clusters can serve as valuable input to the selection of initial clusters in a partitioning algorithm.

At a given stage, how we decide which two clusters are to be merged in the agglomerative approach, or which cluster is to be bi‐partitioned, i.e., divided into two clusters in a divisive approach, is determined by the particular algorithm adopted by the researcher. There are many possible criteria used for this.

The motivating principle behind any hierarchical construction is to divide the data set into clusters which are internally as homogeneous as possible and externally as heterogeneous as possible. While some algorithms implicitly focus on maximizing the between‐cluster heterogeneities, and others focus on minimizing the within‐cluster heterogeneities exclusively, most aim to balance both the between‐cluster variations and within‐cluster variations.

Thus, divisive methods search to find the best bi‐partition at each stage, usually using a sum of squares approach which minimizes the total error sum of squares (in a classical setting) or a total within‐cluster variation (in a symbolic setting). More specifically, the cluster $C$ with $m_C$ observations has a total within‐cluster variation of $I(C)$ (say), described in Chapter 7 (Definition 7.1, Eq. (7.1.2)), and we want to divide the $m_C$ observations into two disjoint sub‐clusters $C^{(1)}$ and $C^{(2)}$ so that the total within‐cluster variations of both $C^{(1)}$ and $C^{(2)}$ satisfy $I(C^{(1)}) + I(C^{(2)}) < I(C)$. How this is achieved varies with the different algorithms, but all seek to find that division which minimizes $I(C^{(1)}) + I(C^{(2)})$. However, since these entities are functions of the distances (or dis/similarities) between observations in a cluster, different dissimilarities/distance functions will also produce differing hierarchies in general (though some naturally will be comparable for different dissimilarities/distance functions and/or different algorithms) (see Chapters 3 and 4).
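A single divisive step can be illustrated by exhaustive search on a small classical data set. In this sketch the within‐cluster variation is taken to be the classical error sum of squares, which is an assumption made for illustration (the symbolic definition is the one referred to in Chapter 7), and the function names are hypothetical. Exhaustive enumeration is only feasible for very small clusters.

```python
from itertools import combinations

def within_ss(values):
    """Classical error sum of squares of a list of numbers."""
    if not values:
        return 0.0
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values)

def best_bipartition(values):
    """Try every split into two non-empty sub-clusters and keep the one
    with the smallest total within-cluster sum of squares."""
    n = len(values)
    best = None
    for size in range(1, n // 2 + 1):
        for left in combinations(range(n), size):
            chosen = set(left)
            c1 = [values[i] for i in left]
            c2 = [values[i] for i in range(n) if i not in chosen]
            total = within_ss(c1) + within_ss(c2)
            if best is None or total < best[0]:
                best = (total, c1, c2)
    return best

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
total, c1, c2 = best_bipartition(values)
print(total, c1, c2, within_ss(values))
```

The split found here reduces the variation from 125.5 for the undivided cluster to 4.0 for the two sub‐clusters combined.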

In contrast, agglomerative methods seek to find the best merging of two clusters at each stage. How “best” is defined varies by the method, but typically involves a concept of merging the two “closest” observations. Most involve a dis/similarity (or distance) measure in some format. These include the following:

Single‐link (or “nearest neighbor”) methods, where two clusters $C_a$ and $C_b$ are merged if the dissimilarity between them is the minimum of all such dissimilarities across all pairs $C_v$ and $C_{v'}$, $v \neq v'$. Thus, with reference to Figure 5.2, at the first step, since each cluster corresponds to a single observation, this method finds those two observations $u_1$ and $u_2$ for which the dissimilarity/distance measure, $d(u_1, u_2)$, is minimized; these two observations form the first merged cluster. At the next step, the process continues by finding the two clusters which have the minimum dissimilarity/distance between them. See Sneath and Sokal (1973).
Complete‐link (or “farthest neighbor”) methods are comparable to single‐link methods except that it is the maximum dissimilarity/distance at each stage that determines the best merger, instead of the minimum dissimilarity/distance. See Sneath and Sokal (1973).
Ward's (or “minimum variance”) methods, where two clusters, of the set of clusters, are merged if they have the smallest squared Euclidean distance between their cluster centroids. This is proportional to a minimum increase in the error sum of squares for the newly merged cluster (across all potential mergers). See Ward (1963).
There are several others, including average‐link (weighted average and group average), median‐link, flexible, centroid‐link, and so on. See, e.g., Anderberg (1973), Jain and Dubes (1988), McQuitty (1967), Gower (1971), Sokal and Michener (1958), Sokal and Sneath (1963), MacNaughton‐Smith et al. (1964), Lance and Williams (1967b) and Gordon (1987, 1999).
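The single‐link and complete‐link rules listed above differ only in how the dissimilarity between two clusters is computed from the pairwise dissimilarities. The following naive sketch makes that explicit; the function name `agglomerate`, the tie‐breaking by first occurrence, and the four‐point data set are assumptions chosen for illustration, and the cubic‐time search is only suitable for small inputs.

```python
def agglomerate(D, linkage="single"):
    """Naive agglomerative clustering from a dissimilarity matrix D.
    Returns the list of merges as (cluster_a, cluster_b, height)."""
    combine = min if linkage == "single" else max   # single vs complete link
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for t, c in enumerate(clusters) if t not in (a, b)] + [merged]
    return merges

# Four points on a line, chosen so that no ties occur.
x = [0, 1, 3, 5.5]
D = [[abs(a - b) for b in x] for a in x]

single = agglomerate(D, "single")
complete = agglomerate(D, "complete")
print([m[:2] for m in single])
print([m[:2] for m in complete])
```

With these data the single‐link tree chains the third point onto the first pair, whereas the complete‐link tree pairs the third and fourth points first, so the two rules give different hierarchies from the same dissimilarity matrix.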

Clearly, as for divisive methods, these methods involve dissimilarity (or distance) functions in some form. A nice feature of the single‐link and complete‐link methods is that they are invariant to monotonic transformations of the dissimilarity matrices. The median‐link method also retains this invariance property; however, it has a problem in implementation in that ties can frequently occur. The average‐link method loses this invariance property. The Ward criterion can retard the growth of large clusters, with outlier points tending to be merged at earlier stages of the clustering process. When the dissimilarity matrices are ultrametric (see Definition 3.4), single‐link and complete‐link methods produce the same tree. Furthermore, an ultrametric dissimilarity matrix will always allow a hierarchy tree to be built. Those methods based on Euclidean distances are invariant to orthogonal rotation of the variable axes.
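The link between Ward's criterion and the distance between centroids can be verified numerically. The sketch below relies on the standard identity that merging clusters of sizes $n_a$ and $n_b$ increases the error sum of squares by $n_a n_b / (n_a + n_b)$ times the squared Euclidean distance between their centroids; the function names and the data are assumptions for illustration. The size‐dependent weight is what makes merging into an already large cluster relatively costly.

```python
def mean(points):
    """Centroid of a non-empty list of points."""
    n = len(points)
    return tuple(sum(pt[j] for pt in points) / n for j in range(len(points[0])))

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ess(points):
    """Error sum of squares of a cluster about its centroid."""
    c = mean(points)
    return sum(sq_dist(x, c) for x in points)

def ward_increase(a, b):
    """Increase in error sum of squares caused by merging clusters a and b."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(mean(a), mean(b))

a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0)]
direct = ess(a + b) - ess(a) - ess(b)   # computed from the definition
shortcut = ward_increase(a, b)          # computed from the centroids only
print(direct, shortcut)
```

Both routes give the same value, so an implementation needs only the cluster sizes and centroids to rank candidate mergers.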

In general, ties are a problem, and so usually algorithms assume no ties exist. However, ties can be broken by adding an arbitrarily small value to one distance; Jardine and Sibson (1971) show that, for the single‐link method, the results converge to the same tree as this perturbation tends to zero. Also, reversals can occur for median and centroid constructions, though they can be prevented when the distance between two clusters exceeds the maximum tree height for each separate cluster (see, e.g., Gordon (1987)).

Gordon (1999), Punj and Stewart (1983), Jain and Dubes (1988), Anderberg (1973), Cormack (1971), Kaufman and Rousseeuw (1990), and Jain et al. (1999), among many other sources, provide excellent coverage and comparative reviews of these methods, for both divisive and agglomerative approaches. While not all such comparisons are in agreement, by and large complete‐link methods are preferable to single‐link methods, though complete‐link methods have more problems with ties. The real agreement between these studies is that there is no one best dissimilarity/distance measure, no one best method, and no one best algorithm, though Kaufman and Rousseeuw (1990) argue strenuously that the group average‐link method is preferable. It all depends on the data at hand and the purpose behind the construction itself. For example, if the aim is to find a road system in a map of roads, then a single‐link method would likely be wanted, whereas if types of housing clusters are sought within a region, then a complete‐link method would be better. Jain et al. (1999) suggest that complete‐link methods are more useful than single‐link methods for many applications. In statistical applications, minimum variance methods tend to be used, while in pattern recognition, single‐link methods are more often used. Also, a hierarchy is in one‐to‐one correspondence with an ultrametric dissimilarity matrix (Johnson, 1967).

A pyramid is a type of agglomerative construction where now clusters can overlap. In particular, any given object can appear in two overlapping clusters. However, that object cannot appear in three or more clusters. This implies there is a linear order of the set of objects in $\Omega$. To construct a pyramid, the dissimilarity matrix must be a Robinson matrix (see Definition 3.6). Whereas a partition consists of non‐overlapping/disjoint clusters and a hierarchy consists of nested non‐overlapping clusters, a pyramid consists of nested overlapping clusters. Diday (1984, 1986b) introduces the basic principles, with an extension to spatial pyramids in Diday (2008). They are constructed from dissimilarity matrices by Diday (1986b), Diday and Bertrand (1986), and Gaul and Schader (1994).

For most methods, it is not possible to be assured that the resulting tree is globally optimal, only that it is step‐wise optimal. Recall that at each step/stage of the construction, an optimal criterion is being used; this does not necessarily guarantee optimality for the complete tree. Also, while step‐wise optimality pertains, once that stage is completed (be this up or down the tree, i.e., divisive or agglomerative clustering), there is no allowance to go back to any earlier stage.

There are many different formats for displaying a hierarchy tree; Figure 5.1 is but one illustration. Gordon (1999) provides many representations. For each representation, there are different orders of the tree branches of the same hierarchy of objects. Thus, for example, a hierarchy on the three objects $\{1, 2, 3\}$ can take any of the forms shown in Figure 5.3. However, a tree with the object 3 as a branch in‐between the branches 1 and 2 is not possible. In contrast, there are only two possible orderings if a pyramid is built on these three objects, namely, and .

Figure 5.3 Different orders for a hierarchy of objects $\{1, 2, 3\}$.

A different format is a so‐called dendrogram, which is a hierarchy tree on which a measure of “height” has been added at each stage of the tree's construction.

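It is easy to show (Gordon, 1999) that, for any two observations in $\Omega$ with realizations $\mathbf{Y}_{u_1}$, $\mathbf{Y}_{u_2}$, the height $h_{u_1 u_2}$ satisfies the ultrametric property, i.e.,

$$h_{u_1 u_2} \le \max\{h_{u_1 u_3}, h_{u_3 u_2}\} \quad \text{for all } u_3 \in \Omega. \tag{5.3.1}$$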

One such measure is a function of the explained rate $E_r$ (i.e., the proportion of total variation in $\Omega$ explained by the tree) at stage $r$ of its construction (see section 7.4 for details).
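The ultrametric property of tree heights can be checked mechanically on a small example. In this sketch the function name `is_ultrametric` and both matrices are assumptions made for illustration: `H` holds the heights at which pairs of four objects are first joined in a hypothetical tree (merges at heights 1, 2, and 2.5), and `D` holds the raw distances between four points on a line.

```python
from itertools import permutations

def is_ultrametric(H, tol=1e-12):
    """True if H[i][j] <= max(H[i][k], H[k][j]) for all distinct i, j, k."""
    n = len(H)
    return all(H[i][j] <= max(H[i][k], H[k][j]) + tol
               for i, j, k in permutations(range(n), 3))

# Heights from a hypothetical tree: objects 0 and 1 join at height 1,
# object 2 joins them at height 2, and object 3 joins at height 2.5.
H = [[0.0, 1.0, 2.0, 2.5],
     [1.0, 0.0, 2.0, 2.5],
     [2.0, 2.0, 0.0, 2.5],
     [2.5, 2.5, 2.5, 0.0]]

# Raw distances between the points 0, 1, 3, 5.5 on a line.
x = [0, 1, 3, 5.5]
D = [[abs(a - b) for b in x] for a in x]

print(is_ultrametric(H), is_ultrametric(D))
```

The tree heights satisfy the inequality while the raw distances do not, which reflects the fact that building a tree replaces the original dissimilarities with an ultrametric.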

There are many divisive and agglomerative algorithms in the literature for classical data, but few have been specifically designed for symbolic‐valued observations. However, those (classical) algorithms based on dissimilarities/distances can sometimes be extended to symbolic data where now the symbolic dissimilarities/distances between observations are utilized. Some such extensions are covered in Chapters 7 and 8.

5.4 Illustration

We illustrate some of the clustering methods outlined in this chapter. The first three provide many different analyses on the same data set, showing how different methods can give quite different results. The fourth example shows how symbolic analyses utilizing internal variations are able to identify clusters that might be missed when using a classical analysis on the same data.

Example 5.1

The data in Table 5.1 were obtained by aggregating approximately 50000 classical observations, representing measurements observed for $m = 16$ airlines for flights arriving at and/or departing from a major airport hub. Of particular interest are the variables $Y_1$ = flight time between airports (airtime), $Y_2$ = time to taxi in to the arrival gate after landing (taxi‐in time), $Y_3$ = time flight arrived after its scheduled arrival time (arrival‐delay time), $Y_4$ = time taken to taxi out from the departure gate until liftoff (taxi‐out time), and $Y_5$ = time flight was delayed after its scheduled departure time (departure‐delay time). The histograms for each variable and each airline were found using standard methods and packages. Thus, there were histogram sub‐intervals for each of and , sub‐intervals for and , and for . The relative frequencies for each sub‐interval, for each airline, are displayed in Table 5.1. Flights that were delayed in some manner by weather were omitted in this aggregation. The original data were extracted from Falduti and Taibaly (2004).

Table 5.1 Flights data (Example 5.1)

Airtime	[0, 70)	[70, 110)	[110, 150)	[150, 190)	[190, 230)	[230, 270)	[270, 310)	[310, 350)	[350, 390)	[390, 430)	[430, 470)	[470, 540]
1	0.00017	0.10568	0.33511	0.20430	0.12823	0.045267	0.07831	0.07556	0.02685	0.00034	0.00000	0.00000
2	0.24826	0.54412	0.20365	0.00397	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
3	0.77412	0.22451	0.00137	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
4	0.16999	0.00805	0.42775	0.20123	0.01955	0.02606	0.00556	0.00556	0.00057	0.01169	0.09774	0.02626
5	0.13464	0.10799	0.01823	0.37728	0.35063	0.01122	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
6	0.70026	0.22415	0.07264	0.00229	0.00065	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
7	0.26064	0.21519	0.34916	0.06427	0.02798	0.01848	0.03425	0.02272	0.00729	0.00000	0.00000	0.00000
8	0.17867	0.41499	0.40634	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
9	0.28907	0.41882	0.28452	0.00683	0.00076	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
10	0.00000	0.00000	0.00000	0.00000	0.03811	0.30793	0.34299	0.21494	0.08384	0.01220	0.00000	0.00000
11	0.51329	0.35570	0.11021	0.01651	0.00386	0.00000	0.00021	0.00000	0.00000	0.00000	0.00000	0.00000
12	0.39219	0.31956	0.19201	0.09442	0.00182	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
13	0.00000	0.61672	0.36585	0.00348	0.00174	0.00000	0.00523	0.00174	0.00348	0.00000	0.00000	0.00174
14	0.07337	0.28615	0.19896	0.05956	0.04186	0.07898	0.10876	0.10315	0.04877	0.00043	0.00000	0.00000
15	0.76391	0.20936	0.01719	0.00645	0.00263	0.00048	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000
16	0.48991	0.06569	0.26276	0.10803	0.02374	0.02810	0.01543	0.00594	0.00040	0.00000	0.00000	0.00000
Taxi‐in time	[0, 5)	[5, 10)	[10, 15)	[15, 20)	[20, 25)	[25, 30)	[30, 35)	[35, 40)	[40, 50)	[50, 75)	[75, 100)	[100, 125]
1	0.38451	0.37194	0.12289	0.04647	0.02513	0.01256	0.01015	0.00706	0.00929	0.00740	0.00207	0.00052
2	0.53916	0.32659	0.07813	0.02697	0.01150	0.00555	0.00278	0.00278	0.00377	0.00238	0.00040	0.00000
3	0.46365	0.45450	0.05853	0.01189	0.00640	0.00229	0.00137	0.00046	0.00091	0.00000	0.00000	0.00000
4	0.70353	0.23745	0.03526	0.01418	0.00364	0.00307	0.00134	0.00077	0.00057	0.00019	0.00000	0.00000
5	0.32539	0.49509	0.13604	0.03086	0.00982	0.00140	0.00140	0.00000	0.00000	0.00000	0.00000	0.00000
6	0.63514	0.24280	0.04647	0.02127	0.01734	0.01014	0.00687	0.00458	0.00622	0.00687	0.00098	0.00131
7	0.33017	0.47584	0.12142	0.03748	0.01865	0.00543	0.00339	0.00187	0.00288	0.00237	0.00034	0.00017
8	0.32277	0.46398	0.10663	0.03458	0.03458	0.00865	0.01441	0.00288	0.00288	0.00576	0.00288	0.00000
9	0.28907	0.51897	0.13050	0.02883	0.01214	0.01214	0.00531	0.00000	0.00228	0.00076	0.00000	0.00000
10	0.42378	0.32470	0.11433	0.04878	0.02591	0.02591	0.00610	0.00762	0.00915	0.00915	0.00000	0.00457
11	0.42152	0.37500	0.11321	0.04245	0.01672	0.01158	0.00793	0.00343	0.00386	0.00322	0.00107	0.00000
12	0.23559	0.60009	0.11621	0.02860	0.00908	0.00726	0.00227	0.00045	0.00000	0.00045	0.00000	0.00000
13	0.33972	0.46167	0.08711	0.03136	0.02787	0.02265	0.00174	0.00697	0.00697	0.01045	0.00348	0.00000
14	0.59732	0.29650	0.07078	0.01640	0.00691	0.00647	0.00302	0.00043	0.00129	0.00043	0.00000	0.00043
15	0.61733	0.32442	0.04201	0.00907	0.00430	0.00095	0.00095	0.00024	0.00048	0.00000	0.00024	0.00000
16	0.93154	0.05738	0.00594	0.00277	0.00119	0.00079	0.00040	0.00000	0.00000	0.00000	0.00000	0.00000
	[−40, −20)	[−20, 0)	[0, 20)	[20, 40)	[40, 60)	[60, 80)	[80, 100)	[100, 120)	[120, 140)	[140, 160)	[160, 200)	[200, 240]
1	0.09260	0.38520	0.28589	0.09725	0.04854	0.03046	0.01773	0.01411	0.00637	0.00654	0.01532	0.00000
2	0.14158	0.45588	0.24886	0.07238	0.03272	0.01745	0.01170	0.00595	0.00377	0.00377	0.00595	0.00000
3	0.04298	0.44810	0.31367	0.09419	0.04161	0.02515	0.01280	0.00686	0.00640	0.00320	0.00503	0.00000
4	0.07359	0.41989	0.31909	0.11154	0.04312	0.01763	0.00652	0.00422	0.00211	0.00077	0.00153	0.00000
5	0.09537	0.45863	0.30014	0.07433	0.03226	0.01683	0.01403	0.00281	0.00281	0.00000	0.00281	0.00000
6	0.12958	0.41361	0.21008	0.09097	0.04450	0.02716	0.02094	0.01440	0.01276	0.00884	0.02716	0.00000
7	0.06054	0.44362	0.33475	0.08648	0.03510	0.01865	0.00797	0.00661	0.00356	0.00051	0.00220	0.00000
8	0.08934	0.44957	0.29683	0.07493	0.01729	0.03746	0.00865	0.00576	0.00576	0.00576	0.00865	0.00000
9	0.07967	0.36646	0.28376	0.10698	0.06070	0.03794	0.02883	0.00835	0.01366	0.00835	0.00531	0.00000
10	0.14024	0.30030	0.29573	0.18293	0.03659	0.01067	0.00762	0.00305	0.00152	0.00762	0.01372	0.00000
11	0.07955	0.34627	0.27980	0.12393	0.06411	0.03967	0.02423	0.01844	0.00922	0.00493	0.00986	0.00000
12	0.03949	0.40899	0.33727	0.12483	0.04585	0.02224	0.00817	0.00635	0.00227	0.00136	0.00318	0.00000
13	0.07840	0.44599	0.21603	0.10627	0.04530	0.03310	0.01916	0.01394	0.00871	0.01220	0.02091	0.00000
14	0.07682	0.41951	0.27147	0.09840	0.03712	0.03151	0.01942	0.01122	0.00950	0.00604	0.01899	0.00000
15	0.10551	0.55693	0.22989	0.06493	0.02363	0.01074	0.00286	0.00143	0.00167	0.00095	0.00143	0.00000
16	0.05857	0.58726	0.23823	0.05540	0.02928	0.01187	0.01029	0.00317	0.00198	0.00079	0.00317	0.00000
	[0, 8)	[8, 16)	[16, 24)	[24, 32)	[32, 40)	[40, 48)	[48, 56)	[56, 64)	[64, 72)	[72, 80)	[80, 88)	[88, 96]
1	0.01687	0.43580	0.30878	0.12513	0.05370	0.02496	0.01136	0.00740	0.00499	0.01102	0.00000	0.00000
2	0.09974	0.40472	0.25223	0.13405	0.05870	0.02499	0.01190	0.00516	0.00357	0.00496	0.00000	0.00000
3	0.05761	0.41198	0.26932	0.14769	0.05624	0.03109	0.01280	0.00503	0.00137	0.00686	0.00000	0.00000
4	0.02779	0.57877	0.24722	0.08317	0.03910	0.01016	0.00479	0.00287	0.00211	0.00402	0.00000	0.00000
5	0.00561	0.33941	0.32959	0.17251	0.08555	0.03787	0.02104	0.00421	0.00421	0.00000	0.00000	0.00000
6	0.10995	0.46073	0.23920	0.10046	0.04221	0.01963	0.00916	0.00687	0.00393	0.00785	0.00000	0.00000
7	0.00967	0.37850	0.33593	0.14601	0.06427	0.02900	0.01509	0.00627	0.00526	0.01001	0.00000	0.00000
8	0.08934	0.46398	0.26225	0.10663	0.04900	0.01153	0.00865	0.00576	0.00288	0.00000	0.00000	0.00000
9	0.01897	0.34901	0.31259	0.16313	0.08877	0.03642	0.01669	0.00986	0.00152	0.00303	0.00000	0.00000
10	0.03963	0.39634	0.39024	0.10366	0.04116	0.01372	0.00610	0.00457	0.00152	0.00305	0.00000	0.00000
11	0.06003	0.42024	0.25772	0.13572	0.06111	0.02830	0.01715	0.00686	0.00536	0.00750	0.00000	0.00000
12	0.03404	0.45166	0.26146	0.11484	0.06128	0.03223	0.01816	0.01044	0.00681	0.00908	0.00000	0.00000
13	0.01742	0.38502	0.28920	0.15157	0.08537	0.04704	0.01568	0.00697	0.00000	0.00174	0.00000	0.00000
14	0.00691	0.33880	0.34916	0.15451	0.07682	0.02935	0.01942	0.00604	0.00604	0.01295	0.00000	0.00000
15	0.08761	0.53927	0.21246	0.09048	0.03438	0.01337	0.00668	0.00573	0.00263	0.00740	0.00000	0.00000
16	0.41472	0.48833	0.06055	0.02137	0.00831	0.00198	0.00119	0.00040	0.00119	0.00198	0.00000	0.00000
	[−15, 5)	[5, 25)	[25, 45)	[45, 65)	[65, 85)	[85, 105)	[105, 125)	[125, 145)	[145, 165)	[165, 185)	[185, 225)	[225, 265]
1	0.67762	0.16988	0.05714	0.03219	0.01893	0.01463	0.00878	0.00000	0.00361	0.00947	0.00775	0.00000
2	0.77414	0.12552	0.04144	0.01943	0.01368	0.00654	0.00694	0.00000	0.00416	0.00476	0.00337	0.00000
3	0.78235	0.10700	0.04161	0.02469	0.01783	0.01143	0.00549	0.00000	0.00366	0.00320	0.00274	0.00000
4	0.59448	0.27386	0.08317	0.02626	0.01016	0.00575	0.00287	0.00000	0.00134	0.00115	0.00096	0.00000
5	0.84993	0.07293	0.03086	0.01964	0.01683	0.00421	0.00281	0.00000	0.00000	0.00140	0.00140	0.00000
6	0.65249	0.14071	0.06872	0.04025	0.02749	0.01669	0.01407	0.00000	0.01014	0.01407	0.01538	0.00000
7	0.77650	0.14516	0.04036	0.01611	0.01051	0.00526	0.00305	0.00000	0.00085	0.00068	0.00153	0.00000
8	0.63112	0.24784	0.04323	0.02017	0.02882	0.00288	0.00865	0.00000	0.00865	0.00000	0.00865	0.00000
9	0.70030	0.12064	0.06297	0.04628	0.02049	0.01290	0.01897	0.00000	0.00986	0.00607	0.00152	0.00000
10	0.73323	0.16463	0.04726	0.01677	0.01220	0.00305	0.00457	0.00000	0.00152	0.00762	0.00915	0.00000
11	0.64537	0.15866	0.07654	0.04867	0.02744	0.01780	0.00858	0.00000	0.00686	0.00686	0.00322	0.00000
12	0.78711	0.12165	0.05311	0.01816	0.00772	0.00635	0.00227	0.00000	0.00136	0.00045	0.00182	0.00000
13	0.71080	0.12369	0.05749	0.03310	0.01916	0.00523	0.01045	0.00000	0.01742	0.01394	0.00871	0.00000
14	0.74234	0.09754	0.05697	0.03064	0.02201	0.01381	0.01252	0.00000	0.00216	0.00993	0.01208	0.00000
15	0.83600	0.10862	0.03032	0.01408	0.00573	0.00286	0.00095	0.00000	0.00072	0.00048	0.00024	0.00000
16	0.76573	0.13850	0.04432	0.02335	0.01464	0.00594	0.00356	0.00000	0.00079	0.00119	0.00198	0.00000

For each of the 16 airlines:

$Y_1$ = flight time between airports (air time); 12 histogram sub‐intervals,

$Y_2$ = time to taxi in to the gate after landing (taxi‐in time); 12 histogram sub‐intervals,

$Y_3$ = time arrived after scheduled arrival time (arrival‐delay time); 12 histogram sub‐intervals,

$Y_4$ = time to taxi out from the gate until liftoff (taxi‐out time); 12 histogram sub‐intervals,

$Y_5$ = delay time after scheduled departure time (departure‐delay time); 12 histogram sub‐intervals.

Let us calculate the extended Gowda–Diday dissimilarity matrix from Eqs. (4.2.24)–(4.2.27); it is shown in Table 5.2. Using this dissimilarity matrix, we can find the monothetic divisive clustering dendrogram shown in Figure 5.4, which was built by using the methodology for divisive clustering for histogram‐valued observations described in section 7.2.4. Figure 5.5 shows the tree constructed when using the polythetic algorithm of section 7.3, again based on the extended Gowda–Diday dissimilarity matrix. It is immediately clear that these trees are quite different from each other. For example, after the first cut, airlines (2, 8) are part of a cluster that also includes further airlines in the polythetic tree, but they are not part of this cluster in the monothetic tree.

Table 5.2 Flights data: extended Gowda–Diday Matrix (Example 5.1)

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
1	0	2.253	2.929	2.761	2.584	1.973	1.855	1.877	2.114	1.789	1.725	2.723	1.518	1.133	3.801	3.666
2	2.253	0	1.119	2.456	1.914	1.608	1.417	1.085	1.051	2.702	1.011	1.330	1.893	2.299	1.871	2.713
3	2.929	1.119	0	2.560	1.986	1.984	2.057	1.959	1.474	3.343	1.539	1.356	2.723	2.536	1.307	2.616
4	2.761	2.456	2.560	0	1.830	3.546	1.759	2.692	2.675	3.177	2.953	2.044	3.509	2.361	2.436	2.366
5	2.584	1.914	1.986	1.830	0	3.340	1.237	2.357	1.943	3.180	2.482	1.222	3.032	2.268	2.075	2.218
6	1.973	1.608	1.984	3.546	3.340	0	2.726	1.469	1.638	2.704	1.067	2.541	1.446	2.361	2.699	3.852
7	1.855	1.417	2.057	1.759	1.237	2.726	0	2.027	1.846	2.914	1.902	1.124	2.532	1.955	2.160	2.261
8	1.877	1.085	1.959	2.692	2.357	1.469	2.027	0	1.327	2.332	1.164	1.934	1.367	2.540	2.573	3.144
9	2.114	1.051	1.474	2.675	1.943	1.638	1.846	1.327	0	2.909	0.837	1.382	1.598	1.871	2.461	3.062
10	1.789	2.702	3.343	3.177	3.180	2.704	2.914	2.332	2.909	0	2.626	3.306	2.272	2.628	3.877	3.991
11	1.725	1.011	1.539	2.954	2.482	1.067	1.902	1.164	0.837	2.626	0	1.639	1.589	2.066	2.464	3.319
12	2.723	1.330	1.356	2.044	1.222	2.541	1.124	1.934	1.382	3.306	1.639	0	2.687	2.441	1.620	2.210
13	1.518	1.893	2.723	3.509	3.032	1.446	2.532	1.367	1.598	2.272	1.589	2.687	0	2.223	3.580	4.018
14	1.133	2.299	2.536	2.361	2.268	2.361	1.955	2.540	1.871	2.628	2.066	2.441	2.223	0	3.459	3.459
15	3.801	1.871	1.307	2.436	2.075	2.699	2.160	2.573	2.461	3.877	2.464	1.620	3.580	3.459	0	2.396
16	3.666	2.713	2.616	2.366	2.218	3.852	2.261	3.144	3.062	3.991	3.319	2.210	4.018	3.459	2.396	0

Figure 5.4 Monothetic dendrogram based on Gowda–Diday dissimilarities (with cuts) (Example 5.1).

Table 5.3 Flights data: normalized cumulative density function Matrix (Example 5.1)

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
1	0	0.261	0.318	0.210	0.166	0.317	0.204	0.221	0.230	0.256	0.251	0.262	0.179	0.111	0.392	0.385
2	0.261	0	0.115	0.241	0.201	0.158	0.115	0.082	0.113	0.446	0.125	0.094	0.155	0.258	0.141	0.206
3	0.318	0.115	0	0.304	0.270	0.126	0.174	0.164	0.153	0.520	0.111	0.110	0.212	0.309	0.093	0.237
4	0.210	0.241	0.304	0	0.207	0.330	0.199	0.216	0.287	0.367	0.307	0.251	0.299	0.206	0.292	0.270
5	0.166	0.201	0.270	0.207	0	0.350	0.132	0.210	0.208	0.346	0.280	0.192	0.220	0.179	0.295	0.306
6	0.317	0.158	0.126	0.330	0.350	0	0.255	0.160	0.167	0.542	0.104	0.195	0.188	0.318	0.144	0.269
7	0.204	0.115	0.174	0.199	0.132	0.255	0	0.136	0.125	0.406	0.174	0.092	0.158	0.196	0.220	0.216
8	0.221	0.082	0.164	0.216	0.210	0.160	0.136	0	0.114	0.417	0.128	0.118	0.116	0.260	0.190	0.231
9	0.230	0.113	0.153	0.287	0.208	0.167	0.125	0.114	0	0.464	0.079	0.099	0.086	0.214	0.235	0.287
10	0.256	0.446	0.520	0.367	0.346	0.542	0.406	0.417	0.464	0	0.487	0.461	0.414	0.312	0.562	0.552
11	0.251	0.125	0.111	0.307	0.280	0.104	0.174	0.128	0.079	0.487	0	0.112	0.139	0.270	0.189	0.262
12	0.262	0.094	0.110	0.251	0.192	0.195	0.092	0.118	0.099	0.461	0.112	0	0.158	0.261	0.166	0.211
13	0.179	0.155	0.212	0.299	0.220	0.188	0.158	0.116	0.086	0.414	0.139	0.158	0	0.194	0.288	0.323
14	0.111	0.258	0.309	0.206	0.179	0.318	0.196	0.260	0.214	0.312	0.270	0.261	0.194	0	0.378	0.369
15	0.392	0.141	0.093	0.292	0.295	0.144	0.220	0.190	0.235	0.562	0.189	0.166	0.288	0.378	0	0.185
16	0.385	0.206	0.237	0.270	0.306	0.269	0.216	0.231	0.287	0.552	0.262	0.211	0.323	0.369	0.185	0

Figure 5.5 Polythetic dendrogram based on Gowda–Diday dissimilarities (Example 5.1).

The monothetic divisive algorithm, based on the normalized cumulative density dissimilarity matrix of Eq. (4.2.47) (shown in Table 5.3), produces the tree of Figure 5.6; the polythetic tree (not shown) is similar by the end though the divisions/splits occur at different stages of the tree construction. This tree is also quite different from the tree of Figure 5.4, though both are built using the monothetic algorithm. Not only are the final clusters different, the cut criteria at each stage of the construction differ from each other. Even at the first division, the Gowda–Diday‐based tree has a cut criterion of “Is ”, whereas for the tree of Figure 5.6 based on the cumulative density dissimilarity matrix, the cut criterion is “Is ”. A hint of why these trees differ can be found by comparing the different dissimilarity matrices in Tables 5.2 and 5.3.
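One way to see that hint is to check who each airline's closest neighbour is under each dissimilarity. The sketch below does this for airlines 1 to 4 only, using the corresponding entries copied from Tables 5.2 and 5.3; the helper name `nearest_neighbours` is introduced here purely for illustration, and restricting attention to four airlines is a simplification, so neighbours over all 16 airlines may differ.

```python
# Sub-matrices for airlines 1-4, copied from Tables 5.2 and 5.3.
gowda_diday = [
    [0.0, 2.253, 2.929, 2.761],
    [2.253, 0.0, 1.119, 2.456],
    [2.929, 1.119, 0.0, 2.560],
    [2.761, 2.456, 2.560, 0.0],
]
cumulative_density = [
    [0.0, 0.261, 0.318, 0.210],
    [0.261, 0.0, 0.115, 0.241],
    [0.318, 0.115, 0.0, 0.304],
    [0.210, 0.241, 0.304, 0.0],
]


def nearest_neighbours(matrix):
    """For each row, return the 1-based label of the closest other object."""
    result = []
    for i, row in enumerate(matrix):
        others = [(d, j + 1) for j, d in enumerate(row) if j != i]
        result.append(min(others)[1])
    return result


nn_gd = nearest_neighbours(gowda_diday)
nn_cdf = nearest_neighbours(cumulative_density)
for airline, (a, b) in enumerate(zip(nn_gd, nn_cdf), start=1):
    print(f"airline {airline}: Gowda-Diday -> {a}, cumulative density -> {b}")
```

Within this subset, airlines 2 and 3 are each other's nearest neighbours under both measures, but airlines 1 and 4 change neighbours, which is the kind of disagreement that can propagate into different cuts.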

When an average‐link agglomerative algorithm (described in section 8.1) is applied to these same histogram data, the tree of Figure 5.7 emerges. As is apparent by comparison with the trees of Figures 5.4–5.6, this tree is different yet again. This tree is also different from that shown in Figure 5.8, obtained when using a complete‐link agglomerative algorithm based on the extended Ichino–Yaguchi dissimilarities for histogram data.
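The difference between the two linkage rules can be sketched with a naive agglomerative routine that works directly on a dissimilarity matrix. This is a minimal illustration, not the algorithm of section 8.1: the function `agglomerate` and its arguments are hypothetical names, only airlines 1 to 5 of Table 5.2 are used, and both linkages are run on the Gowda–Diday values (whereas Figure 5.8 is based on Ichino–Yaguchi dissimilarities).

```python
def agglomerate(dist, labels, linkage):
    """Naive agglomerative clustering on a full dissimilarity matrix.

    `linkage` maps the list of pairwise dissimilarities between two
    clusters to one value (max = complete link, mean = average link).
    Returns the merges, in order, as (member labels, merge height).
    """
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                height = linkage(pair)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        merges.append(([labels[i] for i in merged], height))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


def average(values):
    return sum(values) / len(values)


# Airlines 1-5, entries copied from Table 5.2.
table_5_2 = [
    [0.0, 2.253, 2.929, 2.761, 2.584],
    [2.253, 0.0, 1.119, 2.456, 1.914],
    [2.929, 1.119, 0.0, 2.560, 1.986],
    [2.761, 2.456, 2.560, 0.0, 1.830],
    [2.584, 1.914, 1.986, 1.830, 0.0],
]
airlines = [1, 2, 3, 4, 5]

complete_link = agglomerate(table_5_2, airlines, max)
average_link = agglomerate(table_5_2, airlines, average)
for name, merges in (("complete", complete_link), ("average", average_link)):
    for members, height in merges:
        print(f"{name:8s} merge {members} at height {height:.3f}")
```

On this small subset the two rules happen to merge in the same order and differ only in the merge heights; with all 16 airlines, and with different dissimilarities, the resulting trees differ as described above.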

Figure 5.6 Monothetic dendrogram based on cumulative density function dissimilarities (with cuts) (Example 5.1).

Figure 5.7 Average‐link agglomerative dendrogram based on Gowda–Diday dissimilarities (histograms) (Example 5.1).

Figure 5.8 Complete‐link agglomerative dendrogram based on Ichino–Yaguchi Euclidean dissimilarities (histograms) (Example 5.1).

In contrast, the number of individual observations is too large for any classical agglomerative algorithm to work on the classical data set directly. Were we to seek such a tree, we could take the sample means (shown in Table 5.4) of the aggregated values for each airline and each variable and use them in any classical agglomerative algorithm. The tree that results when using the complete‐link method on the Euclidean distances between these means is the tree of Figure 5.9. While the trees of Figures 5.8 and 5.9 were both built as complete‐link trees, the results are very different. These differences reflect the fact that when using just the means (as in Figure 5.9), a lot of the internal variation information is lost. However, by aggregating the internal values for each airline into histograms, much of this information is retained; that is, the tree of Figure 5.8 uses both the central mean values and the internal measures of variation. This can be illustrated, e.g., by looking at the cluster on the far right‐hand side of the tree of Figure 5.9. This tree is built on the means, and we see from Table 5.4 and Figure 5.10 (which shows the bi‐plot of the sample means) that these three airlines (7, 13, and 16) have comparable means. Furthermore, we also know from Table 5.4 that airline 13 has a much smaller sample variance for flight time ($Y_1$), while the corresponding sample variances for airlines 7 and 16 are much larger. Similar comparisons between these airlines can be made from Table 5.4. However, these variances are not used in the tree construction here, only the means. On the other hand, both the means and the variances, through the internal variations, are taken into account when constructing the symbolic‐based tree of Figure 5.8. In this tree, these particular airlines fall into three distinct parts (larger branches, left, middle, right) of the tree.
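To make concrete what a histogram retains beyond its mean, the sketch below computes the mean and the internal variance of a single histogram‐valued observation, assuming (as is usual for histogram data, but an assumption here) that values are uniformly spread within each sub‐interval. The function name is hypothetical; the first histogram is the air‐time row of airline 2 from the flights data, and the second is an invented histogram used only for contrast.

```python
def histogram_mean_and_variance(bins, probs):
    """Mean and variance of one histogram-valued observation.

    Assumes values are uniformly distributed inside each sub-interval
    [a, b), so each bin contributes E[X] = (a + b) / 2 and
    E[X^2] = (a^2 + a*b + b^2) / 3, weighted by its relative frequency.
    """
    mean = sum(p * (a + b) / 2 for (a, b), p in zip(bins, probs))
    second_moment = sum(
        p * (a * a + a * b + b * b) / 3 for (a, b), p in zip(bins, probs)
    )
    return mean, second_moment - mean ** 2


# Air-time histogram of airline 2 (its four non-empty sub-intervals).
air_time_bins = [(0, 70), (70, 110), (110, 150), (150, 190)]
airline_2 = [0.24826, 0.54412, 0.20365, 0.00397]
mean_2, var_2 = histogram_mean_and_variance(air_time_bins, airline_2)

# Hypothetical contrast: all flights fall in the single bin [70, 110).
concentrated = [0.0, 1.0, 0.0, 0.0]
mean_c, var_c = histogram_mean_and_variance(air_time_bins, concentrated)

print(f"airline 2:    mean {mean_2:.2f}, variance {var_2:.1f}")
print(f"concentrated: mean {mean_c:.2f}, variance {var_c:.1f}")
```

The two histograms have similar means but very different internal variances; a method that sees only the means would treat them as close neighbours.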

Table 5.4 Flights data: sample means and standard deviations (individuals) (Example 5.1)

Airline	Means					Standard deviations
	$Y_1$	$Y_2$	$Y_3$	$Y_4$	$Y_5$	$Y_1$	$Y_2$	$Y_3$	$Y_4$	$Y_5$
1	183.957	9.338	16.794	21.601	16.029	74.285	9.810	46.137	13.669	41.445
2	90.432	7.043	9.300	20.066	12.303	25.270	6.326	37.860	13.252	33.291
3	56.703	6.507	11.038	20.528	7.755	56.703	3.619	33.161	12.860	29.366
4	175.257	5.345	5.793	17.393	9.068	120.702	3.656	25.008	9.388	20.858
5	158.226	7.715	4.421	22.529	3.183	55.297	3.837	26.773	10.411	22.003
6	65.619	7.090	16.486	18.969	18.515	24.305	9.324	52.194	14.423	45.134
7	117.937	8.183	6.404	22.222	5.5431	70.147	6.032	26.700	13.618	20.885
8	98.906	9.111	8.713	17.966	12.662	29.935	8.244	37.140	9.569	34.142
9	93.355	8.355	14.342	22.028	11.693	28.190	5.063	38.516	11.175	34.701
10	293.591	9.780	10.522	19.562	9.075	41.300	12.197	37.942	10.888	32.951
11	73.689	8.100	17.840	20.941	14.171	30.956	7.060	40.788	13.040	34.866
12	89.934	8.114	10.997	21.323	5.180	36.885	4.324	29.889	13.757	23.727
13	107.329	9.464	15.856	21.533	12.907	32.546	9.779	50.134	10.806	41.337
14	178.243	6.223	14.059	23.269	12.754	99.283	4.968	44.831	14.289	40.294
15	57.498	5.540	−0.581	17.582	1.593	24.836	3.671	22.611	12.703	15.712
16	102.514	3.761	3.762	10.877	8.203	61.890	2.239	26.478	8.042	22.390

Figure 5.9 Complete‐link agglomerative dendrogram (means) (Example 5.1).

Figure 5.10 Means (Example 5.2).

Example 5.2

Suppose now the airline data used in Example 5.1 were prepared as 5–95% quantile intervals, as shown in Table 5.5. A symbolic partitioning on these intervals, again using Euclidean distances, produces the partition superimposed on the bi‐plot of Figure 5.11(a), in which the uncircled observations form one of the clusters (see section 6.3 for partitioning methods of interval data). A $k$‐means partition on the means data of Table 5.4, shown in Figure 5.11(b) (again with the uncircled observations forming one of the clusters), is different from the symbolic partitioning on the interval data. However, this partition on the means is consistent with the four major branches of the complete‐link tree of Figure 5.9, which was also based on the means. The plot of Figure 5.10 suggests that $Y_1$ = flight time is an important variable in the partitioning process. A $k$‐means partition with five clusters on the means data separates out a single airline as a one‐airline fifth cluster. Figure 5.10 further suggests that $Y_3$ = arrival‐delay time is also a defining factor, at least for some airlines. Indeed, we see from the monothetic trees built on the histograms (and so using internal variations plus the means) that after flight time, it is this arrival‐delay time that helps separate out the airlines.
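The role of flight time in a means‐based partition can be sketched with a plain Lloyd‐type $k$‐means iteration on the (flight time, arrival delay) means of Table 5.4. This is an illustration under stated assumptions, not the procedure behind Figure 5.11(b): it uses only two clusters, starts from the means of airlines 1 and 2 as hypothetical initial centroids, and applies no scaling to the variables.

```python
import math

# (flight time, arrival delay) sample means for airlines 1..16, from Table 5.4.
means = [
    (183.957, 16.794), (90.432, 9.300), (56.703, 11.038), (175.257, 5.793),
    (158.226, 4.421), (65.619, 16.486), (117.937, 6.404), (98.906, 8.713),
    (93.355, 14.342), (293.591, 10.522), (73.689, 17.840), (89.934, 10.997),
    (107.329, 15.856), (178.243, 14.059), (57.498, -0.581), (102.514, 3.762),
]


def k_means(points, start, max_iter=100):
    """Lloyd-type k-means from given starting centroids.

    Returns (labels, centroids); labels[i] is the index of the centroid
    closest to points[i] once the assignments stop changing.
    """
    centroids = list(start)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


labels, centroids = k_means(means, start=[means[0], means[1]])
long_haul = [i + 1 for i, lab in enumerate(labels) if lab == labels[9]]
print("airlines clustered with airline 10:", long_haul)
```

Because flight‐time means range over hundreds of minutes while arrival‐delay means span less than 20, the unscaled Euclidean distance is driven almost entirely by flight time, which is consistent with the observation drawn from Figure 5.10.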

Table 5.5 Flights data (5–95%) quantiles (Example 5.2)

Airline	$Y_1$	$Y_2$	$Y_3$	$Y_4$	$Y_5$	Number
1	[95, 339]	[3, 25]	[−24, 91]	[10, 43]	[−8, 79]	6340
2	[46, 128]	[2, 16]	[−28, 60]	[7, 41]	[−10, 55]	5590
3	[34, 87]	[3, 12]	[−19, 67]	[8, 42]	[−10, 57]	2256
4	[45, 462]	[2, 11]	[−22, 50]	[9, 34]	[−8, 45]	5228
5	[55, 219]	[4, 15]	[−24, 50]	[11, 44]	[−9, 41]	726
6	[39, 120]	[2, 21]	[−25, 120]	[7, 40]	[−10, 113]	3175
7	[36, 285]	[3, 18]	[−21, 54]	[10, 44]	[−8, 38]	5969
8	[38, 135]	[3, 22]	[−22, 74]	[7, 35]	[−6, 72]	352
9	[48, 135]	[4, 17]	[−23, 88]	[10, 44]	[−10, 85]	1320
10	[234, 365]	[3, 28]	[−30, 56]	[9, 35]	[−9, 48]	665
11	[36, 134]	[3, 20]	[−23, 92]	[8, 44]	[−11, 78]	4877
12	[43, 161]	[4, 15]	[−18, 55]	[9, 46]	[−10, 39]	2294
13	[85, 128]	[3, 26]	[−22, 115]	[10, 44]	[−10, 111]	578
14	[65, 350]	[3, 13]	[−23, 95]	[11, 46]	[−8, 86]	2357
15	[33, 100]	[2, 11]	[−23, 37]	[8, 36]	[−8, 28]	4232
16	[42, 230]	[2, 6]	[−20, 47]	[5, 20]	[0, 48]	2550

Figure 5.11 Partitioning on airlines: (a) Euclidean distances and (b) $k$‐means on means (Example 5.2).

Example 5.3

In Figure 5.12, we have the pyramid built on the 5–95% quantile intervals of Table 5.5. This highlights four larger clusters that appear in the pyramid tree. These plots show the overlapping clusters characteristic of pyramids: an airline can be contained in two of these clusters, whereas under the other methodologies each airline belongs to exactly one cluster. Notice also that the three airlines in one of these clusters appear in different parts of the relevant trees and partitions considered herein, e.g., in Figure 5.8 at the three‐cluster stage, these airlines are in three different clusters. However, the tree in Figure 5.9 was based on the means only and ignored the additional information contained in the internal variations of the observations.

Figure 5.12 Pyramid hierarchy on intervals (Example 5.3).

Example 5.4

Figure 5.13 displays simulated individual classical observations drawn from bivariate normal distributions. There are five samples, with the distribution parameters for each sample as shown in Table 5.6. Each of the samples can be aggregated to produce a histogram‐valued observation. When any of the divisive algorithms (see Chapter 7) are applied to these histogram data, three clusters emerge. In contrast, when such algorithms are applied to classical surrogates (such as the means), only two clusters emerge, and the $k$‐means algorithm on the individual observations gives the same two clusters. That is, the classical methods are unable to identify some of the observations as belonging to different clusters, as the information on the internal variations is lost.
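The mechanism behind this example can be sketched with a small simulation in which two samples share a centre but differ in spread. The parameter values, sample size, and seed below are hypothetical (they are not the values of Table 5.6), and the two coordinates are drawn independently, which is a simplification of a general bivariate normal.

```python
import random
import statistics

random.seed(0)


def simulate(n, mean, sd):
    """Draw n bivariate points with independent normal coordinates."""
    return [
        (random.gauss(mean[0], sd[0]), random.gauss(mean[1], sd[1]))
        for _ in range(n)
    ]


def summarize(sample):
    """Return ((mean_x, mean_y), (sd_x, sd_y)) for a list of 2-D points."""
    xs = [p[0] for p in sample]
    ys = [p[1] for p in sample]
    return (
        (statistics.fmean(xs), statistics.fmean(ys)),
        (statistics.stdev(xs), statistics.stdev(ys)),
    )


# Hypothetical parameters: same centre, different internal variation.
tight = simulate(2000, mean=(5.0, 5.0), sd=(1.0, 1.0))
wide = simulate(2000, mean=(5.0, 5.0), sd=(3.0, 3.0))

tight_mean, tight_sd = summarize(tight)
wide_mean, wide_sd = summarize(wide)
print("tight sample: mean", tight_mean, "sd", tight_sd)
print("wide sample:  mean", wide_mean, "sd", wide_sd)
```

A surrogate built from the means alone places these two samples almost on top of each other, whereas their histograms (or even just their standard deviations) clearly separate them.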

Figure 5.13 Five simulated histograms, clusters identified (Example 5.4).

Table 5.6 Simulated bivariate normal distribution parameters (Example 5.4)

Sample
1
2, 3
4, 5

5.5 Other Issues

The choice of an appropriate number of clusters is still an unsolved problem for partitioning and hierarchical procedures. In the context of classical data, several researchers have tried to provide answers, e.g., Milligan and Cooper (1985) and Tibshirani and Walther (2005). One approach is to run the algorithm for several specific numbers of clusters and then to compare the resulting approximate weight of evidence (AWE) suggested by Banfield and Raftery (1993) for hierarchies and Fraley and Raftery (1998) for mixture distribution model‐based partitions. Kim and Billard (2011) extended the Dunn (1974) and Davies and Bouldin (1979) indices to develop quality indices in the hierarchical clustering context for histogram observations, which can be used to identify a “best” number of clusters (see section 7.4).
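As a rough illustration of how such an index compares candidate partitions, the sketch below computes the classical Dunn index (smallest between‐cluster dissimilarity divided by largest within‐cluster diameter) directly from a dissimilarity matrix. It is the classical form only, not the Kim and Billard (2011) extension for histogram observations; the function name, the use of airlines 1 to 5 from Table 5.2, and the two candidate partitions are choices made here for illustration.

```python
def dunn_index(dist, partition):
    """Classical Dunn index for a partition given as lists of 0-based indices.

    Larger values indicate compact, well-separated clusters. At least one
    cluster must contain two or more objects, otherwise the diameter is zero.
    """
    separation = min(
        dist[i][j]
        for a in range(len(partition))
        for b in range(a + 1, len(partition))
        for i in partition[a]
        for j in partition[b]
    )
    diameter = max(
        dist[i][j] for cluster in partition for i in cluster for j in cluster
    )
    return separation / diameter


# Airlines 1-5, entries copied from Table 5.2.
table_5_2 = [
    [0.0, 2.253, 2.929, 2.761, 2.584],
    [2.253, 0.0, 1.119, 2.456, 1.914],
    [2.929, 1.119, 0.0, 2.560, 1.986],
    [2.761, 2.456, 2.560, 0.0, 1.830],
    [2.584, 1.914, 1.986, 1.830, 0.0],
]

two_clusters = [[0], [1, 2, 3, 4]]        # {1}, {2, 3, 4, 5}
three_clusters = [[0], [1, 2], [3, 4]]    # {1}, {2, 3}, {4, 5}

dunn_2 = dunn_index(table_5_2, two_clusters)
dunn_3 = dunn_index(table_5_2, three_clusters)
print(f"Dunn index, 2 clusters: {dunn_2:.3f}")
print(f"Dunn index, 3 clusters: {dunn_3:.3f}")
```

For this subset the three‐cluster partition scores higher, so under this index it would be preferred over the two‐cluster one; in practice the index would be evaluated over a range of cluster numbers.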

Questions around robustness and validation of obtained clusters are largely unanswered. Dubes and Jain (1979), Milligan and Cooper (1985), and Punj and Stewart (1983) provide some insights for classical data. Fränti et al. (2014) and Lisboa et al. (2013) consider these questions for the $k$‐means approach to partitions. Hardy (2007) and Hardy and Baume (2007) consider the issue with respect to hierarchies for interval‐valued observations.

Other topics not yet developed for symbolic data include, e.g., block modeling (see, e.g., Doreian et al. (2005) and Batagelj and Ferligoj (2000)), constrained or relational clustering where internal connectedness (such as geographical boundaries) is imposed (e.g., Ferligoj and Batagelj (1982, 1983, 1992)), rules‐based algorithms along the lines developed by Reynolds et al. (2006) for classical data, and analyses for large temporal and spatial networks explored by Batagelj et al. (2014).

In a different direction, decision‐based trees have been developed by a number of researchers, such as the classification and regression trees (CART) of Breiman et al. (1984) for classical data, in which a dependent variable is a function of predictor variables as in standard regression analysis. Limam (2005), Limam et al. (2004), Winsberg et al. (2006), Seck (2012), and Seck et al. (2010) have considered CART methods for symbolic data, but much more remains to be done.

In a nice review of model‐based clustering for classical data, Fraley and Raftery (1998) introduce a partitioning approach by combining hierarchical clustering with the expectation‐maximization algorithm of Dempster et al. (1977). A stochastic expectation‐maximization algorithm was developed in Celeux and Diebolt (1985), a classification expectation‐maximization algorithm was introduced by Celeux and Govaert (1992), and a Monte Carlo expectation‐maximization algorithm was described by Wei and Tanner (1990). How these might be adapted to a symbolic‐valued data setting is left as an open problem.

Other exploratory methods include factor analysis and principal component analysis, both based on eigenvalues/eigenvectors of the correlation matrices, or projection pursuit (e.g., Friedman, 1987). Just as different distance measures can produce different hierarchies, so can the different approaches (clustering per se compared to eigenvector‐based methods) produce differing patterns. For example, in Chapter 7, Figure 7.6 shows the hierarchy that emerged when a monothetic divisive clustering approach (see section 7.2, Example 7.7) was applied to the temperature data set of Table 7.9. The resultant clusters are different from those obtained in Billard and Le‐Rademacher (2012) when a principal component analysis was applied to those same data (see Figure 5.14).

Figure 5.14 Principal component analysis on China temperature data of Table 7.9.
