Chapter 3

Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces

Alexander Thomasian

Abstract

Representing objects such as images by their feature vectors and searching for similarity according to the distances of the points representing them in high-dimensional space via k-nearest neighbors (k-NNs) to a target image is a popular paradigm. We discuss a combination of singular value decomposition (SVD), clustering, and indexing to reduce the cost of processing k-NN queries for large data sets with high-dimensional data. We first review dimensionality reduction methods with emphasis on SVD and related methods, followed by a survey of clustering and indexing methods for high-dimensional numerical data. We describe combining SVD and clustering as a framework and the main memory-resident ordered partition (OP)-tree index to speed up k-NN queries. We discuss techniques to save the OP-tree on disk and specify the stepwise dimensionality increasing (SDI) index suited for k-NN queries on dimensionally reduced data.

3.1 Introduction

IBM’s Query by Image Content (QBIC) project in the 1990s [1] utilized content-based image retrieval (CBIR) or similarity search based on features extracted from images. We are concerned with the storage space and processing requirements for CBIR, which are important because of the high data volumes and the associated processing requirements, especially if the similarity search is conducted by a naive sequential scan of the whole data set. The clustering with singular value decomposition (CSVD) method and the ordered partition (OP)-tree index [2] are used in Reference 3 in applying CBIR to indexing satellite images [4].

CBIR has two steps [5]: (1) Characterize images by their features and define similarity measures. (2) Extract features such as color, texture, and shape [6]. Texture features include the fractal dimension, coarseness, entropy, circular Moran autocorrelation functions, and spatial gray-level difference (SGLD) statistics [7]. In addition to photographic, medical, and satellite images, content-based retrieval is used in conjunction with audio and video clips and so forth.

Given a target image represented by a point in high-dimensional space, images similar to it can be determined via a range query, which determines all points within a certain radius. Since a range query with a small radius may not yield any points, k-nearest neighbor (k-NN) queries are used instead [8]. The distance may be based on the Euclidean or more complex distance functions, such as the Mahalanobis distance [8]. k-NN queries can be implemented by a sequential scan of the data set holding the feature vectors of all images, while keeping track of the nearest k. The naive algorithm has a running time of O(MN), where M is the cardinality of the data set (number of images) and N is the dimensionality (number of features). Amazingly, naive search may outperform multidimensional indexing methods in higher-dimensional data spaces [9].

Clustering is one method to reduce the cost of k-NN search from the viewpoint of CPU and disk processing. Given H clusters with an equal number of points per cluster, the cost of k-NN queries is reduced H-fold, provided that only the data points in one cluster are to be searched. A survey of clustering methods with emphasis on disk-resident high-dimensional data is provided in this chapter. With the advent of storage-class memory (SCM) [10], more and more data are expected to migrate from magnetic disks [11] to such memory.

Clusters are usually envisioned as hyperspheres represented by their centroid and radius (the distance of the farthest point from the centroid). A query point may reside in the volume shared by two or more hyperspheres or hypercubes depending on the indexing structure, so that multiple clusters need to be searched to locate the k-NNs. k-NN search has to be trimmed as much as possible to reduce query processing cost [3]. Indexing is a form of clustering; however, most indexing structures, such as those in the R-tree family, perform poorly in higher-dimensional data spaces [8]. More details on this topic appear in Section 3.7.

Principal component analysis (PCA) [12], singular value decomposition (SVD), and Karhunen–Loève transform (KLT) are related methods to reduce the number of dimensions of high-dimensional data after rotating coordinates to attain minimal loss in information for the desired level of data compression [8,13]. Dimensionality reduction can be optionally applied to the original data set before applying clustering. Dimensionality reduction applied to clusters takes advantage of local correlations, resulting in a smaller normalized mean square error (NMSE) for the same reduction in data volume as when SVD is applied to the whole data set. The method is therefore called recursive clustered SVD (RCSVD), although SVD was applied to clusters in References 3 and 14.

As an example of how combining clustering with SVD results in higher dimensionality reduction than applying SVD without clustering, consider points surrounding multiple nonintersecting lines in three-dimensional space: the points of each cluster can be specified by a single dominant dimension, that is, a threefold reduction in space requirements and also in the computational cost of k-NN queries. Experimentation with several real-life data sets has shown lower information loss for the same compression ratio, or vice versa, when clustering is combined with SVD.

The local dimensionality reduction (LDR) method [15] finds local correlations and performs dimensionality reduction on these data. However, it is shown in Reference 3 that CSVD with an off-the-shelf clustering method, such as k-means, outperforms LDR.

A further reduction in cost can be attained by reducing the number of points to be considered by building an index for the data in each cluster. It turns out that popular indexes such as R-trees are inefficient in processing k-NN queries in high dimensions. Indexing methods are surveyed in References 16 and 17, but a more focused and up-to-date survey for high-dimensional data is provided in this chapter.

The chapter is organized as follows. Section 3.2 discusses the issue of dimensionality reduction. Section 3.3 presents clustering methods with emphasis on large, high-dimensional, disk-resident data sets. The method in Reference 3 to build the data structures for k-NN queries is described in Section 3.4. Section 3.5 specifies the method for determining similarity. Section 3.6 discusses the LDR method in Reference 15. Section 3.7 surveys indexing structures. Section 3.8 summarizes the discussion in the chapter. A more efficient method [18] to compute the Euclidean distance on data dimensionally reduced with SVD is given in the appendix.

3.2 Data Reduction Methods and SVD

Data reduction methods are of interest in several domains, such as query cost estimation in relational databases [19], data mining, decision support, online analytical processing (OLAP) [20], data compression, information retrieval, and data visualization. We list the data reduction methods surveyed in Reference 21 as follows:

  1. SVD, PCA, and KLT are related dimensionality reduction methods, which are discussed in this section.
  2. Discrete wavelet transform (DWT) has many variations [13], such as Daubechies transform [22]. It has been utilized in Reference 23 for data compression in a relational database.
  3. Regression, in its simplest form, is the linear regression of a single variable model, that is, variable Y as a linear function of variable X, Y = α + βX, but regression can be extended to more than one variable [13].
  4. Log-linear models are a methodology for approximating discrete multidimensional probability distributions; for example, the multiway table of joint probabilities is approximated by a product of lower-order tables [24].
  5. Histograms approximate the data in one or more attributes of a relation by their frequencies [25].
  6. Clustering techniques are discussed in Section 3.3.
  7. Indexing structures are discussed in Section 3.7.
  8. Sampling is an attempt to represent a large set of data by a small random sample of its data elements. Sampling in databases is reviewed in Reference 26, and Reference 27 is a representative publication.

We consider only SVD and related methods, in the context of References 3 and 14. Consider matrix X, whose M rows are the feature vectors for images. Each image has N features, which is the number of columns in the matrix. Subtracting the mean from the elements in each column yields columns with a zero mean. The columns may be further studentized by dividing the elements by the standard deviation. The SVD of X is given as follows [8,28]:

X = UΣV^T, (3.1)

where U is an M × N column-orthonormal matrix, V is an N × N unitary matrix of eigenvectors, and Σ is a diagonal matrix of singular values, which, without loss of generality, are in nonincreasing order (σ1 ≥ σ2 ≥ … ≥ σN). If n singular values are nonzero (or not close to zero), then the rank of X is n [8]. The cost of computing the singular values via SVD is O(MN²). The following example shows the decomposition of X and that it has a rank of three, since two of its eigenvalues are zero.

X = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \end{pmatrix}, \quad U = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 4 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & \sqrt{5} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad V^T = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ \sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\ 0 & 0 & 0 & 1 & 0 \\ -\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2} \end{pmatrix}
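To make the decomposition above concrete, the following short Python sketch (assuming NumPy is available; all variable names are illustrative) reproduces the factorization numerically and confirms that X has rank 3, that is, that one singular value, and hence two eigenvalues of X^T X, vanish.

```python
import numpy as np

# The example matrix X (4 rows = points, 5 columns = features).
X = np.array([
    [1, 0, 0, 0, 2],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
], dtype=float)

# Full SVD: X = U @ Sigma @ Vt, where S holds the singular values.
U, S, Vt = np.linalg.svd(X, full_matrices=True)
print("singular values:", np.round(S, 4))      # [4. 3. 2.2361 0.], 2.2361 = sqrt(5)
print("rank of X:", np.linalg.matrix_rank(X))  # 3

# Reconstruct X from the factors to verify the decomposition.
Sigma = np.zeros_like(X)
np.fill_diagonal(Sigma, S)
assert np.allclose(U @ Sigma @ Vt, X)

# Eigenvalues of X^T X are the squared singular values; two of them are zero.
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print("eigenvalues of X^T X:", np.round(eigvals, 4))  # [16. 9. 5. 0. 0.]
```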

PCA is based on eigenvalue decomposition of the covariance matrix:

C = \frac{1}{M} X^T X = V \Lambda V^T, (3.2)

where V is the matrix of eigenvectors (as in Equation 3.1) and Λ is a diagonal matrix holding the eigenvalues of C. C is positive-semidefinite, so that its N eigenvectors are orthonormal and its eigenvalues are nonnegative. The trace (sum of eigenvalues) of C is invariant under rotation. The computational cost in this case is higher than for SVD: it requires a matrix multiplication costing O(MN²) followed by an eigenvalue decomposition costing O(N³).

We assume that the eigenvalues, similar to the singular values, are in nonincreasing order, so that λ1 ≥ λ2 ≥ … ≥ λN. To show the relation between singular values and eigenvalues, we substitute X in Equation 3.2 with its decomposition in Equation 3.1.

C = \frac{1}{M} X^T X = \frac{1}{M} (V \Sigma U^T)(U \Sigma V^T) = \frac{1}{M} V \Sigma^2 V^T.

It follows that

\Lambda = \frac{1}{M} \Sigma^2. (3.3)

In effect, \lambda_i = \sigma_i^2 / M and, conversely, \sigma_i = \sqrt{M \lambda_i} for 1 ≤ i ≤ N.

The eigenvectors constitute the principal components of X; hence, the transformation below yields uncorrelated features:

Y = XV. (3.4)

Retaining the first n dimensions of Y maximizes the fraction of preserved variance and minimizes the NMSE:

\text{NMSE} = \frac{\sum_{i=1}^{M} \sum_{j=n+1}^{N} y_{i,j}^2}{\sum_{i=1}^{M} \sum_{j=1}^{N} y_{i,j}^2} = \frac{\sum_{j=n+1}^{N} \lambda_j}{\sum_{j=1}^{N} \lambda_j}. (3.5)
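The relations in Equations 3.1 through 3.5 can be exercised with a brief NumPy sketch: it centers the data, obtains V and the eigenvalues via SVD, rotates the data as in Equation 3.4, and evaluates the NMSE of Equation 3.5 when only the first n dimensions are retained. All names here are illustrative.

```python
import numpy as np

def pca_via_svd(X):
    """Center X (M x N); return rotated data Y = X V and eigenvalues of C = X^T X / M."""
    M, N = X.shape
    Xc = X - X.mean(axis=0)                 # zero-mean columns
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = (S ** 2) / M                  # Equation 3.3: lambda_i = sigma_i^2 / M
    Y = Xc @ Vt.T                           # Equation 3.4: rotate into the uncorrelated frame
    return Y, eigvals

def nmse_retaining(eigvals, n):
    """Equation 3.5: NMSE when the first n (largest-eigenvalue) dimensions are kept."""
    return eigvals[n:].sum() / eigvals.sum()

# Example with synthetic correlated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))
Y, lam = pca_via_svd(X)
for n in (2, 4, 6):
    print(f"n={n}: NMSE = {nmse_retaining(lam, n):.4f}")
```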

The computational methods in References 13 and 29 were used in References 3 and 14, respectively. A method more efficient than that of Reference 18 for computing the Euclidean distance on dimensionally reduced data with SVD was used in References 3 and 14; it is specified in the appendix.

Recall and precision are two metrics that are used to quantify retrieval accuracy and efficiency for approximate k-NN search with dimensionally reduced data [8]. Since SVD is a lossy data compression method, to obtain k points, we retrieve k* ≥ k points. Given that k′ of the retrieved points are among the k desired nearest neighbors, k′ ≤ k ≤ k*. Precision (efficiency) is defined as P = k′/k*, and recall (accuracy) is defined as R = k′/k. Thus, precision is the percentage of relevant images among the retrieved ones, that is, the ratio of retrieved-and-relevant to retrieved. Recall is the percentage of relevant images that were retrieved over the total number of relevant images in the image collection, so that recall is the ratio of retrieved-and-relevant to relevant. Thus, high precision means that we have few false alarms; high recall means that we have few false dismissals. Since k-NN search on dimensionally reduced data is inherently approximate, so that R < 1, postprocessing is required to achieve exact k-NN query processing on dimensionally reduced data [30].
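As a small illustration of these definitions (a sketch with illustrative names, not the chapter's code), the function below computes precision and recall when k* candidates retrieved in the reduced-dimensionality space are compared against the true k nearest neighbors in the original space.

```python
def precision_recall(retrieved_ids, true_knn_ids):
    """retrieved_ids: the k* candidates from the reduced space;
    true_knn_ids: the k exact nearest neighbors in the original space."""
    k_star = len(retrieved_ids)
    k = len(true_knn_ids)
    k_prime = len(set(retrieved_ids) & set(true_knn_ids))  # retrieved and relevant
    precision = k_prime / k_star   # P = k'/k*  (few false alarms when high)
    recall = k_prime / k           # R = k'/k   (few false dismissals when high)
    return precision, recall

# Example: k = 3 true neighbors, k* = 5 retrieved candidates.
print(precision_recall([7, 2, 9, 4, 1], [2, 4, 8]))  # (0.4, 0.666...)
```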

3.3 Clustering Methods

Cluster analysis is the process of partitioning a set of objects specified by points in multiple dimensions into clusters, so that the objects in a cluster are similar to each other but dissimilar from objects in other clusters. The requirements for a function on pairs of points to be a distance measure are as follows: (1) Distances are nonnegative, and the distance of a point to itself is zero. (2) Distances are symmetric. (3) Distance measures obey the triangle inequality, that is, the distance from A to B plus the distance from B to C is no smaller than the distance going from A to C directly. Similarity is defined by the Euclidean or some other form of distance, in which case the three conditions are satisfied, but in some cases, distance is given directly.

As noted in Section 3.1, the cost of processing k-NN queries using the naive method can be reduced by applying clustering. With H clusters, the cost is reduced by a factor of H, provided that only one cluster is to be searched and all clusters have the same number of points. An even higher reduction in cost is attained when the query point belongs to a small cluster.

A large number of clustering algorithms, which are applicable to various domains, have been developed over the years and are discussed in numerous textbooks dedicated to this topic [31–34] but also in texts on data mining [20] and Big Data [28,35]. Clustering algorithms over numerical data are of interest in our discussion. Four problems need to be overcome for clustering high-dimensional data [36].

  1. The relative contrast in distances, (Dmax − Dmin)/Dmin, where Dmax and Dmin are the distances to the farthest and nearest points from a target point, decreases with increasing dimensionality. Setting 0 < p ≤ 1 in the distance measure below has been proposed in References 37 and 38 to provide more meaningful distance measures (a short computation follows this list).

    D(u,v) = \left[ \sum_{i=1}^{N} |u_i - v_i|^p \right]^{1/p}. (3.6)

    p = 1 yields the Manhattan distance and p = 2 the Euclidean distance.

  2. Clusters are intended to group objects based on attribute values, but some of the large number of attributes may not be meaningful for a given cluster. This is known as the local feature relevance problem: Different clusters might be found in different subspaces.
  3. Subsets of attributes might be correlated; hence, clusters might exist in arbitrarily oriented affine subspaces (in affine geometry, the coordinates are not necessarily orthonormal).
  4. This problem is similar to problem 3 in that there may be some correlations among subsets of attributes.
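The sketch below (illustrative names, assuming NumPy) evaluates the distance of Equation 3.6 for several values of p, including a fractional value as proposed in References 37 and 38.

```python
import numpy as np

def minkowski(u, v, p):
    """Equation 3.6: D(u, v) = (sum_i |u_i - v_i|^p)^(1/p)."""
    return np.sum(np.abs(np.asarray(u) - np.asarray(v)) ** p) ** (1.0 / p)

u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
for p in (0.5, 1, 2):   # fractional, Manhattan, Euclidean
    print(f"p={p}: D = {minkowski(u, v, p):.4f}")
```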

Given n points and k clusters, the number of possible clusters is a Stirling number of the second kind defined as follows:

S(n,k) = \left\{ {n \atop k} \right\} = \frac{1}{k!} \sum_{t=0}^{k} (-1)^t \binom{k}{t} (k-t)^n.

S(n, 1) = S(n, n) = 1, S(4, 2) = 7 because there are seven ways to partition four objects, a, b, c, d, into two groups:

(a)(bcd), (b)(acd), (c)(abd), (d)(abc), (ab)(cd), (ac)(bd), (ad)(bc)

Recursion can be used to compute S(n, k) = kS(n − 1, k) + S(n − 1, k − 1), 1 ≤ k < n, and to show that S(n, k) increases rapidly, for example, S(10, 5) = 42525.
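The recursion can be checked with a few lines of Python (a minimal sketch; memoization keeps it efficient):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: ways to partition n objects into k groups."""
    if k == 0:
        return 1 if n == 0 else 0
    if k == n:
        return 1
    if k > n:
        return 0
    # S(n, k) = k * S(n-1, k) + S(n-1, k-1)
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(4, 2))   # 7
print(stirling2(10, 5))  # 42525
```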

Cluster analysis can be classified into the following methods.

3.3.1 Partitioning Methods

Partition n data points based on their distances from each other into k clusters. Each cluster has at least one point, and each point belongs to one cluster, unless a fuzzy clustering algorithm is considered, in which case the allocation of data points to clusters is not all or nothing [39]. In fact, the expectation maximization (EM) method provides soft assignment of points to clusters, such that each point has a probability of belonging to each cluster.

k-means is a good example of partitioning-based clustering methods, which can be specified succinctly as follows [40]:

  1. Randomly select H points from the data set to serve as preliminary centroids for clusters. Centroid spacing is improved by using the method in Reference 41.
  2. Assign points to the closest centroid based on the Euclidean distance to form H clusters.
  3. Recompute the centroid for each cluster.
  4. Go back to step 2, unless there is no new reassignment.

The quality of the clusters varies significantly from run to run, so that the clustering experiment is repeated and the run yielding the smallest sum of squared (SSQ) distances to the centroids of all clusters is selected:

\mathrm{SSQ} = \sum_{h=1}^{H} \sum_{i \in C^{(h)}} \sum_{j=1}^{N} \left( x_{i,j} - \mu_j^{(h)} \right)^2. (3.7)

C(h) denotes the hth cluster, with centroid Ch, whose coordinates are given by μ(h), and radius Rh, which is the distance of the farthest point in the cluster from Ch. SSQ decreases with increasing H, but a point of diminishing returns is soon reached. The k-means clustering method does not scale well to large high-dimensional data sets [42].
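A compact sketch of these four steps, including the repetition over several random initializations and selection by the SSQ of Equation 3.7, is shown below (plain NumPy, illustrative names; a clustering library would normally be used instead).

```python
import numpy as np

def kmeans(X, H, n_iter=100, rng=None):
    """One k-means run: returns (assignments, centroids, SSQ) for H clusters."""
    if rng is None:
        rng = np.random.default_rng()
    centroids = X[rng.choice(len(X), size=H, replace=False)]     # step 1
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                # step 2
        new_centroids = np.array([X[labels == h].mean(axis=0) if np.any(labels == h)
                                  else centroids[h] for h in range(H)])  # step 3
        if np.allclose(new_centroids, centroids):                # step 4
            break
        centroids = new_centroids
    ssq = ((X - centroids[labels]) ** 2).sum()                   # Equation 3.7
    return labels, centroids, ssq

def best_of_runs(X, H, runs=10, seed=0):
    """Repeat the clustering experiment and keep the run with the smallest SSQ."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, H, rng=rng) for _ in range(runs)), key=lambda r: r[2])

X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 3)) for c in (0, 5, 10)])
labels, centroids, ssq = best_of_runs(X, H=3)
print("best SSQ:", round(float(ssq), 2))
```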

Clustering Large Applications Based on Randomized Search (CLARANS) [43] extends Clustering Large Applications (CLARA), described in Reference 33. Clustering is carried out with respect to representative points referred to as medoids. This method, also known as k-medoids, is computationally expensive because (1) each iteration requires trying out new partitioning representatives through an exchange process and (2) a large number of passes over the data set may be required. CLARANS improves efficiency by performing iterative hill climbing over a smaller sample.

3.3.2 Hierarchical Clustering

This is also called connectivity-based clustering since it connects objects to form clusters based on their distance. Hierarchical clustering methods fall into two categories:

  1. Agglomerative or bottom up, where at first, each point is its own cluster, and pairs of clusters are merged as one moves up the hierarchy
  2. Divisive or top down, where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy

Agglomerative clustering is O(N³), although faster implementations are known, while divisive clustering is O(2^N).

Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) can find a good clustering with a single scan of data with a limited amount of main memory available [44]. It improves the quality further with a few additional scans. A hierarchical data structure called CF-tree, which is a height-balanced tree, is used to store the clustering features (CFs). It is an order-sensitive clustering algorithm, which may detect different clusters for different input orders of the same data. BIRCH can only generate spherically shaped clusters.

Consider a cluster of m N-dimensional data objects or points. The clustering feature is CF = (m, LS, SS), where LS and SS are the N-dimensional vectors of the linear sums and square sums of the m points:

LS_j = \sum_{i=1}^{m} x_{i,j}, \qquad SS_j = \sum_{i=1}^{m} x_{i,j}^2, \qquad 1 \le j \le N.

In the first phase, BIRCH scans the database to build an initial in-memory CF-tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of data. In the second phase, BIRCH applies a clustering algorithm to cluster the leaf nodes of the CF-tree, which removes sparse clusters as outliers and groups dense clusters into larger ones. Using CF, we can easily derive many useful statistics of a cluster: the cluster’s centroid, radius, and diameter. To merge two clusters C1 and C2, we simply have the following:

CF = CF_1 + CF_2 = (m_1 + m_2, LS_1 + LS_2, SS_1 + SS_2).

For two dimensions X and Y, we have the following:

LS = (LS_X, LS_Y) = \left( \sum_{i=1}^{m} x_i, \sum_{i=1}^{m} y_i \right), \qquad SS = (SS_X, SS_Y) = \left( \sum_{i=1}^{m} x_i^2, \sum_{i=1}^{m} y_i^2 \right).

The centroid is (x0, y0) = (LS_X/m, LS_Y/m), and the radius (the root mean square distance of the points from the centroid) is as follows:

R = \sqrt{ \frac{1}{m} (SS_X + SS_Y) - \frac{1}{m^2} (LS_X^2 + LS_Y^2) }.

Generalizing to N dimensions with subscript 1 ≤ nN,

R = \sqrt{ \frac{1}{m} \sum_{n=1}^{N} SS_n - \frac{1}{m^2} \sum_{n=1}^{N} LS_n^2 }.
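The CF arithmetic above can be illustrated with a short sketch (illustrative names; the radius uses the expansion SS/m − ‖LS‖²/m², consistent with the formulas above):

```python
import numpy as np

def make_cf(points):
    """Clustering feature CF = (m, LS, SS) for an array of m N-dimensional points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """CF of the union of two clusters: component-wise sums."""
    (m1, ls1, ss1), (m2, ls2, ss2) = cf1, cf2
    return m1 + m2, ls1 + ls2, ss1 + ss2

def centroid_and_radius(cf):
    """Centroid LS/m and radius sqrt(sum(SS)/m - sum(LS^2)/m^2)."""
    m, ls, ss = cf
    centroid = ls / m
    radius = np.sqrt(ss.sum() / m - (ls ** 2).sum() / m ** 2)
    return centroid, radius

a = make_cf([[0, 0], [2, 0], [0, 2], [2, 2]])
b = make_cf([[10, 10], [12, 12]])
print(centroid_and_radius(a))            # centroid [1. 1.], radius sqrt(2)
print(centroid_and_radius(merge_cf(a, b)))
```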

An algorithm similar to BIRCH is described in Reference 45.

Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters [46]. Cluster similarity is assessed based on (1) how well connected objects are within a cluster and (2) the proximity of clusters. Two clusters are merged if their interconnectivity is high and they are close together. A k-NN graph approach constructs a sparse graph, where each vertex of the graph represents a data object, and there is an edge between two vertices if one object is among the k most similar objects to the other. The edges are weighted to reflect the similarity between objects. A graph-partitioning algorithm partitions the k-NN graph into a large number of relatively small subclusters such that it minimizes the cost of cutting edges [47]. Chameleon is more capable of discovering arbitrarily shaped clusters of high quality than several well-known algorithms such as BIRCH and density-based spatial clustering of applications with noise (DBSCAN) (see also FastMap [FM] in Reference 8).

Clustering Using Representatives (CURE) uses multiple representations to denote each cluster, so that clusters with nonspherical shapes can be discovered and there is less sensitivity to outliers [48]. The representatives are generated by selecting well-scattered points in the cluster and then shrinking them toward the center of the cluster by a specified fraction. At each iteration, the process merges two clusters until the number of clusters reaches the specified value. CURE uses a minimum heap to keep track of distances between one cluster and its closest cluster and one k-d-tree [49] to keep track of the representatives in all clusters. A heap extracts the cluster with the minimum distance by searching the nearest neighbors for all representatives in a cluster. CURE tends to be slow due to its dependence on the k-d-tree when handling large high-dimensional data sets.

BUBBLE/FM is an extension of BIRCH to distance spaces. The only operation possible on data objects is the computation of the distance between them, while all scalable algorithms in the literature assume a k-dimensional vector space, which allows vector operations on objects [50]. The distance function associated with a distance space can be computationally expensive. For example, the distance between two strings is the edit distance, which, for strings of length N, requires O(N²) comparisons. BUBBLE-FM improves upon BUBBLE by reducing the number of calls to the distance function, which tends to be computationally expensive.

3.3.3 Density-Based Methods

Most partitioning methods for cluster objects based on the distance between objects can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. The general idea behind density-based clustering methods is to continue growing a given cluster as long as the neighborhood density exceeds some threshold, so that clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas are usually considered to be noise and border points. Density-based methods can divide a set of objects into multiple exclusive clusters or a hierarchy of clusters.

DBSCAN is so called because it finds a number of clusters starting from the estimated density distribution of corresponding nodes [51]. DBSCAN’s definition of a cluster is based on the notion of density reachability, that is, a point q is directly density-reachable from a point p if it is not farther away than a given distance ε.

Ordering Points to Identify the Clustering Structure (OPTICS) [52] can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius. To detect meaningful data clusters of varying densities, the points of the database are (linearly) ordered, and points that are spatially closest become neighbors in the ordering.

3.3.4 Grid-Based Methods

These methods are based on quantizing the object space into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure in quantized space. The main advantage of this approach is fast processing time, which is linearly proportional to the number of data objects. Grid-based methods can be integrated with other clustering methods.

A statistical information grid (STING) is a grid-based multiresolution clustering technique in which the embedding spatial area of the input objects is divided into rectangular cells, which can be divided in a hierarchical and recursive way [53]. Higher-level cells are partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell, such as the mean, maximum, and minimum values, is precomputed.

Clustering in Quest (CLIQUE) is a simple grid-based method and is also a subspace clustering method [54], which is discussed further in Section 3.3.5.

WaveCluster is a multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure [55]. It then applies a wavelet transform to the original feature space, finding dense regions in the transformed space. In this approach, each grid cell summarizes the information of a group of points that map into the cell. This summary information typically fits into the main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis.

Optimal grid partitioning (OptiGrid) constructs an optimal grid on data by calculating the best partitioning hyperplanes for each dimension [56]. The advantages of OptiGrid are as follows: (1) It has a firm mathematical basis. (2) It is by far more effective than existing clustering algorithms for high-dimensional data. (3) It is very efficient even for large data sets of high dimensionality.

3.3.5 Subspace Clustering Methods

Subspace clustering methods find clusters in different, lower-dimensional subspaces of a data set, taking advantage of the fact that many dimensions in high-dimensional data sets are redundant and hide clusters in noisy data. The curse of dimensionality also makes distance measures meaningless, to the extent that data points become nearly equidistant from each other.

A recent survey on subspace clustering clarifies the following issues [36]: (1) different problem definitions related to subspace clustering; (2) specific difficulties encountered in this field of research; (3) varying assumptions, heuristics, and intuitions forming the basis of different approaches; and (4) prominent solutions for tackling different problems.

The bottom-up search method takes advantage of the downward closure property of density to reduce the search space, using an a priori-style approach in data mining [20], which means that if there are dense units in k dimensions, there are dense units in all (k − 1) dimensional projections. Algorithms first create a histogram for each dimension and select those bins with densities above a given threshold. Candidate subspaces in two dimensions can then be formed using only those dimensions that contain dense units, dramatically reducing the search space. The algorithm proceeds until there are no more dense units found. Adjacent dense units are then combined to form clusters.

CLIQUE is the first subspace clustering algorithm combining density and grid-based clustering [54], which was developed in conjunction with the Quest data mining project at IBM [57]. An a priori-style search method was developed for association rule mining (ARM) to find dense subspaces [54]. Once the dense subspaces are found, they are sorted by coverage; only subspaces with the greatest coverage are kept, and the rest are pruned. The algorithm then looks for adjacent dense grid units in each of the selected subspaces using a depth-first search. Clusters are formed by combining these units using a greedy growth scheme. The algorithm starts with an arbitrary dense unit and greedily grows a maximal region in each dimension, until the union of all the regions covers the entire cluster. The weakness of this method is that the subspaces are aligned with the original dimensions.

The Entropy-Based Clustering Method (ENCLUS) is based heavily on the CLIQUE algorithm, but instead of measuring density or coverage, it measures entropy based on the observation that a subspace with clusters typically has lower entropy than a subspace without clusters, that is, entropy decreases as cell density increases [58]. ENCLUS uses the same a priori-style, bottom-up approach as CLIQUE to mine significant subspaces. Entropy can be used to measure all three clusterability criteria: coverage, density, and correlation. The search is accomplished using the downward closure property of entropy and the upward closure property of correlation to find minimally correlated subspaces. If a subspace is highly correlated, all of its superspaces must not be minimally correlated. Since non–minimally correlated subspaces might be of interest, ENCLUS searches for interesting subspaces by calculating interest gain and finding subspaces whose entropy exceeds a certain threshold. Once interesting subspaces are found, clusters can be identified using the same methodology as CLIQUE or any other existing clustering algorithms.

Merging of Adaptive Finite Intervals (and more than a CLIQUE) (MAFIA) extends CLIQUE by using an adaptive grid based on the distribution of data to improve efficiency and cluster quality [59]. MAFIA initially creates a histogram to determine the minimum number of bins for a dimension. The algorithm then combines adjacent cells of similar density to form larger cells. In this manner, the dimension is partitioned based on the data distribution, and the resulting boundaries of the cells capture the cluster perimeter more accurately than fixed-sized grid cells. Once the bins have been defined, MAFIA proceeds much like CLIQUE, using an a priori-style algorithm to generate the list of clusterable subspaces by building up from one dimension.

Cell-based clustering (CBF) addresses scalability issues associated with many bottom-up algorithms [60]. One problem for other bottom-up algorithms is that the number of bins created increases dramatically as the number of dimensions increases. CBF uses a cell creation algorithm that creates optimal partitions based on minimum and maximum values on a given dimension, which results in the generation of fewer bins. CBF also addresses scalability with respect to the number of instances in the data set, since other approaches often perform poorly when the data set is too large to fit in the main memory. CBF stores the bins in an efficient filtering-based index structure, which results in improved retrieval performance.

A cluster tree (CLTree) is built using a bottom-up strategy, evaluating each dimension separately and using dimensions with high density in further steps [61]. It uses a modified decision tree algorithm to adaptively partition each dimension into bins, separating areas of high density from areas of low density. The decision tree splits correspond to the boundaries of bins.

Density-Based Optimal Projective Clustering (DOC) is a Monte Carlo algorithm combining the grid-based approach used by the bottom-up approaches and the iterative improvement method from the top-down approaches [62].

Projected clustering (PROCLUS) is a top-down subspace clustering algorithm [63]. Similar to CLARANS, PROCLUS samples the data and then selects a set of k medoids and iteratively improves the clustering. The three phases of PROCLUS are as follows. (1) Initialization phase: Select a set of potential medoids that are far apart using a greedy algorithm. (2) Iteration phase: Select a random set of k medoids from this reduced data set to determine if clustering quality improves by replacing current medoids with randomly chosen new medoids. Cluster quality is based on the average distance between instances and the nearest medoid. For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation. Once the subspaces have been selected for each medoid, average Manhattan segmental distance is used to assign points to medoids, forming clusters. (3) Refinement phase: Compute a new list of relevant dimensions for each medoid based on the clusters formed and reassign points to medoids, removing outliers. The distance-based approach of PROCLUS is biased toward clusters that are hyperspherical in shape. PROCLUS is actually somewhat faster than CLIQUE due to the sampling of large data sets; however, using a small number of representative points can cause PROCLUS to miss some clusters entirely.

An oriented projected cluster (ORCLUS) looks for non–axis-parallel subspaces [64]. Similarly to References 3 and 14, this method is based on the observation that many data sets contain interattribute correlations. The algorithm can be divided into three steps. (1) Assign phase: Assign data points to the nearest cluster centers. (2) Subspace determination: Redefine the subspace associated with each cluster by calculating the covariance matrix for a cluster and selecting the orthonormal eigenvectors with the smallest eigenvalues. (3) Merge phase: Merge clusters that are near each other and have similar directions of least spread.

The Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting (FINDIT) uses dimension-oriented distance (DOD) to count the number of dimensions in which two points are within a threshold distance of each other, based on the assumption that in higher dimensions, it is more meaningful for two points to be close in several dimensions rather than in a few [65]. The algorithm has three phases. (1) Sampling phase: Select two small sets generated through random sampling to determine initial representative medoids of the clusters. (2) Cluster forming phase: Find correlated dimensions using the DOD measure for each medoid. (3) Assignment phase: Assign instances to medoids based on the subspaces found. FINDIT employs sampling techniques like the other top-down algorithms to improve performance with very large data sets.

The δ-clusters algorithm uses a distance measure to capture the coherence exhibited by a subset of instances on a subset of attributes [66]. Coherent instances may not be close, but instead, both follow a similar trend, offset from each other. One coherent instance can be derived from another by shifting by an offset. The algorithm starts with initial seeds and iteratively improves the overall quality of the clustering by randomly swapping attributes and data points to improve individual clusters. Residue measures the decrease in coherence that a particular entry (attribute or instance) brings to the cluster. The iterative process terminates when individual improvement levels off in each cluster.

Clustering on Subsets of Attributes (COSA) assigns weights to each dimension for each instance, not each cluster [67]. Starting with equally weighted dimensions, the algorithm examines the k-NNs of each instance. These neighborhoods are used to calculate the respective dimension weights for each instance, with higher weights assigned to those dimensions that have a smaller dispersion within the k-NN group. These weights are used to calculate dimension weights for pairs of instances, which are used to update the distances used in the k-NN calculation. The process is repeated using the new distances until the weights stabilize. The neighborhoods for each instance become increasingly enriched with instances belonging to its own cluster. The dimension weights are refined as the dimensions relevant to a cluster receive larger weights. The output is a distance matrix based on weighted inverse exponential distance and is suitable as input to any distance-based clustering method. After clustering, the weights for each dimension of cluster members are compared, and an overall importance value for each dimension is calculated for each cluster.

There are many others, which are reviewed in Reference 61. The multilevel Mahalanobis-based dimensionality reduction (MMDR) [68] clusters high-dimensional data sets using the low-dimensional subspace based on the Mahalanobis rather than the Euclidean distance, since it is argued that locally correlated clusters are elliptical shaped. The Mahalanobis distance uses different coefficients according to the dimension [8].

3.4 Steps in Building an Index for k-NN Queries

The following steps were followed in References 3 and 14, which combine SVD and clustering, to build the data structures for carrying out k-NN queries.

  1. Selecting an objective function
  2. Selecting the number of clusters
  3. Partitioning the data set of points
  4. Rotating each partition into an uncorrelated frame of reference
  5. Reducing the dimensionality of the partitions
  6. Constructing the within-cluster index

A detailed description of the steps of building an index for k-NN queries is as follows.

  1. CSVD supports three alternative objective functions: (1) the index space compression, (2) the NMSE (defined in Equation 3.5), and (3) desired recall.

    The compression of the index space is defined as the ratio of the original size of the database (equal to N · M) to the size of its CSVD description, which is specified as follows:

    V = N \cdot H + \sum_{h=1}^{H} (N \cdot p_h + m_h \cdot p_h), (3.8)

    where mh is the number of points, ph is the number of dimensions retained in the hth cluster, and H is the number of clusters. The term N · H accounts for the space required by the centroids, while N · ph is the space occupied by the projection matrix and mh · ph is the space required by the projected points.

  2. The system selects the number of clusters H as a function of the database size and of the objective function. Alternatively, the desired number of clusters can be specified by the user. Some clustering algorithms determine the desired number of clusters.
  3. The partitioning step divides the rows of the database table X into H clusters X(h), h = 1, …, H, containing m1, …, mH points, each consisting of vectors that are close to each other. CSVD supports a wide range of classical clustering methods, such as the Linde–Buzo–Gray (LBG) algorithm [69] (the default option in Reference 3) and the k-means method [34]. For each cluster, this step also yields the centroid μ(h) and the cluster radius R(h), defined as the distance between the centroid and the farthest point of the cluster.
  4. The rotation step is carried out separately for each of the clusters X(h), h = 1, …, H. The corresponding centroid μ(h) is subtracted from each group of vectors X(h), and the eigenvector matrix V(h) and the eigenvalues λ1(h), …, λN(h) are computed as described in Section 3.2. Eigenvalues are ordered by decreasing magnitude. The vectors of X(h) are rotated in the uncorrelated reference frame, having the eigenvectors as coordinate axes, to produce X̃(h).
  5. The dimensionality reduction step is a global procedure, which depends on the selected objective function. The products of the eigenvalues λi(h) and the number of points mh of the corresponding clusters are computed and sorted in ascending order to produce an ordered list of H · N elements; each element j of the list contains the eigenvalue–cluster-size product, the label κj of the corresponding cluster, and the dimension ∂j associated with the eigenvalue. The list is walked starting from its head. During step j, the ∂j-th dimension of the κj-th cluster is discarded. The process ends when the target value of the objective function is reached (a sketch of steps 3 through 5 follows this list).

    The NMSE for a clustered data set is given as follows:

    \text{NMSE} = \frac{\sum_{h=1}^{H} \sum_{i=1}^{m_h} \sum_{j=n_h+1}^{N} \left( y_{i,j}^{(h)} \right)^2}{\sum_{h=1}^{H} \sum_{i=1}^{m_h} \sum_{j=1}^{N} \left( y_{i,j}^{(h)} \right)^2} = \frac{\sum_{h=1}^{H} m_h \sum_{j=n_h+1}^{N} \lambda_j^{(h)}}{\sum_{h=1}^{H} m_h \sum_{j=1}^{N} \lambda_j^{(h)}}. (3.9)

    Index space compression: The index volume is computed as successive dimensions in the transformed space are removed.

    TNMSE: The NMSE is recomputed as successive dimensions are omitted, until the target NMSE (TNMSE) is reached.

    Recall: As dimensions are omitted, an experiment is run to determine the recall with a sufficiently large number of sample queries [3].

  6. The within-cluster index-construction step operates separately on each cluster Y(h). Following data reduction, each cluster is much more amenable to efficient indexing than the entire table. The OP-tree proposed in Reference 2 and utilized in Reference 3 is very efficient for k-NN search in medium- and even high-dimensionality spaces.
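A highly simplified sketch of steps 3 through 5, referenced from the list above, is given below. It uses k-means for partitioning (a stand-in for the LBG default of Reference 3), prunes dimensions globally by the smallest eigenvalue–cluster-size products, and stops when the target NMSE of Equation 3.9 is reached. All function and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the LBG default in Reference 3

def csvd_build(X, H, target_nmse):
    """Cluster X, rotate each cluster with its own SVD, then drop dimensions
    (smallest eigenvalue * cluster-size first) until the NMSE reaches target_nmse."""
    labels = KMeans(n_clusters=H, n_init=10, random_state=0).fit_predict(X)
    clusters, candidates = [], []            # candidates: (lambda * m_h, cluster h, dimension j)
    for h in range(H):
        Xh = X[labels == h]
        mu = Xh.mean(axis=0)
        U, S, Vt = np.linalg.svd(Xh - mu, full_matrices=False)
        lam = S ** 2 / len(Xh)               # eigenvalues, in nonincreasing order
        Y = (Xh - mu) @ Vt.T                 # rotated (uncorrelated) coordinates
        clusters.append({"mu": mu, "V": Vt.T, "Y": Y, "lam": lam,
                         "n_kept": len(lam), "m": len(Xh)})
        candidates += [(lam[j] * len(Xh), h, j) for j in range(len(lam))]
    candidates.sort()                        # ascending eigenvalue-cluster-size products
    total = sum(c for c, _, _ in candidates)
    dropped = 0.0
    for c, h, j in candidates:               # walk the list, discarding dimensions
        if (dropped + c) / total > target_nmse:
            break
        dropped += c
        clusters[h]["n_kept"] -= 1           # dimension j of cluster h is discarded
    for cl in clusters:                      # keep only the retained leading dimensions
        cl["Y"] = cl["Y"][:, :cl["n_kept"]]
    return clusters

# Example: 3 clusters, keep enough dimensions for NMSE <= 0.05.
X = np.random.default_rng(2).normal(size=(600, 16))
clusters = csvd_build(X, H=3, target_nmse=0.05)
print([c["n_kept"] for c in clusters])
```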

3.5 Nearest Neighbors Queries in High-Dimensional Space

Queries on high-dimensional data can be classified as follows [70]: (1) point queries to determine if an object with a specified feature vector is available in the database, (2) range queries to determine if objects within a certain distance are available in the database, (3) k-NN queries, and (4) spatial joins, such as finding all closest pairs of points.

Similarity between two objects U and V, represented by the vectors u = [u1, …, uN]T and v = [v1, …, vN]T, can be measured by their Euclidean distance:

D(u,v) = \left[ (u-v)^T (u-v) \right]^{1/2} = \left[ \|u\|^2 + \|v\|^2 - 2 u^T v \right]^{1/2} = \left[ \sum_{i=1}^{N} (u_i - v_i)^2 \right]^{1/2}. (3.10)

Given precomputed vector norms, we need to compute only the inner product of the two vectors.
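The identity in Equation 3.10 is what makes precomputing norms worthwhile: with ‖u‖² stored for every database point, only the inner product with the query must be computed per candidate. A small sketch (illustrative names, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10000, 64))        # database of M points, N features
sq_norms = (X ** 2).sum(axis=1)         # precomputed once: ||x_i||^2 for every point

def squared_distances(q):
    """Squared Euclidean distances from query q to all rows of X,
    using ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v (Equation 3.10)."""
    return sq_norms + (q ** 2).sum() - 2.0 * (X @ q)

q = rng.normal(size=64)
d2 = squared_distances(q)
# Same result as the direct computation (up to floating-point error):
assert np.allclose(d2, ((X - q) ** 2).sum(axis=1))
print("nearest index:", int(d2.argmin()))
```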

Similarity search based on k-NN queries, which determine the k images that are closest to a target image, is expensive when the number of dimensions per image (N) is in the hundreds and the number of images (M) is in the millions. The k-NNs can be determined by a sequential scan of an M × N matrix X. Retrieved points are inserted into a heap of size k, as long as they are no farther than the current kth nearest point found so far. Determining k-NN queries in high dimensions is not a trivial task, due to “the curse of dimensionality” [71].

As dimensionality increases, all points begin to appear almost equidistant from one another. They are effectively arranged in a thin d-dimensional spherical shell around the query, no matter where the query (point) is located. The radius of the shell increases with dimensionality, while its width remains unchanged, so that the space close to the query point is nearly empty [71].

Three oddities associated with high-dimensional spaces are as follows:

  1. A 2^d partition of a unit hypercube has a limited number of regions for small d, but with d = 100, there are 2^100 ≈ 10^30 regions, so that even with billions of points, almost all of the regions are empty.
  2. A 0.95 × 0.95 square enclosed in a 1 × 1 square occupies 90.25% of its area, while for 100 dimensions, 0.95^100 ≈ 0.00592.
  3. A circle with radius 0.5 enclosed in a unit square occupies 3.14159/4 = 78.54% of its area. For d = 40, the volume of the hypersphere is 3.278 × 10^−21. Almost all the space is in the corners, and approximately 3 × 10^20 uniformly distributed points are required to expect one point inside the sphere (a quick check follows this list).
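The figures in item 3 can be reproduced with the standard formula for the volume of a d-dimensional ball of radius r, V = π^(d/2) r^d / Γ(d/2 + 1) (a quick check, assuming SciPy for the gamma function):

```python
from math import pi
from scipy.special import gamma

def ball_volume(d, r=0.5):
    """Volume of a d-dimensional ball of radius r inscribed in the unit hypercube."""
    return pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

print(ball_volume(2))       # 0.7853...  (78.54% of the unit square)
print(ball_volume(40))      # about 3.28e-21
print(1 / ball_volume(40))  # roughly 3e20 uniform points needed to expect one inside
```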

Nearest neighbor queries follow a branch-and-bound algorithm. When the search begins, the feasible region of the search problem is the entire space, and the partition is given by the clustering. The target function (the distance of the kth neighbor) is lower-bounded by the distance from the clusters; during the search, the target function is upper-bounded by the distance of the current kth neighbor, and the feasible regions are pruned by implicitly discarding the clusters having a distance larger than the running upper bound.

The following steps are used in carrying out the k-NN queries:

  1. Preprocessing and primary cluster identification.
  2. Computation of distances from clusters: The distance between the preprocessed query point q and cluster i is defined as

    max{D(q, μ(i)) − R(i), 0}.

    The clusters are sorted in increasing order of distance, with the primary cluster in first position.

  3. Searching the primary cluster: This step produces a list of k results, in increasing order of distance; let dmax be the distance of the farthest point in the list.
  4. Searching the other clusters: If the distance to the next cluster does not exceed dmax, then the cluster is searched; otherwise, the search is terminated. While searching a cluster, if points closer to the query than dmax are found, they are added to the list of k current best results, and dmax is updated.

    Postprocessing takes advantage of the lower-bounding property (LBP) [8] (lemma 1 in Reference 72):

  5. Given that the distance between points in the subspace with reduced dimensions is less than or equal to the original distance, a range query guarantees no false dismissals.

False alarms are discarded by referring to the original data set. Noting the relationship between range and k-NN queries, the latter can be processed as follows [72]: (1) Find the k-NNs of the query point Q in the subspace. (2) Determine the farthest actual distance to Q among these k points (dmax). (3) Issue a range query centered on Q on the subspace with radius ε = dmax. (4) For all points obtained in this manner, find their original distances to Q by referring to the original data set and rank the points, that is, select the k-NNs.
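A sketch of this exact postprocessing on a single flat (unclustered) reduced data set is shown below; Y holds the reduced-dimensionality points and X the original ones, and brute-force scans stand in for the index. Names are illustrative.

```python
import numpy as np

def exact_knn_reduced(q_full, q_red, X, Y, k):
    """Exact k-NN using a reduced-space k-NN query followed by a range query (LBP)."""
    # (1) k-NN of the query in the reduced subspace.
    d_red = np.linalg.norm(Y - q_red, axis=1)
    cand = np.argsort(d_red)[:k]
    # (2) Farthest *original* distance among these k candidates.
    d_max = np.linalg.norm(X[cand] - q_full, axis=1).max()
    # (3) Range query with radius d_max in the reduced subspace (no false dismissals).
    in_range = np.where(d_red <= d_max)[0]
    # (4) Rank the range-query results by their original distances; keep the k nearest.
    d_orig = np.linalg.norm(X[in_range] - q_full, axis=1)
    return in_range[np.argsort(d_orig)[:k]]

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 32))
V = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2].T
Y = (X - X.mean(axis=0)) @ V[:, :8]          # keep 8 of 32 rotated dimensions
q = rng.normal(size=32)
q_red = (q - X.mean(axis=0)) @ V[:, :8]
result = exact_knn_reduced(q, q_red, X, Y, k=10)
brute = np.argsort(np.linalg.norm(X - q, axis=1))[:10]
assert set(result) == set(brute)             # exact answer despite lossy reduction
print(sorted(result))
```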

The exact k-NN processing method was extended to multiple clusters in Reference 30, where we compare the CPU cost of the two methods as the NMSE is varied using two data sets. An offline experiment was used to determine k*, which yields a recall R ≈ 1. The CPU time required by the exact method for a sequential scan is lower than that of the approximate method, even when the recall of the approximate method is set as low as R = 0.8. This is attributable to the fact that the exact method issues a k-NN query only once, and this is followed by less costly range queries. Optimal data reduction is studied in the context of exact query processing [73].

RCSVD was explored in Reference 14 in the context of sequential data set scans. The OP-tree index [2], a main memory-resident index suited for k-NN queries, is utilized in Reference 3. Two methods to make this index persistent are described in References 74 and 75. We also proposed a new index, the stepwise dimensionality increasing (SDI) index, which is shown to outperform other indices [74].

3.6 Alternate Method Combining SVD and Clustering

The LDR method described in Reference 15 generates SVD-friendly clusters so that a higher dimensionality reduction can be attained for the same NMSE. The description of the LDR method in what follows is based on the appendix in Reference 3.

  1. Initial selection of centroids. H points with pairwise distances of at least threshold are selected.
  2. Clustering. Each point is associated with the closest centroid if its distance from it is less than ε and is ignored otherwise. The coordinates of the centroid are the average of the associated points.
  3. Apply SVD to each cluster. Each cluster now has a reference frame.
  4. Assign points to clusters. Each point in the data set is analyzed in the reference frame of each cluster. For each cluster, determine the minimum number of retained dimensions so that the squared error in representing the point does not exceed MaxReconDist². The cluster requiring the minimum number of dimensions Nmin is determined. If Nmin ≤ MaxDim, the point is associated with that cluster, and the required number of dimensions is recorded; otherwise, the point becomes an outlier.
  5. Compute the subspace dimensionality of each cluster. Find the minimum number of dimensions ensuring that, at most, FracOutliers of the points assigned to the cluster in step 4 do not satisfy the MaxReconDist² constraint. Each cluster now has a subspace.
  6. Recluster points. Each point is associated with the first cluster, whose subspace is at a distance less than MaxReconDist from the point.
  7. Remove small clusters. Clusters with size smaller than MinSize are removed and the corresponding points reclustered in step 6.
  8. Recursively apply steps 1–7 to the outlier set. Make sure that the centroids selected in step 1 have a distance at least threshold from the current valid clusters. The procedure terminates when no new cluster is identified.
  9. Create an outlier list. As in Reference 18, this is searched using a linear scan. Varying the NMSE may yield different compression ratios. The smaller the NMSE, the higher the number of points that cannot be approximated accurately.

The values for H, threshold, ε, MaxReconDist, MaxDim, FracOutliers, and MinSize need to be provided by the user. This constitutes the main difficulty with applying the LDR method. The invariant parameters are MaxReconDist, MaxDim, and MinSize. The method can produce values of H, threshold, and overall FracOutliers, which are significantly different from the inputs. The fact that the LDR method determines the number of clusters is one of its advantages with respect to the k-means method.

The LDR method also takes advantage of LBP [8,72] to attain exact query processing. It produces fewer false alarms by incorporating the reconstruction distance (the squared distance of all dimensions that have been eliminated) as an additional dimension with each point. It is shown experimentally in Reference 3 that RCSVD outperforms LDR in over 90% of cases.

3.7 Survey of High-Dimensional Indices

Multidimensional indexing has been an active research area in the last two decades [8,16,17,70,76]. Given the distinction between point access methods (PAMs) and spatial access methods (SAMs) [16], we are interested here in PAMs for high-dimensional data. Multidimensional indexing structures can be classified as data partitioning (DP), such as R-trees, similarity search (SS)-trees, and spherical–rectangular (SR)-trees, and space partitioning (SP), such as k-d-B-trees and holey Brick (hB) trees [70]. In DP, space is partitioned in a manner such that regions have an approximately equal number of points. SP divides space into nonoverlapping regions, so that any point in the space can then be identified to lie in exactly one of the regions. Another categorization is disk-resident versus main memory-resident indices. In the former (respectively latter) case, the number of disk pages accessed and CPU time are the major performance measures.

Quad trees [77] and k-d-trees [49] are two early main memory-resident indexing structures. k-d-B-trees combine k-d-trees and B-trees [78]. R-trees [79] were designed to handle spatial data on secondary storage. R+-trees [80] and R*-trees [81] are two improved versions of R-trees. SS-trees [5] use hyperspheres to partition the space, while SR-trees [82] use the intersection of hyperspheres and hyperrectangles to represent regions. SR-trees outperform both R*-trees and SS-trees.

It is known that the nearest neighbor search algorithms based on the Hjaltason and Samet (HS) method [70,83] outperform the Roussopoulos, Kelley, and Vincent (RKV) method [84]. This was also ascertained as part of the experimental studies on indexing structures in Reference 85. The RKV algorithm provided with the SR-tree code was replaced with the HS algorithm in this study. In computing distances, we used the method specified in Reference 73, which is more efficient than the method in Reference 18. This method is specified in Section 3.8.

With the increasing dimensionality of feature vectors, most multidimensional indices lose their effectiveness. The so-called dimensionality curse [8] is due to an increased overlap among the nodes of the index and a low fan-out, which results in increased index height. For a typical feature vector-based hierarchical index structure, each node corresponds to an 8-kilobyte (KB) page. Given a fixed page size S (minus space dedicated to bookkeeping), number of dimensions N, and s = sizeof(data type), the fan-out for different index structures is as follows:

Hyperrectangles: R*-trees consist of a hierarchy of hyperrectangles, with those at the higher levels embedding lower-level ones [81]. A hyperrectangle is specified uniquely by the lower left and upper right coordinates in N-dimensional space. Alternatively, the centroid and its distance from the N “sides” of the hyperrectangle may be specified. The cost is the same in both cases, so that the fan-out is F = S/(2 × N × s). Implementing an R*-tree as an SAM, rather than a PAM, represents points as hyperrectangles, doubling the cost of representing points.

Hyperspheres: SS-trees consist of a tree of hyperspheres [5], with each node embedding the nodes at the lower level. SS-trees have been shown to outperform R-trees. Each hypersphere is represented by its centroid in N-dimensional space and its radius, that is, the fan-out is F = S/((N + 1) × s).

Hyperspheres and hyperrectangles: SR-trees [82] combine SS-trees with R-trees, which encapsulate all of the points in the index. The region of the index is the intersection of the bounding hyperrectangle and the bounding hypersphere, which results in a significant reduction in the size of the region, since the radius of the hypersphere is determined by the distance of the farthest point from its centroid. The space requirement per node is the sum of the space requirements in SS-trees and R*-trees, so that the fan-out is F = S/((3 × N + 1)s).
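The fan-out expressions above translate directly into a few lines (a sketch; the page size, bookkeeping overhead, and data type size used here are assumptions for illustration):

```python
def fanouts(page_bytes=8192, bookkeeping=96, N=64, s=4):
    """Approximate fan-outs for an N-dimensional index with s-byte coordinates."""
    S = page_bytes - bookkeeping
    return {
        "R*-tree (hyperrectangles)": S // (2 * N * s),        # lower-left + upper-right corners
        "SS-tree (hyperspheres)":    S // ((N + 1) * s),      # centroid + radius
        "SR-tree (both)":            S // ((3 * N + 1) * s),  # rectangle + sphere
    }

for name, f in fanouts(N=64).items():
    print(f"{name}: fan-out ~ {f}")
```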

Hybrid trees, proposed in Reference 86, in conjunction with the LDR method discussed in Section 3.6, combine the positive aspects of DP and SP indexing structures into one. Experiments on “real” high-dimensional large-size feature databases demonstrate that the hybrid tree scales well to high dimensionality and large database sizes. It significantly outperforms both purely DP-based and SP-based index mechanisms as well as linear scan at all dimensionalities for large-sized databases.

X-trees combine linear and hierarchical structures using supernodes to avoid overlaps to attain improved performance for k-NN search for higher dimensions [87].

Pyramid trees [88] are not affected by the curse of dimensionality. The iMinMax(θ) [89] method, which maps points from high- to single-dimensional space, outperforms pyramid trees in experiments with range queries.

iDistance is an efficient method for k-NN search in a high-dimensional space [90,91]. Data are partitioned into several clusters, and each partition has a reference point, for example, the centroid of the cluster obtained using the k-means clustering method [40]. The data in each cluster are transformed into a single-dimensional space according to their similarity with respect to the reference point. The one-dimensional values of different clusters are disjoint. The one-dimensional space is indexed by a B+-tree, and k-NN search is implemented using range searches. The search starts with a small radius, which is increased step by step to form a bigger query sphere. iDistance is lossy, since multiple data points in the high-dimensional space may be mapped to the same value in the single-dimensional space. The SDI method is compared with the iDistance method in Reference 74.
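The core of the iDistance mapping can be sketched in a few lines: each point is assigned the key i·c + d(p, O_i), where O_i is its cluster's reference point and c is a constant large enough to keep the one-dimensional ranges of different clusters disjoint (the value of c and the brute-force assignment below are illustrative simplifications; the actual method indexes the keys with a B+-tree).

```python
import numpy as np

def idistance_keys(X, refs, c=1000.0):
    """Map each point to a 1-D key: cluster_id * c + distance to its reference point."""
    d = np.linalg.norm(X[:, None, :] - refs[None, :, :], axis=2)  # distances to all references
    cluster = d.argmin(axis=1)                                    # nearest reference point
    return cluster * c + d[np.arange(len(X)), cluster]            # disjoint 1-D ranges per cluster

rng = np.random.default_rng(5)
refs = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(10, 1, (5, 2))])
print(np.round(idistance_keys(X, refs), 2))   # keys near 0 for cluster 0, near c for cluster 1
```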

A vector approximation file (VA-file) represents each data object using the cell into which it falls [9]. Cells are defined by a multidimensional grid, where dimension i is partitioned 2^{b_i} ways. Due to the sparsity of high-dimensional space, it is very unlikely that several points share a cell. Nearest neighbor search sequentially scans the VA-file to determine the upper-bound and lower-bound distances from the query to each cell. This is followed by a refinement or filtering step; if the approximate lower bound of a cell is greater than the current upper bound, it is not considered further. Otherwise, it is a candidate. During the refinement step, all the candidates are sorted according to their lower-bound distances.

A metric tree transforms the feature vector space into a metric space and then indexes the metric space [92]. A metric space is a pair M = (D, d), where D is a domain of feature values and d is a distance function with the following properties:

  1. Symmetry d(Fx, Fy) = d(Fy, Fx)
  2. Nonnegativity d(Fx, Fy) > 0, FxFy
  3. Triangle inequality d(Fx, Fy) ≤ d(Fx, Fz) + d(Fz, Fy)

The search space is organized based on relative distances of objects, rather than their absolute positions in a multidimensional space.

A vantage point (VP)-tree [93] partitions a data set according to distances between the objects and a reference or vantage point. The corner point is chosen as the vantage point, and the median value of the distances is chosen as separating radius to partition the data set into two balanced subsets. The same procedure is applied recursively on each subset. The multiple vantage point (mvp)-tree [94] uses precomputed distances in the leaf nodes to provide further filtering during search operations. Both trees are built in a top-down manner; balance cannot be guaranteed during insertion and deletion. Costly reorganization is required to prevent performance degradation.

An M-tree is a paged metric-tree index [95]. It is balanced and able to deal with dynamic data. Leaf nodes of an M-tree store the feature vectors of the indexed objects Oj and distances to their parents, whereas internal nodes store routing objects Or, distances to their parents Op, covering radii r(Or), and corresponding covering tree pointers. The M-tree reduces the number of distance computations by storing distances.

The OMNI-family is a set of indexing methods based on the same underlying theory: for a spherical query with radius r around point Q and a specific focus F_i, all candidate points S_i lie between the lower radius ℓ_i = d(Q, F_i) − r and the upper radius u_i = d(Q, F_i) + r [96]. For multiple foci, the candidate set is the intersection of the S_i. Given a data set, a set of foci is first determined. For each point in the data set, the distance to each focus is calculated and stored. The search process can be applied to a sequential scan, B+-trees, or R-trees. For B+-trees, the distances for each focus F_i are indexed separately, a range query is performed on each index, and the intersection is obtained. For R-trees, the distances to all the foci, which form lower-dimensional data, are indexed together, and a single range query is performed.
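The filtering step can be sketched as follows, assuming the precomputed point-to-focus distances are held in a matrix and the foci are simply the first few data points (the OMNI methods select foci more carefully):

```python
import numpy as np

def omni_candidates(q, r, points, foci, foci_dists):
    """A point is a candidate only if, for every focus F_i, its stored distance
    d(P, F_i) lies in [d(Q, F_i) - r, d(Q, F_i) + r]; candidates are the
    intersection of these annuli, and only they get exact distance checks."""
    mask = np.ones(len(points), dtype=bool)
    for i, f in enumerate(foci):
        dqf = np.linalg.norm(q - f)
        mask &= (foci_dists[:, i] >= dqf - r) & (foci_dists[:, i] <= dqf + r)
    cand = np.where(mask)[0]
    hits = [j for j in cand if np.linalg.norm(q - points[j]) <= r]
    return hits, len(cand)

rng = np.random.default_rng(4)
X = rng.random((10000, 12))
foci = X[:3]                                                  # sketch: first 3 points as foci
foci_dists = np.linalg.norm(X[:, None] - foci[None], axis=2)  # precomputed and stored
hits, n_cand = omni_candidates(X[7], 0.4, X, foci, foci_dists)
print(len(hits), "results after checking", n_cand, "candidates of", len(X))
```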

A Δ-tree is a main memory-resident index structure that represents each level with a different number of dimensions [97]. The levels of the index represent the data space starting with a few dimensions and expanding to the full dimensionality, while keeping the fan-out fixed. The number of dimensions increases toward the leaf level, which holds the full dimensionality of the data. This is intended to minimize the cache miss ratio as the dimensionality of the feature vectors increases. At level ℓ ≥ 1, the number of dimensions is selected as the smallest n satisfying

\[
\sum_{k=1}^{n} \lambda_k \Big/ \sum_{k=1}^{N} \lambda_k \;\ge\; \min(p\,\ell,\ 1),
\]
where the λ_k are the eigenvalues in nonincreasing order and p is the fraction of variance added per level.

The nodes of the index increase in size from the highest level to the lowest level, and the tree may not be height balanced, but this is not a problem since the index is main memory resident. The index with shorter feature vector lengths attains a lower cache miss ratio, reduced cycles per instruction (CPI), and reduced CPU time [98].
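A small sketch of the per-level dimension selection implied by the formula above, assuming the eigenvalues λ_k are available from PCA/SVD of the data:

```python
import numpy as np

def dims_per_level(eigenvalues, p):
    """For each level l = 1, 2, ..., return the smallest n such that the first n
    eigenvalues account for at least min(p * l, 1) of the total variance."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # nonincreasing order
    frac = np.cumsum(lam) / lam.sum()
    dims, level = [], 1
    while True:
        target = min(p * level, 1.0)
        n = int(np.searchsorted(frac, target - 1e-12) + 1)
        dims.append(min(n, len(lam)))
        if target >= 1.0:                 # leaf level reaches full dimensionality
            return dims
        level += 1

# e.g., eigenvalues of a 16-dimensional data set with fast-decaying variance
lam = 0.7 ** np.arange(16)
print(dims_per_level(lam, p=0.4))         # dimensions used at levels 1, 2, 3, ...
```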

The telescopic vector (TV)-tree is a disk-resident index with nodes corresponding to disk pages [99]. TV-trees partition the data space using telescopic minimum bounding regions (TMBRs), which have telescopic vectors as their centers. These vectors can be contracted and extended dynamically by telescopic functions defined in Reference 99, only if they have the same number of active dimensions (α). Features are ordered using the KLT applied to the whole data set, so that the first few dimensions provide the most discrimination. The discriminatory power of the index is heavily affected by the value of the parameter α, which is difficult to determine. In case the number of levels is large, the tree will still suffer from the curse of dimensionality. The top levels of TV-trees have higher fan-outs, which results in reducing the input/output (I/O) cost for disk accesses. Experimental results on a data set consisting of dictionary words are reported in Reference 99.

The OP-tree index described in Reference 2 for efficient processing of k-NN queries recursively equipartitions the data points one dimension at a time in a round-robin manner until the points in a partition fit into a leaf node. The two key properties of OP-trees are ordering and partitioning: ordering structures the search space, and partitioning rejects unpromising regions without actual distance computations. A fast k-NN search algorithm that exploits this structure to tighten the distance bound is also described in Reference 2. Consider the following sample data set:

1: (1, 2, 5), 2: (3, 8, 7), 3: (9, 10, 8), 4: (12, 9, 2), 5: (8, 7, 20), 6: (6, 6, 23),

7: (0, 3, 27), 8: (2, 13, 9), 9: (11, 11, 15), 10: (14, 17, 13), 11: (7, 14, 12), 12: (10, 12, 3).

Partitioning on the first dimension, we split the points into three regions, R1(−∞, 3), R2(3, 9), R3(9, +∞).

R1 is partitioned into subregions R1,1(−∞, 2), R1,2(2, 8), R1,3(8, +∞);

R2 is partitioned into subregions R2,1(−∞, 4), R2,2(4, 7), R2,3(7, +∞); and

R3 is partitioned into subregions R3,1(−∞, 9), R3,2(9, 12), R3,3(12, +∞).

We next specify the points held by the partitions:

R1,1(1); R1,2(2, 7); R1,3(8);

R2,1(11); R2,2(5, 6); R2,3(3);

R3,1(4); R3,2(12, 9); and R3,3(10).
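A compact sketch of this round-robin equipartitioning applied to the sample points above follows; the fan-out and leaf capacity are parameters, and the split boundaries chosen here may differ slightly from the 1–2–1 splits shown in the example.

```python
import numpy as np

def op_partition(ids, X, dim, fanout, leaf_size):
    """Recursively equipartition point ids one dimension at a time,
    round-robin over dimensions, until a partition fits into a leaf node."""
    if len(ids) <= leaf_size:
        return sorted(ids)
    order = sorted(ids, key=lambda i: X[i - 1, dim])   # ids are 1-based as in the text
    chunks = np.array_split(np.array(order), fanout)   # (nearly) equal-sized partitions
    nxt = (dim + 1) % X.shape[1]
    return [op_partition([int(i) for i in c], X, nxt, fanout, leaf_size) for c in chunks]

# the 12 sample points of the example above (1-based ids as in the text)
pts = {1: (1, 2, 5), 2: (3, 8, 7), 3: (9, 10, 8), 4: (12, 9, 2), 5: (8, 7, 20),
       6: (6, 6, 23), 7: (0, 3, 27), 8: (2, 13, 9), 9: (11, 11, 15),
       10: (14, 17, 13), 11: (7, 14, 12), 12: (10, 12, 3)}
X = np.array([pts[i] for i in range(1, 13)], dtype=float)
tree = op_partition(list(pts), X, dim=0, fanout=3, leaf_size=2)
print(tree)   # three regions on the first dimension, each split again on the second
```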

The construction of a persistent semi-dynamic OP-tree index is presented in References 74 and 75. Serialization is used to compact the dynamically allocated nodes of the OP-tree, which form linked lists in main memory, into a contiguous area. The index can then be saved to disk as a single file and loaded into main memory with a single data transfer. This matters because the disk positioning time incurred when loading the index one page at a time is high and improving at a very slow rate, while disk transfer rates have improved significantly [11]. The prefetching performed by modern operating systems results in the whole index being loaded into main memory in experimental studies with small indices. The original OP-tree is static, so it has to be rebuilt in main memory as new points are added. We propose and compare the performance and space efficiency of several methods that support inserting points dynamically.
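The flattening idea can be sketched as follows; the persistent OP-tree of References 74 and 75 uses its own binary node layout, whereas this illustration simply replaces child pointers with array offsets so that the whole structure can be written and read back with one transfer.

```python
import numpy as np

def serialize(node, out):
    """Flatten a pointer-linked tree into a contiguous list, replacing child
    pointers with integer offsets, so it can be written with one transfer."""
    idx = len(out)
    out.append({'data': node['data'], 'children': []})
    for child in node.get('children', []):
        out[idx]['children'].append(serialize(child, out))
    return idx

def save(root, path):
    flat = []
    serialize(root, flat)
    np.save(path, np.array(flat, dtype=object), allow_pickle=True)  # one sequential write

def load(path):
    flat = np.load(path, allow_pickle=True)      # one sequential read of the whole index
    def rebuild(i):
        node = flat[i]
        return {'data': node['data'], 'children': [rebuild(j) for j in node['children']]}
    return rebuild(0)

root = {'data': 'R', 'children': [{'data': 'R1', 'children': []},
                                  {'data': 'R2', 'children': []}]}
save(root, '/tmp/optree.npy')
print(load('/tmp/optree.npy'))
```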

The SDI index uses a reduced number of dimensions at the higher levels of the tree to increase the fan-out, so that the tree height is reduced [100]. The number of dimensions increases level by level until full dimensionality is attained at the lower levels. The SDI index differs from the Δ-tree in that it is a disk-resident index with fixed node sizes, while the Δ-tree is a main memory-resident index with variable node sizes and a fixed fan-out. The SDI-tree differs from the TV-tree in that it uses a single parameter specifying the fraction of variance to be added at each level, without the risk of having a large number of active dimensions.
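The property that both the Δ-tree and the SDI index exploit is that, after the KLT/SVD rotation, the distance computed on the leading m dimensions is a lower bound on the full-dimensional distance, so subtrees can be pruned using partial distances. A brief illustration of this lower-bounding property (not the SDI node layout):

```python
import numpy as np

def partial_distance_lower_bound(q, y, m):
    """After the KLT rotation, the distance on the first m dimensions never
    exceeds the full-dimensional distance, so it can prune k-NN candidates."""
    return np.linalg.norm(q[:m] - y[:m])

rng = np.random.default_rng(5)
X = rng.random((1000, 32))
# rotate onto principal components so the leading dimensions carry most of the variance
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt.T
q, y = Y[0], Y[1]
for m in (4, 8, 16, 32):
    print(m, partial_distance_lower_bound(q, y, m), "<=", np.linalg.norm(q - y))
```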

SDI trees are compared to variance approximate median (VAM) SR-trees (VAMSR-trees) and VA-file indexes in Reference 100. The VAMSR-tree uses the same split algorithm as the VAMSplit R-tree [101] but is based on the SR-tree structure, which is built statically in a bottom-up manner. The data set is split recursively top down, using the dimension with the maximum variance and choosing a pivot that is approximately the median.

3.8 Conclusions

Similarity search via k-NN queries applied to feature vectors associated with images is a well-known paradigm in CBIR. Applying k-NN queries to a large data set, where the number of images is in the millions and the number of features is in the hundreds, is quite costly, so we have presented a combination of dimensionality reduction via SVD and related methods, clustering, and indexing to achieve higher efficiency for k-NN queries.

SVD eliminates dimensions with little variance, which has little effect on the outcome of k-NN queries; because the search becomes approximate, precision and recall are required to quantify this effect. Dimensionality reduction makes clustering and index construction more efficient. Both clustering and indexing reduce the cost of k-NN processing by reducing the number of points to be considered. Applying clustering before indexing allows more efficient dimensionality reduction, although it in effect adds an extra indexing level.

The memory-resident OP-tree index, which was used in our earlier studies, was made disk resident. Unlike R-tree-like indices, whose pages are loaded one at a time, the OP-tree index can be read from disk with one sequential access. The SDI index was also shown to outperform well-known indexing methods.

There are several interesting problems that are not covered in this discussion. The issue of adding points to the data set without the need for repeating the application of SVD is addressed in References 75 and 102. PCA can be applied to very large M × N data sets by computing the N × N covariance matrix after a single data pass. The eigenvectors computed by applying PCA to an N × N matrix can be used to compute the rotated matrix with a reduced number of dimensions. There is also the issue of applying SVD to very large data sets, which may be sparse [103,104].
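A sketch of this single-pass computation follows; the blocks stand in for chunks of a large data set read from disk.

```python
import numpy as np

def covariance_one_pass(chunks, n_dims):
    """Accumulate sum(x) and sum(x x^T) over one pass; derive the covariance."""
    s = np.zeros(n_dims)
    ss = np.zeros((n_dims, n_dims))
    m = 0
    for chunk in chunks:                  # each chunk is a block of rows read from disk
        s += chunk.sum(axis=0)
        ss += chunk.T @ chunk
        m += len(chunk)
    mean = s / m
    cov = ss / m - np.outer(mean, mean)
    return mean, cov

rng = np.random.default_rng(6)
blocks = [rng.random((10000, 20)) for _ in range(5)]   # stand-in for reading from disk
mean, cov = covariance_one_pass(blocks, 20)
eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending order
V = eigvecs[:, ::-1]                                   # principal components, descending
Y = (np.vstack(blocks) - mean) @ V[:, :5]              # rotated data, 5 retained dimensions
print(eigvals[::-1][:5], Y.shape)
```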

Figures and experimental results are not reported for the sake of brevity. The reader is referred to the referenced papers for figures, graphs, and tables.

Acknowledgments

The work reported here is based mainly on the author’s collaboration with colleagues Dr. Vittorio Castelli and Dr. Chung-Sheng Li at the IBM T. J. Watson Research Center in Yorktown Heights, New York, and PhD students Dr. Lijuan (Catherine) Zhang and Dr. Yue Li at the New Jersey Institute of Technology (NJIT), Newark, New Jersey.

References

1. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. H. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. “The QBIC project: Querying images by content, using color, texture, and shape.” In Proc. SPIE Vol. 1908: Storage and Retrieval for Image and Video Databases, San Jose, CA, January 1993, pp. 173–187.

2. B. Kim, and S. Park. “A fast k-nearest-neighbor finding algorithm based on the ordered partition.” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 8(6): 761–766 (1986).

3. V. Castelli, A. Thomasian, and C. S. Li. “CSVD: Clustering and singular value decomposition for approximate similarity search in high dimensional spaces.” IEEE Transactions on Knowledge and Data Engineering (TKDE) 14(3): 671–685 (2003).

4. C.-S. Li, and V. Castelli. “Deriving texture feature set for content-based retrieval of satellite image database.” In Proc. Int’l. Conf. on Image Processing (ICIP ’97), Santa Barbara, CA, October 1997, pp. 576–579.

5. D. A. White, and R. Jain. “Similarity indexing with the SS-tree.” In Proc. 12th IEEE Int’l. Conf. on Data Engineering (ICDE), New Orleans, LA, March 1996, pp. 516–523.

6. V. Castelli, and L. D. Bergman (editors). Image Databases: Search and Retrieval of Digital Imagery. John Wiley and Sons, New York, 2002.

7. B. S. Manjunath, and W.-Y. Ma. “Texture features for image retrieval.” In Image Databases: Search and Retrieval of Digital Imagery, V. Castelli, and L. D. Bergman (editors). Wiley-Interscience, 2002, pp. 313–344.

8. C. Faloutsos. Searching Multimedia Databases by Content (Advances in Database Systems). Kluwer Academic Publishers (KAP)/Elsevier, Burlingame, MA, 1996.

9. R. Weber, H.-J. Schek, and S. Blott. “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces.” In Proc. 24th Int’l. Conf. on Very Large Data Bases (PVLDB), New York, August 1998, pp. 194–205.

10. R. F. Freitas, and W. W. Wilcke. “Storage-class memory: The next storage system technology.” IBM Journal of Research and Development 52(4–5): 439–448 (2008).

11. B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers (MKP)/Elsevier, Burlington, MA, 2008.

12. I. T. Jolliffe. Principal Component Analysis. Springer, New York, 2002.

13. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes, 3rd Edition: The Art of Scientific Computing. Cambridge University Press (CUP), Cambridge, UK, 2007.

14. A. Thomasian, V. Castelli, and C. S. Li. “RCSVD: Recursive clustering and singular value decomposition for approximate high-dimensionality indexing.” In Proc. 7th ACM Int’l. Conf. on Information and Knowledge Management (CIKM), Baltimore, MD, November 1998, pp. 201–207.

15. K. Chakrabarti, and S. Mehrotra. “Local dimensionality reduction: A new approach to indexing high dimensional space.” In Proc. Int’l. Conf. on Very Large Data Bases (PVLDB), Cairo, Egypt, August 2000, pp. 89–100.

16. V. Gaede, and O. Günther. “Multidimensional access methods.” ACM Computing Surveys 30(2): 170–231 (1998).

17. V. Castelli. “Multidimensional indexing structures for content-based retrieval.” In Image Databases: Search and Retrieval of Digital Imagery, V. Castelli, and L. D. Bergman (editors). Wiley-Interscience, Hoboken, NJ, pp. 373–434.

18. F. Korn, H. V. Jagadish, and C. Faloutsos. “Efficiently supporting ad hoc queries in large datasets of time sequences.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Tucson, AZ, May 1997, pp. 289–300.

19. R. Ramakrishnan, and J. Gehrke. Database Management Systems, 3rd edition. McGraw-Hill, New York, 2003.

20. J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann Publishers (MKP)/Elsevier, Burlingame, MA, 2011.

21. W. Barbara, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. “The New Jersey data reduction report.” Data Engineering Bulletin 20(4): 3–42 (1997).

22. S. Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. MKP, 2008.

23. K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. “Approximate query processing using wavelets.” In Proc. Int’l. Conf. on Very Large Data Bases (PVLDB), Cairo, Egypt, August 2000, pp. 111–122.

24. Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA, 1975.

25. V. Poosala. “Histogram-based estimation techniques in databases.” PhD Thesis, Univ. of Wisconsin-Madison, Madison, WI, 1997.

26. F. Olken. “Random sampling from databases.” PhD Dissertation, University of California, Berkeley, CA, 1993.

27. J. M. Hellerstein, P. J. Haas, and H. J. Wang. “Online aggregation.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Tucson, AZ, May 1997, pp. 171–182.

28. A. Rajaraman, J. Leskovec, and J. Ullman. Mining of Massive Datasets, 1.3 edition. Cambridge University Press (CUP), Cambridge, UK, 2013. Available at http://infolab.stanford.edu/~ullman/mmds.html.

29. IBM Corp. Engineering and Scientific Subroutine Library (ESSL) for AIX V5.1, Guide and Reference. IBM Redbooks, Armonk, NY. SA23-2268-02, 07/2012.

30. A. Thomasian, Y. Li, and L. Zhang. “Exact k-NN queries on clustered SVD datasets.” Information Processing Letters (IPL) 94(6): 247–252 (2005).

31. J. A. Hartigan. Clustering Algorithms. John Wiley and Sons, New York, 1975.

32. A. K. Jain, and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Upper Saddle River, NJ, 1988.

33. L. Kaufman, and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

34. B. S. Everitt, S. Landau, M. Leese, and D. Stahl. Cluster Analysis, 5th edition. John Wiley and Sons, New York, 2011.

35. M. J. Zaki, and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press (CUP), Cambridge, UK, 2013.

36. H. P. Kriegel, P. Kröger, and A. Zimek. “Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering.” ACM Transactions on Knowledge Discovery from Data (TKDD) 3(1): 1–58 (2009).

37. C. Aggarwal, A. Hinneburg, and D. A. Keim. “On the surprising behavior of distance metrics in high dimensional spaces.” In Proc. Int’l. Conf. on Database Theory (ICDT), London, January 2001, pp. 420–434.

38. C. C. Aggarwal. “Re-designing distance functions and distance-based applications for high dimensional data.” ACM SIGMOD Record 30(1): 13–18 (2001).

39. J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981. Available at http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html.

40. F. Farnstrom, J. Lewis, and C. Elkan. “Scalability for clustering algorithms revisited.” ACM SIGKDD Explorations Newsletter 2(1): 51–57 (2000).

41. T. F. Gonzalez. “Clustering to minimize the maximum intercluster distance.” Theoretical Computer Science 38: 293–306 (1985).

42. A. McCallum, K. Nigam, and L. H. Unger. “Efficient clustering of high-dimensional data sets with applications to reference matching.” In Proc. 6th ACM Int’l. Conf. on Knowledge Discovery and Data Mining (KDD), Boston, August 2000, pp. 169–178.

43. R. T. Ng, and J. Han. “CLARANS: A method for clustering objects for spatial data mining.” IEEE Transactions on Knowledge and Data Engineering (TKDE) 14(5): 1003–1016 (2002).

44. T. Zhang, R. Ramakrishnan, and M. Livny. “BIRCH: An efficient data clustering method for very large databases.” In Proc. 25th ACM SIGMOD Int’l. Conf. on Management of Data, Montreal, Quebec, Canada, June 1996, pp. 103–114.

45. P. S. Bradley, U. M. Fayyad, and C. Reina. “Scaling clustering algorithms to large databases.” In Proc. Int’l. Conf. on Knowledge Discovery and Data Mining, New York, August 1998, pp. 9–15.

46. G. Karypis, E.-H. Han, and V. Kumar. “Chameleon: Hierarchical clustering using dynamic modeling.” IEEE Computer 32(8): 68–75 (1999).

47. B. W. Kernighan, and S. Lin. “An efficient heuristic procedure for partitioning graphs.” Bell Systems Technical Journal (BSTJ) 49: 291–308 (1970).

48. S. Guha, R. Rastogi, and K. Shim. “CURE: An efficient clustering algorithm for large databases.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Seattle, WA, June 1998, pp. 73–84.

49. J. L. Bentley. “Multidimensional binary search trees used for associative searching.” Communications of the ACM 18(9): 509–517 (1975).

50. V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J. C. French. “Clustering large datasets in arbitrary metric spaces.” In Proc. Int’l. Conf. on Data Engineering (ICDE), Sydney, Australia, March 1999, pp. 502–511.

51. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. “A density-based algorithm for discovering clusters in large spatial databases with noise.” In Proc. 2nd Int’l. Conf. Knowledge Discovery and Data Mining (KDD-96), Portland, OR, 1996, pp. 226–231.

52. M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. “OPTICS: Ordering points to identify the clustering structure.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Philadelphia, PA, June 1999, pp. 49–60.

53. W. Wang, J. Yang, and R. Muntz. “STING: A statistical information grid approach to spatial data mining.” In Proc. 23rd Int’l. Conf. Very Large Data Bases (VLDB), Athens, Greece, August 1997, pp. 186–195.

54. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. “Automatic subspace clustering of high dimensional data for data mining applications.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Seattle, WA, June 1998, pp. 94–105.

55. G. Sheikholeslami, S. Chatterjee, and A. Zhang. “WaveCluster: A multi-resolution clustering approach for very large spatial databases.” In Proc. 24th Int’l. Conf. Very Large Data Bases (VLDB), New York, August 1998, pp. 428–439.

56. A. Hinneburg, and D. A. Keim. “Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering.” In Proc. 25th Int’l. Conf. on Very Large Data Bases (PVLDB), Edinburgh, Scotland, UK, September 1999, pp. 506–517.

57. R. Agrawal, M. Mehta, J. Shafer, R. Srikant, A. Arning, and T. Bollinger. “The quest data mining system.” In Proc. 2nd Int’l. Conference on Knowledge Discovery in Databases and Data Mining (KDD), Portland, OR, 1996, pp. 244–249.

58. C.-H. Cheng, A. W. Fu, and Y. Zhang. “Entropy-based subspace clustering for mining numerical data.” In Proc. 5th ACM SIGKDD Int’l. Conf. on Knowledge Discovery and Data Mining (KDD), San Diego, CA, August 1999, pp. 84–93.

59. H. S. Nagesh, S. Goil, and A. Choudhary. “A scalable parallel subspace clustering algorithm for massive data sets.” In Proc. IEEE Int’l. Conf. on Parallel Processing (ICPP), Toronto, Ontario, August 2000, pp. 477–484.

60. J.-W. Chang, and D.-S. Jin. “A new cell-based clustering method for large, high-dimensional data in data mining applications.” In Proc. ACM Symposium on Applied Computing (SAC), Madrid, Spain, March 2002, pp. 503–507.

61. L. Parsons, E. Haque, and H. Liu. “Subspace clustering for high dimensional data: A review.” ACM SIGKDD Explorations Newsletter 6(1): 90–105 (2004).

62. C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. “A Monte Carlo algorithm for fast projective clustering.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Madison, WI, June 2002, pp. 418–427.

63. C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Precopiuc, and J. S. Park. “Fast algorithms for projected clustering.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Philadelphia, PA, June 1999, pp. 61–72.

64. C. C. Aggarwal, and P. S. Yu. “Finding generalized projected clusters in high dimensional spaces.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Dallas, May 2000, pp. 70–81.

65. K. G. Woo, J. H. Lee, M. H. Kim, and Y. J. Lee. “FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting.” Information Software Technology 46(4): 255–271 (2004).

66. J. Yang, W. Wang, H. Wang, and P. S. Yu. “δ-clusters: Capturing subspace correlation in a large data set.” In Proc. 18th Int’l. Conf. on Data Engineering (ICDE), San Jose, CA, February–March 2002, pp. 517–528.

67. J. H. Friedman, and J. J. Meulman. “Clustering objects on subsets of attributes.” 2002. Available at http://statweb.stanford.edu/~jhf/.

68. H. Jin, B. C. Ooi, H. T. Shen, C. Yu, and A. Y. Zhou. “An adaptive and efficient dimensionality reduction algorithm for high-dimensional indexing.” In Proc. 19th IEEE Int’l. Conf. on Data Engineering (ICDE), Bangalore, India, March 2003, pp. 87–98.

69. Y. Linde, A. Buzo, and R. Gray. “An algorithm for vector quantizer design.” IEEE Transactions on Communications 28(1): 84–95 (1980).

70. H. Samet. Foundations of Multidimensional and Metric Data Structure. Elsevier, 2007.

71. V. S. Cherkassky, J. H. Friedman, and H. Wechsler. From Statistics to Neural Networks: Theory and Pattern Recognition Applications. Springer-Verlag, New York, 1994.

72. F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. “Fast and effective retrieval of medical tumor shapes: Nearest neighbor search in medical image databases.” IEEE Transactions on Knowledge and Data Engineering (TKDE) 10(6): 889–904 (1998).

73. A. Thomasian, Y. Li, and L. Zhang. “Optimal subspace dimensionality for k-nearest-neighbor queries on clustered and dimensionality reduced datasets with SVD.” Multimedia Tools and Applications (MTAP) 40(2): 241–259 (2008).

74. A. Thomasian, and L. Zhang. “Persistent semi-dynamic ordered partition index.” The Computer Journal 49(6): 670–684 (2006).

75. A. Thomasian, and L. Zhang. “Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets.” Multimedia Tools Applications (MTAP) 38(2): 253–270 (2008).

76. C. Yu. High-Dimensional Indexing: Transformational Approaches to High-Dimensional Range and Similarity Searches. Lecture Notes in Computer Science (LNCS), Volume 2431. Springer-Verlag, New York, 2002.

77. R. A. Finkel, and J. L. Bentley. “Quad trees: A data structure for retrieval of composite keys.” Acta Informatica 4(1): 1–9 (1974).

78. J. T. Robinson. “The k-d-b tree: A search structure for large multidimensional dynamic indexes.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Ann Arbor, MI, April 1981, pp. 10–18.

79. A. Guttman. “R-trees: A dynamic index structure for spatial searching.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Boston, June 1984, pp. 47–57.

80. T. Sellis, N. Roussopoulos, and C. Faloutsos. “The R+-tree: A dynamic index for multi-dimensional objects.” In Proc. 13th Int’l. Conf. on Very Large Data Bases (VLDB), Brighton, England, September 1987, pp. 507–518.

81. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. “The R*-tree: An efficient and robust access method for points and rectangles.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Atlantic City, NJ, May 1990, pp. 322–331.

82. N. Katayama, and S. Satoh. “The SR-tree: An index structure for high-dimensional nearest neighbor queries.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Tucson, AZ, May 1997, pp. 369–380.

83. G. R. Hjaltason, and H. Samet. “Distance browsing in spatial databases.” ACM Transactions on Database Systems (TODS) 24(2): 265–318 (1999).

84. N. Roussopoulos, S. Kelley, and F. Vincent. “Nearest neighbor queries.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, San Jose, CA, June 1995, pp. 71–79.

85. L. Zhang. “High-dimensional indexing methods utilizing clustering and dimensionality reduction.” PhD Dissertation, Computer Science Department, New Jersey Institute of Technology (NJIT), Newark, NJ, May 2005.

86. K. Chakrabarti, and S. Mehrotra. “The hybrid tree: An index structure for high dimensional feature spaces.” In Proc. IEEE Int’l. Conf. on Data Engineering (ICDE), 1999, pp. 440–447.

87. S. Berchtold, D. A. Keim, and H. P. Kriegel. “The X-tree: An index structure for high-dimensional data.” In Proc. 22nd Int’l. Conf. on Very Large Data Bases (PVLDB), San Jose, CA, August 1996, pp. 28–39.

88. S. Berchtold, C. Böhm, and H.-P. Kriegel. “The pyramid-technique: Towards breaking the curse of dimensionality.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Seattle, WA, June 1998, pp. 142–153.

89. B. C. Ooi, K.-L. Tan, C. Yu, and S. Bressan. “Indexing the edges—A simple and yet efficient approach to high-dimensional indexing.” In Proc. 19th ACM Int’l. Symp. on Principles of Database Systems (PODS), Dallas, May 2000, pp. 166–174.

90. C. Yu, B. C. Ooi, K.-L. Tan, and H. Jagadish. “Indexing the distance: An efficient method to knn processing.” In Proc. 27th Int’l. Conf. on Very Large Data Bases (VLDB), Rome, Italy, September 2001, pp. 421–430.

91. H. V. Jagadish, B. C. Ooi, K. L. Tan, C. Yu, and R. Zhang. “iDistance: An adaptive B+-tree based indexing method for nearest neighbor search.” ACM Transactions on Database Systems (TODS) 30(2): 364–397 (2005).

92. J. K. Uhlmann. “Satisfying general proximity/similarity queries with metric trees.” Information Processing Letters (IPL) 40(4): 175–179 (1991).

93. P. N. Yianilos. “Data structures and algorithms for nearest neighbor search in general metric spaces.” In Proc. 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Austin, TX, January 1993, pp. 311–321.

94. T. Bozkaya, and M. Ozsoyoglu. “Distance-based indexing for high-dimensional metric spaces.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Tucson, AZ, May 1997, pp. 357–368.

95. P. Ciaccia, M. Patella, and P. Zezula. “M-tree: An efficient access method for similarity search in metric spaces.” In Proc. 23rd Int’l. Conf. on Very Large Data Bases (PVLDB), Athens, Greece, August 1997, pp. 426–435.

96. R. F. S. Filho, A. Traina, C. Traina Jr., and C. Faloutsos. “Similarity search without tears: The OMNI family of all-purpose access methods.” In Proc. 17th IEEE Int’l. Conf. on Data Engineering (ICDE), Heidelberg, Germany, April 2001, pp. 623–630.

97. B. Cui, B. C. Ooi, J. Su, and K.-L. Tan. “Indexing high-dimensional data for efficient in-memory similarity search.” IEEE Transactions on Knowledge and Data Engineering (TKDE) 17(3): 339–353 (2005).

98. J. L. Hennessy, and D. A. Patterson. Computer Architecture: A Quantitative Approach, 5th edition. Elsevier, Burlington, MA, 2011.

99. K. I. Lin, H. V. Jagadish, and C. Faloutsos. “The TV-tree: An index structure for high-dimensional data.” VLDB Journal 3(4): 517–542 (1994).

100. A. Thomasian, and L. Zhang. “The Stepwise Dimensionality-Increasing (SDI) index for high dimensional data.” The Computer Journal 49(5): 609–618 (2006).

101. D. A. White, and R. Jain. “Similarity indexing: Algorithms and performance.” In Storage and Retrieval for Image and Video Databases (SPIE), Volume 2670, San Jose, CA, 1996, pp. 62–73.

102. K. V. Ravikanth, D. Agrawal, and A. Singh. “Dimensionality-reduction for similarity searching in dynamic databases.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Seattle, WA, June 1998, pp. 166–176.

103. S. Rajamanickam. “Efficient algorithms for sparse singular value decomposition.” PhD Thesis, Computer Science Dept., University of Florida, Gainesville, FL, 2009. Available at http://www.cise.ufl.edu/~srajaman/Rajamanickam_S.pdf.

104. A. A. Amini. “High-dimensional principal component analysis.” PhD Thesis, Report Number UCB/EECS-2011-104, EECS Dept., University of California, Berkeley, CA, 2011. Available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-104.html.

Appendix 3.A: Computing Approximate Distances with Dimensionality-Reduced Data

The dimensionality reduction method in References 3 and 14 retains n < N columns of the Y matrix, which correspond to the largest eigenvalues, while the method described in References 8 and 18 retains the n columns of the U matrix corresponding to the largest eigenvalues. In both cases, the eigenvalues or singular values are in nonincreasing order, so that columns with the smallest indices need to be preserved. We first show that retaining the n columns of the Y matrix results in the same distance error as retaining the same dimensions of the U matrix. We also show that our method is computationally more efficient than the other method. Retaining the first n dimensions (columns) of the transformed matrix Y = XV, we have

\[
y_{i,j} = \sum_{k=1}^{N} x_{i,k}\, v_{k,j}, \qquad 1 \le j \le n,\ 1 \le i \le M.
\]

The rest of the columns of the matrix are set to the column mean, which is zero in our case. (Some data sets are studentized by dividing the zero-mean column by the standard deviation [3].) The squared error in representing point Pi, 1 ≤ iM is then

\[
e_i = \sum_{j=n+1}^{N} y_{i,j}^2 = \sum_{j=n+1}^{N} \Bigl( \sum_{k=1}^{N} x_{i,k}\, v_{k,j} \Bigr)^{2}, \qquad 1 \le i \le M.
\]

The elements of X are approximated as X′ ≈ USVᵀ by utilizing the first n dimensions of U [8,18]:

\[
x'_{i,j} = \sum_{k=1}^{n} s_k\, u_{i,k}\, v_{k,j}, \qquad 1 \le j \le N,\ 1 \le i \le M.
\]

Noting that Y = XV = US, the squared error in representing Pi is

\[
e'_i = \sum_{j=1}^{N} \Bigl( \sum_{k=n+1}^{N} s_k\, u_{i,k}\, v_{k,j} \Bigr)^{2} = \sum_{j=1}^{N} \Bigl( \sum_{k=n+1}^{N} y_{i,k}\, v_{k,j} \Bigr)^{2}.
\]

To show that e′_i = e_i, 1 ≤ i ≤ M, we use the Kronecker delta, defined as δ_{k,k′} = 1 for k = k′ and δ_{k,k′} = 0 for k ≠ k′.

\[
e'_i = \sum_{j=1}^{N} \Bigl( \sum_{k=n+1}^{N} y_{i,k}\, v_{k,j} \Bigr) \Bigl( \sum_{k'=n+1}^{N} y_{i,k'}\, v_{k',j} \Bigr)
     = \sum_{k=n+1}^{N} \sum_{k'=n+1}^{N} y_{i,k}\, y_{i,k'} \sum_{j=1}^{N} v_{k,j}\, v_{k',j}
     = \sum_{k=n+1}^{N} \sum_{k'=n+1}^{N} y_{i,k}\, y_{i,k'}\, \delta_{k,k'}
     = \sum_{j=n+1}^{N} y_{i,j}^2 = e_i,
\]
where the second step exchanges the order of summation and the third uses the orthonormality of V (VVᵀ = I).

Our method is more efficient from the viewpoint of computational cost for nearest neighbor, range, and window queries. In the case of nearest neighbor queries, once the input or query vector is transformed and the appropriate n dimensions are selected, we need only be concerned with these dimensions. This is less costly than applying the inverse transformation.
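The identity can also be checked numerically; the following sketch verifies that the per-point error from discarding the last N − n columns of Y equals the error of the rank-n reconstruction built from U, S, and V.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((200, 10))
X = X - X.mean(axis=0)                    # zero-mean columns, as in the text
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt.T                              # Y = XV = US
n = 4

# e_i: squared norm of the discarded columns of Y
e = (Y[:, n:] ** 2).sum(axis=1)

# e'_i: squared error of the rank-n approximation X' = U_n S_n V_n^T
Xp = U[:, :n] @ np.diag(S[:n]) @ Vt[:n]
e_prime = ((X - Xp) ** 2).sum(axis=1)

print(np.allclose(e, e_prime))            # True: the two errors coincide
```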
