7
High-Dimensional Computational Geometry
Alexandr Andoni
CONTENTS
7.1 Introduction
7.2 Dimension Reduction
7.3 Space Partitions
    7.3.1 From LSH to NNS
    7.3.2 LSH Families
        7.3.2.1 Bit Sampling (For Hamming Distance)
        7.3.2.2 Grid Partitioning (For $\ell_1$ Distance)
        7.3.2.3 Random Projection (For Euclidean and $\ell_p$ Distances)
        7.3.2.4 Ball Carving (For Euclidean Distance)
        7.3.2.5 Min-Hash and Sim-Hash
    7.3.3 Optimality of LSH
    7.3.4 Data-Dependent Space Partitions
    7.3.5 The $\ell_\infty$ Norm
7.4 Embeddings
    7.4.1 Embeddings of Norms
    7.4.2 Embeddings of Specialized Metrics
7.5 Sketching
    7.5.1 Sketch Constructions
7.6 Small Dimension
7.7 Conclusion
References
7.1 Introduction
Consider the following common similarity search scenario. We are given a new image, and we
need to determine whether it is similar to some image in the existing database, preprocessed
beforehand. How do we capture the notion of similar? The standard approach is to use a
suitable distance metric. For example, we can represent a 20 × 20-pixel image as a 400-
dimensional vector, one coordinate per pixel. Then we can measure the (dis)similarity using
the Euclidean distance in $\mathbb{R}^{400}$.
This problem, called nearest neighbor search (NNS), admits a straightforward solution—
scanning the entire database—but it is prohibitive for databases of modern sizes. Instead, it
is imperative to obtain a solution with run-time that is sublinear in the database size. The
bad news is that this problem suffers from the curse of dimensionality: all known solutions degrade
exponentially fast with the dimension. In fact, the problem is not believed to have such
sublinear solutions. The good news is that vast improvements are possible in both theory
and practice, once we settle for approximate answers. In practice, approximate answers
prove to be sufficient even if one looks for exact answers: for many natural datasets, there
are only a few entries that are close to being an optimal answer (false positives), and thus we
can efficiently filter them.
The above is just an example of how high-dimensional geometry emerges in big data
questions. In general, the geometry is not necessarily Euclidean, and indeed many interesting
questions appear for distances such as Hamming, earthmover distance, and edit distance
(Levenshtein distance), to name just a few. This geometric perspective confers a number of
benefits. First, it brings in our geometric intuition to argue about sets of objects as sets of
points. Second, the geometric perspective allows us to bring in powerful tools from
areas such as metric embeddings and functional analysis.
In this chapter, we survey the techniques of high-dimensional computational
geometry through the prism of the NNS problem. From a broader perspective, NNS is just
one example of the questions arising when dealing with massive datasets. Yet these questions
often benefit from common solution concepts that already surface when considering the NNS
problem. In fact, oftentimes, NNS itself serves as a building block for the solutions of the
other questions. Hence, we focus on NNS to have a consistent storyline.
We will start by looking at the aforementioned setting: high-dimensional Euclidean
space, denoted by $\mathbb{R}^d$, where $d$ is the dimension. This is perhaps the most basic high-
dimensional setting. In general, we expect to see a trade-off between the complexity of
the geometric structure and the algorithmic efficiency. We now give the exact definition of
the approximate nearest neighbor problem.
Definition 7.1 (c-approximate nearest neighbor) Given a set $P$ of $n$ points in a $d$-dimensional space $\mathbb{R}^d$, and approximation factor $c > 1$, construct a data structure such that, given any query point $q$, the data structure returns some point $p$ such that $d(q, p) \le c \cdot \min_{p' \in P} d(q, p')$.
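As a baseline, here is a minimal sketch (in Python with NumPy; the function name is illustrative) of the straightforward linear scan mentioned above: it returns the exact nearest neighbor, which trivially satisfies the definition for any $c > 1$, but it costs $O(nd)$ time per query.

```python
import numpy as np

def linear_scan_nn(P, q):
    """Exact nearest neighbor by scanning all n points: O(nd) per query.

    P: (n, d) array of dataset points, q: (d,) query point.
    Returns the point p in P minimizing the Euclidean distance d(q, p),
    which is a valid answer to c-approximate NNS for every c > 1.
    """
    dists = np.linalg.norm(P - q, axis=1)   # all n distances at once
    return P[np.argmin(dists)]
```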
7.2 Dimension Reduction
If high dimensionality presents a problem, why not reduce it? Indeed, the high-dimensional
geometry 101 technique is to map the points into a space of smaller dimension k, while
preserving the distances. The most classic result is the Johnson–Lindenstrauss (JL) lemma
[66]. It says that a projection onto a random $k$-dimensional subspace suffices, provided $k$ is
large enough.
Lemma 7.1 (JL [66]) Let $A$ be the projection onto a random $k$-dimensional subspace of $\mathbb{R}^d$, scaled by $\sqrt{d/k}$. Then, for any fixed $\epsilon > 0$ and points $x, y \in \mathbb{R}^d$, we have that
$$\Pr_A\left[ \frac{\|Ax - Ay\|}{\|x - y\|} \in (1 - \epsilon, 1 + \epsilon) \right] \ge 1 - e^{-\Omega(\epsilon^2 k)}.$$
An important consequence is that we can argue about distances among a set of $n$ points.
In particular, setting $k = \Theta(C\epsilon^{-2}\log n)$, we obtain that a fixed distance is preserved with probability as high as $1 - 1/n^C$, for an arbitrarily large constant $C$. By a union bound over the $n^2$ pairwise difference vectors, all pairwise distances of the $n$ points are preserved with probability at least $1 - n^{-C+2}$.
In fact, this is the most common usage of the lemma.
Another crucial aspect of the above map is that it is oblivious, that is, independent of
the set of points. The alternative, a nonoblivious map, would have to first look at the n
points before constructing the actual map A. The advantage of an oblivious map is that we
can apply it to a new point (sometimes called out-of-sample extension).
These properties are already enough to speed up NNS. Indeed, we can pick a projection
$A$ as above, and then store the points $Ap$, where $p \in P$, for a given dataset $P$. On query $q$,
we compute the projection $Aq$, and then select the point $p$ that minimizes $\|Aq - Ap\|$.
The correctness of the algorithm is guaranteed by the JL lemma (applied to the point-set
$P \cup \{q\}$). The run-time of the query procedure is $O(nk) + O(dk)$ for $k = O(\epsilon^{-2}\log n)$. Note
that the second term corresponds to computing the projection $Aq$. The query time is an
improvement over the naïve $O(nd)$ bound for large $d$.
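A minimal sketch of this pipeline, assuming a Gaussian random matrix scaled by $1/\sqrt{k}$ as the map $A$ (one of the standard choices discussed below); the constant in the choice of $k$ is only illustrative.

```python
import numpy as np

def jl_map(d, k, rng):
    # One common choice of A: i.i.d. Gaussian entries scaled by 1/sqrt(k),
    # so that ||Ax - Ay|| concentrates around ||x - y||.
    return rng.normal(size=(k, d)) / np.sqrt(k)

def build_jl_index(P, eps, rng):
    n, d = P.shape
    k = int(np.ceil(4 * np.log(n) / eps**2))  # k = O(eps^-2 log n); constant is illustrative
    A = jl_map(d, k, rng)
    return A, P @ A.T                         # store the projected points Ap

def jl_query(A, AP, P, q):
    Aq = A @ q                                        # O(dk) to project the query
    i = np.argmin(np.linalg.norm(AP - Aq, axis=1))    # O(nk) scan in the reduced space
    return P[i]

# usage: with high probability the answer is an approximate nearest neighbor
rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 400))
A, AP = build_jl_index(P, eps=0.5, rng=rng)
p = jl_query(A, AP, P, P[17] + 0.01 * rng.normal(size=400))
```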
Following the original JL lemma, many natural extensions arose. Below are a few that
have drawn particular attention in the theoretical computer science literature:
Can we improve the target dimension k?
In general, the answer is no. Jayram and Woodruff [63] show any oblivious map must
use dimension $k = \Omega(\epsilon^{-2}\log n)$. Even for nonoblivious maps, Alon [5] shows that
$k = \Omega\left((\epsilon^{-2}/\log(1/\epsilon))\log n\right)$ is required. See also [77].
However, such dimension reduction may be possible for a point-set with additional
structure. For example, if the point-set lies on a line (in $\mathbb{R}^d$), a much smaller $k$ will
suffice. We give an example in Section 7.6.
Are there dimension reductions for other geometric concepts besides distances?
It turns out that the JL lemma may be used for other concepts, such as preserving $l$-flats
(for $l \le k$) or volumes of $l$-tuples of points [79]. In general, such applications require a
higher target dimension $k$.
Are there dimension reductions for other spaces besides Euclidean?
In general, the answer is no. For example, even for the Manhattan ($\ell_1$) space, it has been
proved that dimension reduction for $n$ points, preserving distances up to a factor $c > 1$,
must use $n^{\Omega(1/c^2)}$ dimensions [25]. More generally, Johnson and Naor [68] show that any
space satisfying a JL-like dimension reduction must be close to a Euclidean space.
However, there is an alternative concept of dimension reduction that can be applied to
spaces such as $\ell_1$, as we will see in Section 7.5.
Can we choose a more efficient map? For example, can we compute $Ap$ faster than in
$O(dk)$ time?
Early results show alternative ways to choose $A$. For example, we can take $A$ to have
entries drawn independent, identically distributed (i.i.d.) from a Gaussian distribution
[42,60], or even random $\pm 1$ entries [2].
More surprisingly, we can pick $A$ such that the computation of the map $Ap$ takes time
$\ll kd$. Such techniques, called fast JL transforms, are often based on the (fast) Hadamard
transform. In particular, the first such construction, from [3], shows that we can set
$A = PHD$, where $D$ is a (random) diagonal matrix, $H$ is the Hadamard transform, and
$P$ is a (random) sparse projection matrix. Another approach is to use a carefully
designed sparse matrix $A$ [40]. See also [4,69,72,85].
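To illustrate the Hadamard-based approach, below is a minimal sketch of a subsampled randomized Hadamard transform, a simplified relative of the $A = PHD$ construction of [3] in which the sparse projection $P$ is replaced by plain coordinate sampling; it assumes $d$ is a power of two, and the normalization and sampling details are only indicative.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^m array, in O(d log d) operations."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def srht_map(x, k, rng):
    """Apply y ~ P H D x: random sign flips (D), Hadamard mixing (H),
    then sample k coordinates (a simplified stand-in for the sparse P of [3])."""
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)     # D: random diagonal of +-1
    z = fwht(signs * x)                          # H D x, computed in O(d log d) time
    idx = rng.choice(d, size=k, replace=False)   # P: subsample k coordinates
    return z[idx] / np.sqrt(k)                   # scaled so that E[||y||^2] = ||x||^2
```

In contrast to a dense $k \times d$ matrix-vector product, the Hadamard step costs $O(d\log d)$ regardless of $k$.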
7.3 Space Partitions
Back to the NNS problem, how can we obtain a sublinear query time? Note that even with
the dimension reduction, the query time of the NNS algorithm from above is linear in n.The
next important concept in high-dimensional geometry is space partitions, represented via a
hash function $h : \mathbb{R}^d \to U$ for some discrete set $U$. These lead to sublinear time solutions
as we will see next.
What properties would we need from such hash functions? We would like that close
pairs of points collide (are in the same part), whereas far pairs of points do not collide (are
in different parts). While such a strong guarantee is impossible (think of points near
the boundary of a part), it becomes attainable in a randomized sense. This leads to
the notion of locality-sensitive hash (LSH) functions, defined as follows.
Definition 7.2 (LSH) For a positive real $r > 0$ and $c > 1$, a distribution $\mathcal{H}$ over hash functions $h : \mathbb{R}^d \to U$ is called $(r, cr, P_1, P_2)$-sensitive if, for any two points $p, q \in \mathbb{R}^d$, we have the following:

If $\|p - q\| \le r$, then $\Pr_{h \in \mathcal{H}}[h(q) = h(p)] \ge P_1$.
If $\|p - q\| \ge cr$, then $\Pr_{h \in \mathcal{H}}[h(q) = h(p)] \le P_2$.

In order for an LSH family to be useful, it has to satisfy $P_1 > P_2$.
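As a concrete illustration, here is a minimal sketch of one classical family, bit sampling for the Hamming distance (covered in Section 7.3.2.1): a hash function outputs a single random coordinate of the point, so two points $p, q \in \{0,1\}^d$ collide with probability exactly $1 - \mathrm{dist}_H(p,q)/d$, giving $P_1 = 1 - r/d$ and $P_2 = 1 - cr/d$.

```python
import numpy as np

class BitSamplingHash:
    """One hash function from the bit-sampling family for Hamming distance.

    For p, q in {0,1}^d, Pr[h(p) == h(q)] = 1 - Hamming(p, q)/d,
    so the family is (r, cr, 1 - r/d, 1 - cr/d)-sensitive.
    """
    def __init__(self, d, rng):
        self.i = rng.integers(d)   # pick one random coordinate

    def __call__(self, p):
        return p[self.i]
```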
Note that the definition is designed for a fixed scale of distances, that is, we care only
about distances $r$ (close) versus distances $cr$ (far). We can use LSH to solve a threshold
version of the nearest neighbor problem, called the $r$-near neighbor problem, defined as
follows:
Definition 7.3 (c-approximate r-near neighbor) Fix some threshold $r > 0$ and approximation factor $c > 1$. Construct a data structure on a point-set $P$ of $n$ points in a $d$-dimensional space $\mathbb{R}^d$, such that, given any query point $q$, if $P$ contains some point $p^*$ within distance $r$ of $q$, the data structure reports some point $p \in P$ at distance at most $cr$ from $q$, with probability at least 0.9.
We can use an algorithm for the near neighbor problem to solve the nearest neighbor
problem as well [53]. Hence we will focus on the former, the near neighbor problem, in the
rest of the section.
7.3.1 From LSH to NNS
An LSH family $\mathcal{H}$ can be used to design an efficient algorithm for approximate near neighbor
search. The algorithm is essentially one (or more) hash table(s). To build one hash table, we
choose a hash function $h \in \mathcal{H}$ and hash each point $p \in P$ into the bucket $h(p)$. Because the
total number of buckets may be large, we retain only the nonempty buckets by resorting to
(standard) hashing of the values $h(p)$ (see [35] for more details on hashing).
To process a query $q$, we look up the bucket $h(q)$ and report the first $cr$-near neighbor
found there. Note that, if there exists some $r$-near neighbor $p$, then with probability at least
$P_1$ it is present in the bucket $h(q)$, and hence some point is reported. In general, the bucket
contains other points as well.
Thus, one hash table achieves a probability of success of $P_1$, while using $O(n)$ space. We
can increase the probability of success to, say, 0.9 by using $L = O(1/P_1)$ independent hash
tables, each with its own hash function chosen i.i.d. from $\mathcal{H}$.
Note that, for one hash table, the query run-time depends on the number of far points
that collide with the query $q$. There are at most $P_2 n$ such points in expectation. The overall
query time is composed of two terms: (1) $L = O(1/P_1)$ computations of a hash function
$h \in \mathcal{H}$ and (2) $O(P_2 n \cdot L)$ distance computations (in expectation). Hence, it is enough to
obtain $P_2 = 1/n$. We will usually think of a hash function computation as taking $O(d)$ time,
comparable to the time for distance computation.
Thus, the main LSH performance question becomes: what is the highest $P_1$ achievable for
$P_2 \le 1/n$? Usually, we obtain such an LSH family by amplifying a base family. In particular,
given a base $(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$, we can construct a new family $\mathcal{G}$ of hash
functions $g = (h_1, \ldots, h_k)$, where the $h_i \in \mathcal{H}$ are chosen independently. The new family $\mathcal{G}$ is
$(r, cr, P_1, P_2)$-sensitive, where $P_1 = p_1^k$ and $P_2 = p_2^k$. For $k = \lceil \log_{1/p_2} n \rceil$, we obtain $P_2 = p_2^k \le 1/n$ as
desired. At the same time, $P_1 = p_1^k \ge p_1^{\log_{1/p_2} n + 1} = n^{-\rho} \cdot p_1$, where $\rho = (\log 1/p_1)/(\log 1/p_2)$.
Note that $\rho$ is invariant under the amplification process, that is, this LSH parameter of the
amplified family $\mathcal{G}$ is equal to that of the base family $\mathcal{H}$.
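As a small worked example of this parameter setting (the base values $p_1 = 0.9$ and $p_2 = 0.5$ are illustrative):

```python
import math

def lsh_parameters(p1, p2, n):
    """Amplify a base (r, cr, p1, p2)-sensitive family: returns (k, L, rho).

    k is chosen so that P2 = p2^k <= 1/n; then P1 = p1^k >= n^(-rho) * p1,
    and L = O(n^rho / p1) tables give a constant success probability.
    """
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))  # k = ceil(log_{1/p2} n)
    L = math.ceil(n ** rho / p1)                   # constant factor omitted
    return k, L, rho

# example: p1 = 0.9, p2 = 0.5, n = 10**6  ->  k = 20, L = 10, rho ~ 0.15
print(lsh_parameters(0.9, 0.5, 10**6))
```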
The parameter $\rho = (\log 1/P_1)/(\log 1/P_2)$ is thus the main measure of quality of an LSH
family or, equivalently, a (randomized) space partitioning. We summarize the above in the
following theorem, proved in the work of Indyk and Motwani [60] who introduced the LSH
scheme. We also sketch the entire algorithm in Figure 7.1.
Theorem 7.1 Fix $d, n \ge 1$, threshold $r > 0$, and approximation $c > 1$. Suppose there
exists some $(r, cr, P_1, P_2)$-sensitive LSH family. Then there exists an algorithm for the
$c$-approximate $r$-near neighbor problem using $O(n^{1+\rho}/P_1 + nd)$ storage and preprocessing
time. The query time is composed of $O(n^{\rho}/P_1 \cdot \log_{1/P_2} n)$ computations of a hash function
and $O(n^{\rho}/P_1)$ distance computations.
Because the query time includes the time to evaluate a hash function from the LSH family,
it depends on the actual choice of the LSH family. Furthermore, the stated space bound
does not include the space to store the description of the hash functions (which is typically
smaller than the other terms).
7.3.2 LSH Families
To complete the algorithm, we need to construct some actual LSH families. A number of
LSH families have been proposed over the years. As mentioned above, the main quality
parameter is $\rho$ (as a function of $c$). The second important parameter is the run-time to
evaluate a hash function. We present a few important families below, and refer to [12] for a
more comprehensive survey of the LSH families.
Preprocessing:
(1) Choose $L$ functions $g_i$, $i = 1, \ldots, L$, by setting $g_i = (h_{i,1}, h_{i,2}, \ldots, h_{i,k})$, where $h_{i,1}, \ldots, h_{i,k}$ are chosen at random from the LSH family $\mathcal{H}$.
(2) Construct $L$ hash tables, where, for each $j = 1, \ldots, L$, the $j$th hash table contains the dataset points hashed using the function $g_j$.

Query algorithm for a query point $q$:
For each $i = 1, 2, \ldots, L$:
(1) Retrieve the points from the bucket $g_i(q)$ in the $i$th hash table.
(2) For each of the retrieved points, compute the distance from $q$ to it. If the point is a correct answer, report it and stop.

FIGURE 7.1
Preprocessing and query algorithms based on LSH. For an $(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$,
we set $k = \lceil \log_{1/p_2} n \rceil$ and $L = O(n^{\rho}/p_1)$, where $\rho = (\log 1/p_1)/(\log 1/p_2)$.
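Finally, a minimal end-to-end sketch of the scheme in Figure 7.1, assuming bit sampling over $\{0,1\}^d$ (Section 7.3.2.1) as the base family $\mathcal{H}$, so that distances are Hamming distances; the class name and the parameter values in the usage example are illustrative rather than tuned via the formulas in the caption.

```python
import numpy as np
from collections import defaultdict

def hamming(p, q):
    return int(np.sum(p != q))

class LSHNearNeighbor:
    """c-approximate r-near neighbor search following Figure 7.1,
    with bit sampling over {0,1}^d as the base LSH family."""

    def __init__(self, P, r, c, k, L, rng):
        self.P, self.r, self.c = P, r, c
        n, d = P.shape
        # Preprocessing (1): choose L functions g_i = (h_{i,1}, ..., h_{i,k}),
        # where each h is a random coordinate (bit sampling).
        self.coords = rng.integers(d, size=(L, k))
        # Preprocessing (2): L hash tables, the ith keyed by g_i(p).
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx in range(n):
            for i in range(L):
                self.tables[i][tuple(P[idx, self.coords[i]])].append(idx)

    def query(self, q):
        # Query: for each table, scan the bucket g_i(q) and report the
        # first point within distance c*r (a correct answer), if any.
        for i, table in enumerate(self.tables):
            for idx in table.get(tuple(q[self.coords[i]]), []):
                if hamming(self.P[idx], q) <= self.c * self.r:
                    return self.P[idx]
        return None  # fails with small probability even if an r-near neighbor exists

# usage sketch: in practice k and L are set as in the caption of Figure 7.1
rng = np.random.default_rng(0)
P = rng.integers(0, 2, size=(1000, 128))
index = LSHNearNeighbor(P, r=10, c=2, k=15, L=20, rng=rng)
q = P[3].copy()
q[:5] ^= 1                      # a query at Hamming distance 5 from P[3]
print(index.query(q) is not None)
```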