7
High-Dimensional Computational Geometry
Alexandr Andoni
CONTENTS
7.1 Introduction
7.2 Dimension Reduction
7.3 Space Partitions
    7.3.1 From LSH to NNS
    7.3.2 LSH Families
        7.3.2.1 Bit Sampling (For Hamming Distance)
        7.3.2.2 Grid Partitioning (For $\ell_1$ Distance)
        7.3.2.3 Random Projection (For Euclidean and $\ell_p$ Distances)
        7.3.2.4 Ball Carving (For Euclidean Distance)
        7.3.2.5 Min-Hash and Sim-Hash
    7.3.3 Optimality of LSH
    7.3.4 Data-Dependent Space Partitions
    7.3.5 The $\ell_\infty$ Norm
7.4 Embeddings
    7.4.1 Embeddings of Norms
    7.4.2 Embeddings of Specialized Metrics
7.5 Sketching
    7.5.1 Sketch Constructions
7.6 Small Dimension
7.7 Conclusion
References
7.1 Introduction
Consider the following common similarity search scenario. We are given a new image, and we
need to determine whether it is similar to some image in the existing database, preprocessed
beforehand. How do we capture the notion of similar? The standard approach is to use a
suitable distance metric. For example, we can represent a 20 × 20-pixel image as a 400-
dimensional vector, one coordinate per pixel. Then we can measure the (dis)similarity using
the Euclidean distance in $\mathbb{R}^{400}$.
This problem, called nearest neighbor search (NNS), admits a straightforward solution—
scanning the entire database—but it is prohibitive for databases of modern sizes. Instead, it
is imperative to obtain a solution with run-time that is sublinear in the database size. The
bad news is that this problem suffers from the curse of dimensionality: all known solutions degrade
exponentially fast with the dimension. In fact, the problem is not believed to have such
sublinear solutions. The good news is that vast improvements are possible in both theory
and practice, once we settle for approximate answers. In practice, approximate answers
prove to be sufficient even if one looks for exact answers: for many natural datasets, there
are only a few entries that are close to being an optimal answer (false positives), and thus we
can efficiently filter them.
The above is just an example of how high-dimensional geometry emerges in big data
questions. In general, the geometry is not necessarily Euclidean, and indeed many interesting
questions appear for distances such as Hamming, earthmover distance, and edit distance
(Levenshtein distance), to name just a few. This geometric perspective confers a number of
benefits. First, it brings in our geometric intuition to argue about sets of objects as sets of
points. Second, the geometric perspective allows us to bring in powerful tools from
areas such as metric embeddings and functional analysis.
In this chapter, we survey the techniques of high-dimensional computational
geometry through the prism of the NNS problem. From a broader perspective, NNS is just
one example of the questions arising when dealing with massive datasets. Yet these questions
often benefit from common solution concepts that already surface when considering the NNS
problem. In fact, oftentimes, NNS itself serves as a building block for the solutions of the
other questions. Hence, we focus on NNS to have a consistent storyline.
We will start by looking at the aforementioned setting: high-dimensional Euclidean
space, denoted by $\mathbb{R}^d$, where $d$ is the dimension. This is perhaps the most basic high-
dimensional setting. In general, we expect to see a trade-off between the complexity of
the geometric structure and the algorithmic efficiency. We now give the exact definition of
the approximate nearest neighbor problem.
Definition 7.1 (c-approximate nearest neighbor) Given a set $P$ of $n$ points in a $d$-dimensional space $\mathbb{R}^d$, and approximation factor $c > 1$, construct a data structure such that, given any query point $q$, the data structure returns some point $p$ such that $d(q, p) \le c \cdot \min_{p' \in P} d(q, p')$.
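As a baseline, here is a minimal sketch (in Python with NumPy; the function name is illustrative) of the straightforward linear scan mentioned above: it returns the exact nearest neighbor, which trivially satisfies the definition for any $c > 1$, but it costs $O(nd)$ time per query.

```python
import numpy as np

def linear_scan_nn(P, q):
    """Exact nearest neighbor by scanning all n points: O(nd) per query.

    P: (n, d) array of dataset points, q: (d,) query point.
    Returns the point p in P minimizing the Euclidean distance d(q, p),
    which is a valid answer to c-approximate NNS for every c > 1.
    """
    dists = np.linalg.norm(P - q, axis=1)   # all n distances at once
    return P[np.argmin(dists)]
```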
7.2 Dimension Reduction
If high dimensionality presents a problem, why not reduce it? Indeed, the high-dimensional
geometry 101 technique is to map the points into a space of smaller dimension k, while
preserving the distances. The most classic result is the Johnson–Lindenstrauss (JL) lemma
[66]. It says that a projection onto a random $k$-dimensional subspace suffices, provided $k$ is
large enough.
Lemma 7.1 (JL [66]) Let $A$ be the projection onto a random $k$-dimensional subspace of $\mathbb{R}^d$, scaled by $\sqrt{d/k}$. Then, for any fixed $\epsilon > 0$ and points $x, y \in \mathbb{R}^d$, we have that
$$\Pr_A\left[ \frac{\|Ax - Ay\|}{\|x - y\|} \in (1 - \epsilon, 1 + \epsilon) \right] \ge 1 - e^{-\Omega(\epsilon^2 k)}.$$
An important consequence is that we can argue about distances among a set of $n$ points.
In particular, setting $k = \Theta(C\epsilon^{-2}\log n)$, we obtain that a fixed distance is preserved with probability as high as $1 - 1/n^C$, for an arbitrarily large constant $C$. By a union bound over the $n^2$ pairwise difference vectors, all pairwise distances of the $n$ points are preserved with probability at least $1 - n^{-C+2}$.
In fact, this is the most common usage of the lemma.
Another crucial aspect of the above map is that it is oblivious, that is, independent of
the set of points. The alternative, a nonoblivious map, would have to first look at the n
points before constructing the actual map A. The advantage of an oblivious map is that we
can apply it to a new point (sometimes called out-of-sample extension).
These properties are already enough to speed up NNS. Indeed, we can pick a projection
$A$ as above, and then store the points $Ap$, where $p \in P$, for a given dataset $P$. On query $q$,
we compute the projection $Aq$, and then select the point $p$ that minimizes $\|Aq - Ap\|$.
The correctness of the algorithm is guaranteed by the JL lemma (applied to the point-set
$P \cup \{q\}$). The run-time of the query procedure is $O(nk) + O(dk)$ for $k = O(\epsilon^{-2}\log n)$. Note
that the second term corresponds to computing the projection $Aq$. The query time is an
improvement over the naïve $O(nd)$ bound for large $d$.
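A minimal sketch of this pipeline, assuming a Gaussian random matrix scaled by $1/\sqrt{k}$ as the map $A$ (one of the standard choices discussed below); the constant in the choice of $k$ is only illustrative.

```python
import numpy as np

def jl_map(d, k, rng):
    # One common choice of A: i.i.d. Gaussian entries scaled by 1/sqrt(k),
    # so that ||Ax - Ay|| concentrates around ||x - y||.
    return rng.normal(size=(k, d)) / np.sqrt(k)

def build_jl_index(P, eps, rng):
    n, d = P.shape
    k = int(np.ceil(4 * np.log(n) / eps**2))  # k = O(eps^-2 log n); constant is illustrative
    A = jl_map(d, k, rng)
    return A, P @ A.T                         # store the projected points Ap

def jl_query(A, AP, P, q):
    Aq = A @ q                                        # O(dk) to project the query
    i = np.argmin(np.linalg.norm(AP - Aq, axis=1))    # O(nk) scan in the reduced space
    return P[i]

# usage: with high probability the answer is an approximate nearest neighbor
rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 400))
A, AP = build_jl_index(P, eps=0.5, rng=rng)
p = jl_query(A, AP, P, P[17] + 0.01 * rng.normal(size=400))
```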
Following the original JL lemma, many natural extensions arose. Below are a few that
have drawn particular attention in the theoretical computer science literature:
Can we improve the target dimension k?
In general, the answer is no. Jayram and Woodruff [63] show any oblivious map must
use dimension $k = \Omega(\epsilon^{-2}\log n)$. Even for nonoblivious maps, Alon [5] shows that
$k = \Omega\left((\epsilon^{-2}/\log(1/\epsilon))\log n\right)$ is required. See also [77].
However, such dimension reduction may be possible for a point-set with additional
structure. For example, if the point-set lies on a line (in $\mathbb{R}^d$), a much smaller $k$ will
suffice. We give an example in Section 7.6.
Are there dimension reductions for other geometric concepts besides distances?
It turns out that the JL lemma may be used for other concepts, such as preserving $l$-flats
(for $l \le k$) or volumes of $l$-tuples of points [79]. In general, such applications require a
higher target dimension $k$.
Are there dimension reductions for other spaces besides Euclidean?
In general, the answer is no. For example, even for the Manhattan ($\ell_1$) space, it has been
proved that dimension reduction for $n$ points, preserving distances up to a factor $c > 1$,
must use $n^{\Omega(1/c^2)}$ dimensions [25]. More generally, Johnson and Naor [68] show that any
space satisfying a JL-like dimension reduction must be close to a Euclidean space.
However, there is an alternative concept of dimension reduction that can be applied to
spaces such as $\ell_1$, as we will see in Section 7.5.
Can we choose a more efficient map? For example, can we compute $Ap$ faster than in
$O(dk)$ time?
Early results show alternative ways to choose $A$. For example, we can take $A$ to have
entries drawn independent, identically distributed (i.i.d.) from a Gaussian distribution
[42,60], or even random $\pm 1$ entries [2].
More surprisingly, we can pick $A$ such that the computation of the map $Ap$ takes time
$\ll kd$. Such techniques, called fast JL transforms, are often based on the (fast) Hadamard
transform. In particular, the first such construction, from [3], shows that we can set
$A = PHD$, where $D$ is a (random) diagonal matrix, $H$ is the Hadamard transform, and
$P$ is a (random) sparse projection matrix. Another approach is to use a carefully
designed sparse matrix $A$ [40]. See also [4,69,72,85].
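To illustrate the Hadamard-based approach, below is a minimal sketch of a subsampled randomized Hadamard transform, a simplified relative of the $A = PHD$ construction of [3] in which the sparse projection $P$ is replaced by plain coordinate sampling; it assumes $d$ is a power of two, and the normalization and sampling details are only indicative.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^m array, in O(d log d) operations."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def srht_map(x, k, rng):
    """Apply y ~ P H D x: random sign flips (D), Hadamard mixing (H),
    then sample k coordinates (a simplified stand-in for the sparse P of [3])."""
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)     # D: random diagonal of +-1
    z = fwht(signs * x)                          # H D x, computed in O(d log d) time
    idx = rng.choice(d, size=k, replace=False)   # P: subsample k coordinates
    return z[idx] / np.sqrt(k)                   # scaled so that E[||y||^2] = ||x||^2
```

In contrast to a dense $k \times d$ matrix-vector product, the Hadamard step costs $O(d\log d)$ regardless of $k$.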
7.3 Space Partitions
Back to the NNS problem, how can we obtain a sublinear query time? Note that even with
the dimension reduction, the query time of the NNS algorithm from above is linear in n.The
next important concept in high-dimensional geometry is space partitions, represented via a
hash function $h : \mathbb{R}^d \to U$ for some discrete set $U$. These lead to sublinear time solutions
as we will see next.
What properties would we need from such hash functions? We would like that close
pairs of points collide (are in the same part), whereas far pairs of points do not collide (are
in different parts). While such a strong guarantee is impossible (think of points near
the boundary of a part), it becomes attainable in a randomized sense. This leads to
the notion of locality-sensitive hash (LSH) functions, defined as follows.
Definition 7.2 (LSH) For a positive real $r > 0$ and $c > 1$, a distribution $\mathcal{H}$ over hash functions $h : \mathbb{R}^d \to U$ is called $(r, cr, P_1, P_2)$-sensitive if, for any two points $p, q \in \mathbb{R}^d$, we have the following:

If $\|p - q\| \le r$, then $\Pr_{h \in \mathcal{H}}[h(q) = h(p)] \ge P_1$.
If $\|p - q\| \ge cr$, then $\Pr_{h \in \mathcal{H}}[h(q) = h(p)] \le P_2$.

In order for an LSH family to be useful, it has to satisfy $P_1 > P_2$.
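As a concrete illustration, here is a minimal sketch of one classical family, bit sampling for the Hamming distance (covered in Section 7.3.2.1): a hash function outputs a single random coordinate of the point, so two points $p, q \in \{0,1\}^d$ collide with probability exactly $1 - \mathrm{dist}_H(p,q)/d$, giving $P_1 = 1 - r/d$ and $P_2 = 1 - cr/d$.

```python
import numpy as np

class BitSamplingHash:
    """One hash function from the bit-sampling family for Hamming distance.

    For p, q in {0,1}^d, Pr[h(p) == h(q)] = 1 - Hamming(p, q)/d,
    so the family is (r, cr, 1 - r/d, 1 - cr/d)-sensitive.
    """
    def __init__(self, d, rng):
        self.i = rng.integers(d)   # pick one random coordinate

    def __call__(self, p):
        return p[self.i]
```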
Note that the definition is designed for a fixed scale of distances, that is, we care only
about distances $r$ (close) versus distances $cr$ (far). We can use LSH to solve a threshold
version of the nearest neighbor problem, called the $r$-near neighbor problem, defined as
follows:
Definition 7.3 (c-approximate r-near neighbor) Fix some threshold $r > 0$ and approximation factor $c > 1$. Construct a data structure on a point-set $P$ of $n$ points in a $d$-dimensional space $\mathbb{R}^d$, such that, given any query point $q$, if $P$ contains some point $p^*$ within distance $r$ of $q$, the data structure reports some point $p \in P$ at distance at most $cr$ from $q$, with probability at least 0.9.
We can use an algorithm for the near neighbor problem to solve the nearest neighbor
problem as well [53]. Hence we will focus on the former, the near neighbor problem, in the
rest of the section.
7.3.1 From LSH to NNS
An LSH family $\mathcal{H}$ can be used to design an efficient algorithm for approximate near neighbor
search. The algorithm is essentially one (or more) hash table(s). To build one hash table, we
choose a hash function $h \in \mathcal{H}$ and hash each point $p \in P$ into the bucket $h(p)$. Because the
total number of buckets may be large, we retain only the nonempty buckets by resorting to
(standard) hashing of the values $h(p)$ (see [35] for more details on hashing).
To process a query $q$, we look up the bucket $h(q)$ and report the first $cr$-near neighbor
found there. Note that, if there exists some $r$-near neighbor $p$, then with probability at least
$P_1$ it is present in the bucket $h(q)$, and hence some point is reported. In general, the bucket
contains other points as well.
Thus, one hash table achieves a probability of success of $P_1$, while using $O(n)$ space. We
can increase the probability of success to, say, 0.9 by using $L = O(1/P_1)$ independent hash
tables, each with its own hash function chosen i.i.d. from $\mathcal{H}$.
Note that, for one hash table, the query run-time depends on the number of far points
that collide with the query $q$. There are at most $P_2 n$ such points in expectation. The overall
query time is composed of two terms: (1) $L = O(1/P_1)$ computations of a hash function
$h \in \mathcal{H}$ and (2) $O(P_2 n \cdot L)$ distance computations (in expectation). Hence, it is enough to
obtain $P_2 = 1/n$. We will usually think of a hash function computation as taking $O(d)$ time,
comparable to the time for distance computation.
Thus, the main LSH performance question becomes: what is the highest $P_1$ achievable for
$P_2 \le 1/n$? Usually, we obtain such an LSH family by amplifying a base family. In particular,
given a base $(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$, we can construct a new family $\mathcal{G}$ of hash
functions $g = (h_1, \ldots, h_k)$, where the $h_i \in \mathcal{H}$ are chosen independently. The new family $\mathcal{G}$ is
$(r, cr, P_1, P_2)$-sensitive, where $P_1 = p_1^k$ and $P_2 = p_2^k$. For $k = \lceil \log_{1/p_2} n \rceil$, we obtain $P_2 = p_2^k \le 1/n$ as
desired. At the same time, $P_1 = p_1^k \ge p_1^{\log_{1/p_2} n + 1} = n^{-\rho} \cdot p_1$, where $\rho = (\log 1/p_1)/(\log 1/p_2)$.
Note that $\rho$ is invariant under the amplification process, that is, this LSH parameter of the
amplified family $\mathcal{G}$ is equal to that of the base family $\mathcal{H}$.
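As a small worked example of this parameter setting (the base values $p_1 = 0.9$ and $p_2 = 0.5$ are illustrative):

```python
import math

def lsh_parameters(p1, p2, n):
    """Amplify a base (r, cr, p1, p2)-sensitive family: returns (k, L, rho).

    k is chosen so that P2 = p2^k <= 1/n; then P1 = p1^k >= n^(-rho) * p1,
    and L = O(n^rho / p1) tables give a constant success probability.
    """
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))  # k = ceil(log_{1/p2} n)
    L = math.ceil(n ** rho / p1)                   # constant factor omitted
    return k, L, rho

# example: p1 = 0.9, p2 = 0.5, n = 10**6  ->  k = 20, L = 10, rho ~ 0.15
print(lsh_parameters(0.9, 0.5, 10**6))
```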
The parameter $\rho = (\log 1/P_1)/(\log 1/P_2)$ is thus the main measure of quality of an LSH
family or, equivalently, a (randomized) space partitioning. We summarize the above in the
following theorem, proved in the work of Indyk and Motwani [60] who introduced the LSH
scheme. We also sketch the entire algorithm in Figure 7.1.
Theorem 7.1 Fix $d, n \ge 1$, threshold $r > 0$, and approximation $c > 1$. Suppose there
exists some $(r, cr, P_1, P_2)$-sensitive LSH family. Then there exists an algorithm for the
$c$-approximate $r$-near neighbor problem using $O(n^{1+\rho}/P_1 + nd)$ storage and preprocessing
time. The query time is composed of $O(n^{\rho}/P_1 \cdot \log_{1/P_2} n)$ computations of a hash function
and $O(n^{\rho}/P_1)$ distance computations.
Because the query time includes the time to evaluate a hash function from the LSH family,
it depends on the actual choice of the LSH family. Furthermore, the stated space bound
does not include the space to store the description of the hash functions (which is typically
smaller than the other terms).
7.3.2 LSH Families
To complete the algorithm, we need to construct some actual LSH families. A number of
LSH families have been proposed over the years. As mentioned above, the main quality
parameter is $\rho$ (as a function of $c$). The second important parameter is the run-time to
evaluate a hash function. We present a few important families below, and refer to [12] for a
more comprehensive survey of the LSH families.
Preprocessing:
(1) Choose $L$ functions $g_i$, $i = 1, \ldots, L$, by setting $g_i = (h_{i,1}, h_{i,2}, \ldots, h_{i,k})$, where $h_{i,1}, \ldots, h_{i,k}$ are chosen at random from the LSH family $\mathcal{H}$.
(2) Construct $L$ hash tables, where, for each $j = 1, \ldots, L$, the $j$th hash table contains the dataset points hashed using the function $g_j$.

Query algorithm for a query point $q$:
For each $i = 1, 2, \ldots, L$:
(1) Retrieve the points from the bucket $g_i(q)$ in the $i$th hash table.
(2) For each of the retrieved points, compute the distance from $q$ to it. If the point is a correct answer, report it and stop.

FIGURE 7.1
Preprocessing and query algorithms based on LSH. For an $(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$,
we set $k = \lceil \log_{1/p_2} n \rceil$ and $L = O(n^{\rho}/p_1)$, where $\rho = (\log 1/p_1)/(\log 1/p_2)$.
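Finally, a minimal end-to-end sketch of the scheme in Figure 7.1, assuming bit sampling over $\{0,1\}^d$ (Section 7.3.2.1) as the base family $\mathcal{H}$, so that distances are Hamming distances; the class name and the parameter values in the usage example are illustrative rather than tuned via the formulas in the caption.

```python
import numpy as np
from collections import defaultdict

def hamming(p, q):
    return int(np.sum(p != q))

class LSHNearNeighbor:
    """c-approximate r-near neighbor search following Figure 7.1,
    with bit sampling over {0,1}^d as the base LSH family."""

    def __init__(self, P, r, c, k, L, rng):
        self.P, self.r, self.c = P, r, c
        n, d = P.shape
        # Preprocessing (1): choose L functions g_i = (h_{i,1}, ..., h_{i,k}),
        # where each h is a random coordinate (bit sampling).
        self.coords = rng.integers(d, size=(L, k))
        # Preprocessing (2): L hash tables, the ith keyed by g_i(p).
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx in range(n):
            for i in range(L):
                self.tables[i][tuple(P[idx, self.coords[i]])].append(idx)

    def query(self, q):
        # Query: for each table, scan the bucket g_i(q) and report the
        # first point within distance c*r (a correct answer), if any.
        for i, table in enumerate(self.tables):
            for idx in table.get(tuple(q[self.coords[i]]), []):
                if hamming(self.P[idx], q) <= self.c * self.r:
                    return self.P[idx]
        return None  # fails with small probability even if an r-near neighbor exists

# usage sketch: in practice k and L are set as in the caption of Figure 7.1
rng = np.random.default_rng(0)
P = rng.integers(0, 2, size=(1000, 128))
index = LSHNearNeighbor(P, r=10, c=2, k=15, L=20, rng=rng)
q = P[3].copy()
q[:5] ^= 1                      # a query at Hamming distance 5 from P[3]
print(index.query(q) is not None)
```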