The Hausdorff metric contains a copy of ℓ_∞, and hence, it is a natural host space. A few such embeddings into the Hausdorff metric, for different types of metrics X, are shown in [47]. Are there even better host spaces? ℓ_∞ is another natural target, especially as it contains any other metric. However, as a host, ℓ_∞ often requires prohibitively high dimension: even embedding the Hamming cube {0, 1}^d into ℓ_∞ requires dimension exponential in d (see arguments from [67, Chapter 1, Section 8]). Hence, new candidates are needed.
It turns out that a mixture of norms is a qualitatively better host. Let us define what we mean by mixture first. The norm ℓ_p^d(ℓ_q^k) is the space of d × k matrices, where we compute the norm by first taking the ℓ_q norm of each row, and then taking the ℓ_p norm of the resulting d-dimensional vector. The strength of such mixed norms has been demonstrated in [14], which showed that the aforementioned Ulam metric over length-d strings (raised to power 0.9) embeds into ℓ_2[ℓ_∞(ℓ_1)], with dimensions bounded polynomially in d. The distortion of the embedding is constant. By contrast, embedding into any of the component spaces provably requires super-constant distortion (see, e.g., Table 7.1 and [67, Chapter 1, Section 8]).
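To make the definition of a mixed norm concrete, here is a small Python sketch (our own illustration; the function name mixed_norm is not from the chapter) that computes the ℓ_p^d(ℓ_q^k) norm of a d × k matrix.

import numpy as np

def mixed_norm(A, p, q):
    # l_q norm of each row, then l_p norm of the resulting d-dimensional vector
    row_norms = np.linalg.norm(A, ord=q, axis=1)
    return np.linalg.norm(row_norms, ord=p)

# For p = q = 2 the mixed norm coincides with the Frobenius norm.
A = np.arange(6.0).reshape(3, 2)
assert np.isclose(mixed_norm(A, 2, 2), np.linalg.norm(A))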
7.5 Sketching
A generalization of embedding is the notion of sketching. Sketching can be thought of as a very weak embedding into a computational space, where the host distance is an arbitrary computation (e.g., not restricted to be a metric). The main parameter is now the size of the sketch, which can be thought of as the host dimension. Sketches address the most basic, decision version of the distance estimation problem, termed the distance threshold estimation problem (DTEP) [91]. The goal of DTEP is to distinguish close pairs from far pairs with some probability (similar to Definition 7.3 of the near neighbor problem).
Definition 7.5 Fix some metric (M, d_M), approximation D ≥ 1, threshold r > 0, failure probability δ ∈ [0, 1/3], and size s ∈ N. The map φ : M → {0, 1}^s is called a sketch of M if there exists a referee algorithm R : {0, 1}^s × {0, 1}^s → {0, 1}, such that for any x, y ∈ M, the algorithm R, given φ(x) and φ(y), can distinguish between d_M(x, y) ≤ r (close pair) versus d_M(x, y) > Dr (far pair), with probability 1 − δ.
Sketching is useful in many applications, including the NNS problem. In particular, we can use a sketch of size s (for, say, δ = 0.1) to construct an NNS data structure with space n^{O(s)}. The query time is O(log n) times the time to evaluate φ on the query point. Indeed, from the basic sketch, we can also construct an amplified sketch, which has O(s log n) size and failure probability at most 1/n^2: keep k = O(log n) independent copies of the basic sketch, with the referee algorithm taking the majority vote of those for the k sketches. For a given query point q, this amplified sketch, termed φ^k(q), is sufficient to determine the approximate near neighbor in an n-point dataset P with probability at least 1 − 1/n: run the referee algorithm on φ^k(q), φ^k(p) for each p ∈ P. Because this procedure uses only the sketch φ^k(q), of size ks, we can construct an index for each possible ks-bit input, storing the solution. Overall, we obtain a data structure with 2^{ks} = n^{O(s)} space.
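The amplification step can be illustrated with the following Python sketch (ours, purely illustrative): it keeps k independent copies of a toy basic sketch for the Hamming cube (here, a single sampled bit, which is only a stand-in for a real basic sketch satisfying Definition 7.5) and lets the referee take a majority vote over the k basic decisions.

import random

def basic_sketch(x, seed):
    # toy basic sketch: one randomly sampled bit of x (seed = shared randomness)
    return x[random.Random(seed).randrange(len(x))]

def basic_referee(bx, by):
    return 1 if bx == by else 0          # 1 = "close pair"

def amplified_sketch(x, k):
    # phi^k(x): k independent copies of the basic sketch
    return [basic_sketch(x, seed) for seed in range(k)]

def amplified_referee(sx, sy):
    # majority vote over the k basic referee outcomes
    votes = sum(basic_referee(a, b) for a, b in zip(sx, sy))
    return 1 if 2 * votes > len(sx) else 0

x, y = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print(amplified_referee(amplified_sketch(x, 9), amplified_sketch(y, 9)))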
Sketches have many applications beyond NNS, including in data-streaming algorithms
and compressed sensing, in part by providing an alternative notion of dimension reduction
(as we will see later). More recently, sketching has found uses in speeding up algorithms, for example, in numerical linear algebra [95].
Some of these applications use another important variant of sketches, where M is a norm and we define φ : M → R^s to be a linear map. In this case, we think of s as the number of measurements, as in the compressed sensing literature. Such a linear sketch has the advantage of being easy to update, because φ(x + a) = φ(x) + φ(a), where x is the current vector and a is an update vector. As such, it has many applications in streaming.
7.5.1 Sketch Constructions
The first and most useful constructions are for the Euclidean and Hamming spaces. Again, the JL dimension reduction for ℓ_2 already gives a sketch. In particular, dimension reduction is a linear sketch into s = O[(1/ε^2) log 1/δ] dimensions for D = 1 + ε approximation (see also [6]).
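A minimal version of such a linear sketch, in Python (our illustration; the decision threshold is one reasonable choice, not the chapter's), looks as follows.

import numpy as np

def make_jl_sketch(s, d, rng=np.random.default_rng(0)):
    # linear map phi(x) = Ax with i.i.d. Gaussian entries, scaled so E[||Ax||^2] = ||x||^2
    A = rng.normal(size=(s, d)) / np.sqrt(s)
    return lambda x: A @ x

def referee_l2(phi_x, phi_y, r, D):
    # declare "close pair" iff the sketched distance falls below a threshold between r and Dr
    return np.linalg.norm(phi_x - phi_y) <= r * (1 + D) / 2

# usage: phi = make_jl_sketch(s=200, d=10000); referee_l2(phi(x), phi(y), r, D)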
Most interestingly, ℓ_1 admits a similar linear sketch, giving us a weaker notion of dimension reduction for ℓ_1. The sketch is φ(p) = (1/k)Ap, in which A is a k × d matrix whose entries are drawn i.i.d. from the Cauchy distribution [58]. The referee algorithm, on inputs φ(p), φ(q), then computes the median of the absolute values of the coordinates of φ(p) − φ(q). Note that this is different from the aforementioned dimension reduction, which can be thought of as a sketch whose referee algorithm computes ‖φ(p) − φ(q)‖. The median operation is what makes this host space nonmetric.
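The following Python snippet (our illustration of the construction just described) draws the Cauchy matrix, applies the linear map φ(p) = (1/k)Ap, and uses the median of absolute coordinates as the referee's distance estimate.

import numpy as np

rng = np.random.default_rng(1)
d, k = 500, 400
A = rng.standard_cauchy(size=(k, d))

def phi(v):
    return (A @ v) / k                    # linear sketch phi(p) = (1/k) A p

def l1_estimate(phi_p, phi_q):
    # median of |coordinates| of phi(p) - phi(q); rescaled by k, it concentrates
    # around ||p - q||_1 (the median of |Cauchy| is 1)
    return k * np.median(np.abs(phi_p - phi_q))

p, q = rng.random(d), rng.random(d)
print(np.sum(np.abs(p - q)), l1_estimate(phi(p), phi(q)))   # the two should be close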
LSH is also a sketch, with a special property that the referee algorithm just checks for the equality of the two arguments. Although Definition 7.2 of LSH does not immediately imply a size bound, one can easily notice that an LSH function leads to a sketch of size dependent on P_1 and P_2 only. Rather than describing that, we give a direct sketch construction for the Hamming space.
For the Hamming space, Indyk and Motwani [60] and Kushilevitz et al. [76] show a sketch achieving a size of s = O[(1/ε^2) log 1/δ] bits, for approximation D = 1 + ε. The sketch is a modification of the bit-sampling LSH from Section 7.3.2. For a fixed threshold r ≤ d/2, the coordinate i ∈ [s] of the map φ(p) is simply defined as the exclusive-or of k = d/r randomly chosen bits from p. The referee algorithm for inputs φ(p), φ(q) just computes the Hamming distance between φ(p) and φ(q): the output is "close pair" iff the fraction of differing coordinates is less than a certain threshold.
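A Python rendering of this construction (our own sketch; the decision threshold below is left as a parameter rather than the exact value from [60,76]) is given here.

import random

def hamming_sketch(p, s, r, seed=0):
    # coordinate i of phi(p) is the XOR (parity) of k = d/r randomly chosen bits of p;
    # the seed plays the role of the randomness shared by all points
    d = len(p)
    k = max(1, d // r)
    rnd = random.Random(seed)
    subsets = [[rnd.randrange(d) for _ in range(k)] for _ in range(s)]
    return [sum(p[i] for i in S) % 2 for S in subsets]

def hamming_referee(sp, sq, threshold):
    # fraction of coordinates where the two sketches differ; "close pair" iff small
    frac = sum(a != b for a, b in zip(sp, sq)) / len(sp)
    return frac < threshold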
This sketch plays a very central role for a few reasons. First, it also implies a similar sketch for ℓ_2 of constant size. Second, all sketches of constant size for other metrics M are essentially obtained via a two-step process: first embed M into the Hamming space, and then apply the above sketch. Third, recent results suggest that, at least for norms, this approach is essentially the only one available for obtaining constant-size sketches [18].
Nonetheless, there exist other methods to obtain sketches of nontrivial size, namely, sublinear in the dimension of the input. For example, for the Ulam metric, one can obtain a sketch of polylogarithmic size, log^{O(1)} d, with constant approximation [14]. For the EMD metric, the best sketch achieves size d^ε for O(1/ε) approximation [9]. There are also methods for obtaining sketches for arbitrary mixed norms [17,65].
Finally, an important aspect of the sketch definition is that we can prove impossibility
results for them. Sketching lower bounds usually follow via communication complexity
arguments [75] for the following communication game. Two players, called Alice and Bob,
each have an input point x and y, respectively, and they need to solve the DTEP problem.
For this, Alice and Bob each send a message of length s to the referee, who is to decide on
the problem. It is easy to see that lower bounds on communication complexity of this game
imply sketching lower bounds.
For example, one can prove that the Hamming sketch from above is optimal [62,64,94]. Many other lower bounds have since been proven. Interestingly, in light of the sketch for ℓ_1, lower bounds on sketching complexity often imply nonembeddability statements as well (see, e.g., [16]).
7.6 Small Dimension
Yet another approach to high-dimensional questions is to assume that the dataset has additional structure, for example, that it is intrinsically low dimensional. Such assumptions are motivated by the fact that, in many datasets, the data are really explained by a few relevant parameters, and the other parameters are just derivatives of those. The simplest such example is when a d-dimensional dataset lies inside a low-dimensional subspace S ⊂ R^d of dimension k ≪ d. In such cases, we would like to obtain solutions with a performance as if the dimension were effectively k. If k is small, we may be able to afford solutions that have exponential dependence on k (e.g., (1/ε)^{O(k)} for 1 + ε approximation).
Although the above subspace assumption may be too naïve, there are other, more realistic definitions. Particular attention has been drawn to the notion of doubling dimension, defined for a dataset P as follows: let λ be the smallest integer such that, for any p ∈ P and radius r, the set of points of P that are within distance r of p can be covered by at most λ balls of radius r/2. The doubling dimension is then log λ. It is easy to check that the d-dimensional ℓ_2 and ℓ_1 spaces have doubling dimension O(d). The notion, introduced in [32,52], has been inspired by the notion of the Assouad constant [22]. Some NNS algorithms designed with this notion in mind include [24,73] and others. For example, Indyk and Naor [54] showed that, given a point set with doubling dimension k, one can use the JL lemma to project it into dimension m = O(k/ε^2 · log 1/ε) only, which suffices for preserving the approximate nearest neighbor.
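To make the definition concrete, the following Python sketch (ours; it uses a greedy cover, so it returns an upper bound on the minimum number of balls) computes the doubling constant λ of a small finite point set, from which the doubling dimension is log2 λ.

import numpy as np

def doubling_constant(P):
    # P: n x d array of points; returns (an upper bound on) the smallest lambda such
    # that every ball B(p, r) within P is covered by lambda balls of radius r/2
    n = len(P)
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)   # pairwise distances
    lam = 1
    for p in range(n):
        for r in np.unique(D[p]):
            if r == 0:
                continue
            ball = [i for i in range(n) if D[p, i] <= r]
            uncovered, count = set(ball), 0
            while uncovered:                                    # greedy r/2-cover
                c = uncovered.pop()
                uncovered -= {i for i in ball if D[c, i] <= r / 2}
                count += 1
            lam = max(lam, count)
    return lam   # doubling dimension of P is about log2(lam)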
Other notions of intrinsic dimension include Karger–Ruhl dimension [70], smooth
manifolds [23,34], and others [1,31,33,41,46,54].
7.7 Conclusion
As mentioned at the beginning of the chapter, the focus here is on techniques for high-dimensional geometry, in particular as applied to the NNS problem.
Many of the same techniques apply in other contexts. For example, some other computational problems in the realm of high-dimensional geometry include the closest pair problem, or variants of clustering. A number of such problems admit efficient solutions based on efficient NNS algorithms, or on the aforementioned techniques directly (see [57] for examples).
For instance, consider the closest pair problem: given a set of n points, we are to find the
closest pair inside it. We can solve the problem using NNS: Just construct an NNS data
structure, and then query each point to find its closest match. In fact, until recently, this has
been the fastest way to solve the (approximate) closest pair problem. Valiant [92] showed
how to obtain faster solutions for a random dataset, exploiting faster matrix multiplication
methods.
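As a simple illustration of this reduction (our code, using scikit-learn's exact nearest-neighbor index as a stand-in for an approximate NNS data structure):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def closest_pair(P):
    # build one NNS index, then query every point; neighbor 0 of each point is itself
    nn = NearestNeighbors(n_neighbors=2).fit(P)
    dist, idx = nn.kneighbors(P)
    i = int(np.argmin(dist[:, 1]))
    return i, int(idx[i, 1]), float(dist[i, 1])

P = np.random.default_rng(0).random((100, 20))
print(closest_pair(P))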
Still, there are other important high-dimensional geometry topics left uncovered here. The most prominent such topic is that of embedding finite metrics into ℓ_1 or other simple host spaces. Introduced to theoretical computer science in the seminal paper of [78], such embeddings have had a big impact on the design of approximation algorithms, for example, for the sparsest cut problem [20,21,89]. We refer to the survey [59] for more details.
References
1. Amirali Abdullah, Alexandr Andoni, Ravindran Kannan, and Robert Krauthgamer.
Spectral approaches to nearest neighbor search. In Proceedings of the Symposium on
Foundations of Computer Science, Philadelphia, PA, 2014.
2. Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss
with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003. (Special issue on PODS
2001 [Santa Barbara, CA].)
3. Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and
approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.
4. Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss
transform. ACM Trans. Algor., 9(3):21, 2013. (Previously in SODA’11.)
5. Noga Alon. Problems and results in extremal combinatorics I. Discrete Math., 273:
31–53, 2003.
6. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating
the frequency moments. J. Comput. Syst. Sci., 58:137–147, 1999. (Previously appeared
in STOC’96.)
7. Alexandr Andoni, Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni.
Locality-sensitive hashing scheme based on p-stable distributions. Nearest Neighbor
Methods for Learning and Vision: Theory and Practice, Neural Processing Information
Series. MIT Press, Cambridge, MA, 2006.
8. Alexandr Andoni, Michel Deza, Anupam Gupta, Piotr Indyk, and Sofya Raskhodnikova.
Lower bounds for embedding edit distance into normed spaces. In Proceedings of the
ACM-SIAM Symposium on Discrete Algorithms, pp. 523–526, 2003.
9. Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David Woodruff. Efficient sketches
for earth-mover distance, with applications. In Proceedings of the Symposium on
Foundations of Computer Science, 2009.
10. Alexandr Andoni and Piotr Indyk. Efficient algorithms for substring near neighbor
problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms,
pp. 1203–1212, 2006.
11. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate
nearest neighbor in high dimensions. In Proceedings of the Symposium on Foundations
of Computer Science, pp. 459–468, 2006.
12. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate
nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
13. Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Earth mover distance over
high-dimensional spaces. In Proceedings of the ACM-SIAM Symposium on Discrete
Algorithms, pp. 343–352, 2008. (Previously ECCC Report TR07-048.)
14. Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Overcoming the ℓ_1 non-embeddability barrier: Algorithms for product metrics. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 865–874, 2009.
15. Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. Beyond locality-
sensitive hashing. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms,
2014. Full version at http://arxiv.org/abs/1306.1547.
16. Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating
edit distance. SIAM J. Comput. (SICOMP), 39(6):2398–2429, 2010. (Previously in
FOCS'07.)
17. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms
from precision sampling. In Proceedings of the Symposium on Foundations of Computer
Science, 2011. Full version at http://arxiv.org/abs/1011.1263.
18. Alexandr Andoni, Robert Krauthgamer, and Ilya Razenshteyn. Sampling and embed-
ding are equivalent for norms. In Proceedings of the Symposium on Theory of Computing,
2015. Full version at http://arxiv.org/abs/1411.2577.
19. Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approxi-
mate near neighbors. In Proceedings of the Symposium on Theory of Computing, 2015.
Full version at http://arxiv.org/abs/1501.01062.
20. Sanjeev Arora, James R. Lee, and Assaf Naor. Euclidean distortion and the sparsest
cut. In Proceedings of the Symposium on Theory of Computing, pp. 553–562, 2005.
21. Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings,
and graph partitionings. In Proceedings of the Symposium on Theory of Computing,
pp. 222–231, 2004.
22. Patrice Assouad. Plongements lipschitziens dans R^n. Bull. Soc. Math. France, 111(4):429–448, 1983.
23. Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds.
In Foundations of Computational Mathematics, pp. 941–944, 2006.
24. Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor.
In Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104,
2006.
25. Bo Brinkman and Moses Charikar. On the impossibility of dimension reduction in L1.
J. ACM, 52(5):766–788, 2005.
26. Andrei Broder. On the resemblance and containment of documents. In Proceedings of
Compression and Complexity of Sequences, pp. 21–29, 1997.
27. Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. Syntactic cluster-
ing of the web. In Proceedings of the 6th International World Wide Web Conference,
pp. 391–404, 1997.
28. Moses Charikar. Similarity estimation techniques from rounding. In Proceedings of the
Symposium on Theory of Computing, pp. 380–388, 2002.