The Hausdorff metric contains a copy of ℓ_∞, and hence, it is a natural host space. A few such embeddings into the Hausdorff metric, for different types of metrics X, are shown in [47]. Are there even better host spaces? ℓ_∞ is another natural target, especially as it contains any other metric. However, as a host, ℓ_∞ often requires prohibitively high dimension: even embedding the Hamming cube {0, 1}^d into ℓ_∞ requires dimension exponential in d (see arguments from [67, Chapter 1, Section 8]). Hence, new candidates are needed.
It turns out that a mixture of norms is a qualitatively better host. Let us define what we mean by mixture first. The norm ℓ_p^d(ℓ_q^k) is the space of d × k matrices, where we compute the norm by first taking the ℓ_q norm of each row, and then taking the ℓ_p norm of the resulting d-dimensional vector. The strength of such mixed norms has been demonstrated in [14], which showed that the aforementioned Ulam metric over length-d strings (raised to power 0.9) embeds into ℓ_2[ℓ_∞(ℓ_1)], with dimensions bounded polynomially in d. The distortion of the embedding is constant. By contrast, embedding into any of the component spaces provably requires super-constant distortion (see, e.g., Table 7.1 and [67, Chapter 1, Section 8]).
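To make the definition of a mixed norm concrete, here is a small Python sketch (our own illustration; the function name mixed_norm is not from the chapter) that computes the ℓ_p^d(ℓ_q^k) norm of a d × k matrix.

import numpy as np

def mixed_norm(A, p, q):
    # l_q norm of each row, then l_p norm of the resulting d-dimensional vector
    row_norms = np.linalg.norm(A, ord=q, axis=1)
    return np.linalg.norm(row_norms, ord=p)

# For p = q = 2 the mixed norm coincides with the Frobenius norm.
A = np.arange(6.0).reshape(3, 2)
assert np.isclose(mixed_norm(A, 2, 2), np.linalg.norm(A))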
7.5 Sketching
A generalization of embedding is the notion of sketching. Sketching can be thought of as a very weak embedding into a computational space, where the host distance is an arbitrary computation (e.g., not restricted to be a metric). The main parameter is now the size of the sketch, which can be thought of as the host dimension. Sketches address the most basic, decision version of the distance estimation problem, termed the distance threshold estimation problem (DTEP) [91]. The goal of DTEP is to distinguish close pairs from far pairs with some probability (similar to Definition 7.3 of the near neighbor problem).
Definition 7.5 Fix some metric (M, d_M), approximation D ≥ 1, threshold r > 0, failure probability δ ∈ [0, 1/3], and size s ∈ N. The map φ : M → {0, 1}^s is called a sketch of M if there exists a referee algorithm R : {0, 1}^s × {0, 1}^s → {0, 1}, such that for any x, y ∈ M, the algorithm R, given φ(x) and φ(y), can distinguish between d_M(x, y) ≤ r (close pair) versus d_M(x, y) > Dr (far pair), with probability 1 − δ.
Sketching is useful in many applications, including the NNS problem. In particular, we can use a sketch of size s (for, say, δ = 0.1) to construct an NNS data structure with space n^{O(s)}. The query time is O(log n) times the time to evaluate φ on the query point. Indeed, from the basic sketch, we can also construct an amplified sketch, which has O(s log n) size and failure probability at most 1/n^2: keep k = O(log n) independent copies of the basic sketch, with the referee algorithm taking the majority vote of those for the k sketches. For a given query point q, this amplified sketch, termed φ^k(q), is sufficient to determine the approximate near neighbor in an n-point dataset P with probability at least 1 − 1/n: run the referee algorithm on φ^k(q), φ^k(p) for each p ∈ P. Because this procedure uses only the sketch φ^k(q), of size ks, we can construct an index for each possible ks-bit input, storing the solution. Overall, we obtain a data structure with 2^{ks} = n^{O(s)} space.
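The amplification step can be illustrated with the following Python sketch (ours, purely illustrative): it keeps k independent copies of a toy basic sketch for the Hamming cube (here, a single sampled bit, which is only a stand-in for a real basic sketch satisfying Definition 7.5) and lets the referee take a majority vote over the k basic decisions.

import random

def basic_sketch(x, seed):
    # toy basic sketch: one randomly sampled bit of x (seed = shared randomness)
    return x[random.Random(seed).randrange(len(x))]

def basic_referee(bx, by):
    return 1 if bx == by else 0          # 1 = "close pair"

def amplified_sketch(x, k):
    # phi^k(x): k independent copies of the basic sketch
    return [basic_sketch(x, seed) for seed in range(k)]

def amplified_referee(sx, sy):
    # majority vote over the k basic referee outcomes
    votes = sum(basic_referee(a, b) for a, b in zip(sx, sy))
    return 1 if 2 * votes > len(sx) else 0

x, y = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print(amplified_referee(amplified_sketch(x, 9), amplified_sketch(y, 9)))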
Sketches have many applications beyond NNS, including in data-streaming algorithms
and compressed sensing, in part by providing an alternative notion of dimension reduction
(as we will see later). More recently, sketching has found uses in speeding up algorithms, for example, in numerical linear algebra [95].
Some of these applications use another important variant of sketches, where M is a norm and we define φ : M → R^s to be a linear map. In this case, we think of s as the number of measurements, as in the compressed sensing literature. Such a linear sketch has the advantage of being easy to update, because φ(x + a) = φ(x) + φ(a), where x is the current vector and a is an update vector. As such, it has many applications in streaming.
7.5.1 Sketch Constructions
The first and most useful constructions are for the Euclidean and Hamming spaces. Again, the JL dimension reduction for ℓ_2 already gives a sketch. In particular, dimension reduction is a linear sketch into s = O[(1/ε^2) log 1/δ] dimensions for D = 1 + ε approximation (see also [6]).
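A minimal version of such a linear sketch, in Python (our illustration; the decision threshold is one reasonable choice, not the chapter's), looks as follows.

import numpy as np

def make_jl_sketch(s, d, rng=np.random.default_rng(0)):
    # linear map phi(x) = Ax with i.i.d. Gaussian entries, scaled so E[||Ax||^2] = ||x||^2
    A = rng.normal(size=(s, d)) / np.sqrt(s)
    return lambda x: A @ x

def referee_l2(phi_x, phi_y, r, D):
    # declare "close pair" iff the sketched distance falls below a threshold between r and Dr
    return np.linalg.norm(phi_x - phi_y) <= r * (1 + D) / 2

# usage: phi = make_jl_sketch(s=200, d=10000); referee_l2(phi(x), phi(y), r, D)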
Most interestingly, ℓ_1 admits a similar linear sketch, giving us a weaker notion of dimension reduction for ℓ_1. The sketch is φ(p) = (1/k)Ap, in which A is a k × d matrix whose entries are drawn i.i.d. from the Cauchy distribution [58]. The referee algorithm, on inputs φ(p), φ(q), then computes the median of the absolute values of the coordinates of φ(p) − φ(q). Note that this is different from the aforementioned dimension reduction, which can be thought of as a sketch whose referee algorithm computes ‖φ(p) − φ(q)‖. The median operation is what makes this host space nonmetric.
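The following Python snippet (our illustration of the construction just described) draws the Cauchy matrix, applies the linear map φ(p) = (1/k)Ap, and uses the median of absolute coordinates as the referee's distance estimate.

import numpy as np

rng = np.random.default_rng(1)
d, k = 500, 400
A = rng.standard_cauchy(size=(k, d))

def phi(v):
    return (A @ v) / k                    # linear sketch phi(p) = (1/k) A p

def l1_estimate(phi_p, phi_q):
    # median of |coordinates| of phi(p) - phi(q); rescaled by k, it concentrates
    # around ||p - q||_1 (the median of |Cauchy| is 1)
    return k * np.median(np.abs(phi_p - phi_q))

p, q = rng.random(d), rng.random(d)
print(np.sum(np.abs(p - q)), l1_estimate(phi(p), phi(q)))   # the two should be close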
LSH is also a sketch, with a special property that the referee algorithm just checks for the equality of the two arguments. Although Definition 7.2 of LSH does not immediately imply a size bound, one can easily notice that an LSH function leads to a sketch of size dependent on P_1 and P_2 only. Rather than describing that, we give a direct sketch construction for the Hamming space.
For the Hamming space, Indyk and Motwani [60] and Kushilevitz et al. [76] show a sketch achieving a size of s = O[(1/ε^2) log 1/δ] bits, for approximation D = 1 + ε. The sketch is a modification of the bit-sampling LSH from Section 7.3.2. For a fixed threshold r ≤ d/2, the coordinate i ∈ [s] of the map φ(p) is simply defined as the exclusive-or of k = d/r randomly chosen bits from p. The referee algorithm for inputs φ(p), φ(q) just computes the Hamming distance between φ(p) and φ(q): the output is "close pair" iff the fraction of differing coordinates is less than a certain threshold.
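A Python rendering of this construction (our own sketch; the decision threshold below is left as a parameter rather than the exact value from [60,76]) is given here.

import random

def hamming_sketch(p, s, r, seed=0):
    # coordinate i of phi(p) is the XOR (parity) of k = d/r randomly chosen bits of p;
    # the seed plays the role of the randomness shared by all points
    d = len(p)
    k = max(1, d // r)
    rnd = random.Random(seed)
    subsets = [[rnd.randrange(d) for _ in range(k)] for _ in range(s)]
    return [sum(p[i] for i in S) % 2 for S in subsets]

def hamming_referee(sp, sq, threshold):
    # fraction of coordinates where the two sketches differ; "close pair" iff small
    frac = sum(a != b for a, b in zip(sp, sq)) / len(sp)
    return frac < threshold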
This sketch plays a very central role for a few reasons. First, it also implies a similar sketch for ℓ_2 of constant size. Second, all sketches of constant size for other metrics M are essentially obtained via a two-step process: first embed M into the Hamming space, and then apply the above sketch. Third, recent results suggest that, at least for norms, this approach is essentially the only one available for obtaining constant-size sketches [18].
Nonetheless, there exist other methods to obtain sketches of nontrivial size, namely, sublinear in the dimension of the input. For example, for the Ulam metric, one can obtain a sketch of polylogarithmic size, log^{O(1)} d, with constant approximation [14]. For the EMD metric, the best sketch achieves size d^ε for O(1/ε) approximation [9]. There are also methods for obtaining sketches for arbitrary mixed norms [17,65].
Finally, an important aspect of the sketch definition is that we can prove impossibility
results for them. Sketching lower bounds usually follow via communication complexity
arguments [75] for the following communication game. Two players, called Alice and Bob,
each have an input point x and y, respectively, and they need to solve the DTEP problem.
For this, Alice and Bob each send a message of length s to the referee, who is to decide on
the problem. It is easy to see that lower bounds on communication complexity of this game
imply sketching lower bounds.
For example, one can prove that the Hamming sketch from above is optimal [62,64,94]. Many other lower bounds have since been proven. Interestingly, in light of the sketch for ℓ_1, lower bounds on sketching complexity often imply nonembeddability statements as well (see, e.g., [16]).
7.6 Small Dimension
Yet another approach to high-dimensional questions is to assume that the dataset has additional structure, for example, that it is intrinsically low dimensional. Such assumptions are motivated by the fact that, in many datasets, the data are really explained by a few relevant parameters, and the other parameters are just derivatives of those. The simplest such example is when a d-dimensional dataset lies inside a low-dimensional subspace S ⊂ R^d of dimension k ≪ d. In such cases, we would like to obtain solutions with a performance as if the dimension were effectively k. If k is small, we may be able to afford solutions that have exponential dependence on k (e.g., (1/ε)^{O(k)} for 1 + ε approximation).
Although the above subspace assumption may be too naïve, there are other, more realistic definitions. Particular attention has been drawn to the notion of doubling dimension, defined for a dataset P as follows: let λ be the smallest integer such that, for any p ∈ P and radius r, the set of points of P that are within distance r of p can be covered by at most λ balls of radius r/2. The doubling dimension is then log λ. It is easy to check that the d-dimensional ℓ_2 and ℓ_1 spaces have doubling dimension O(d). The notion, introduced in [32,52], has been inspired by the notion of the Assouad constant [22]. Some NNS algorithms designed with this notion in mind include [24,73] and others. For example, Indyk and Naor [54] showed that, given a point set with doubling dimension k, one can use the JL lemma to project it into dimension m = O(k/ε^2 · log 1/ε) only, which suffices for preserving the approximate nearest neighbor.
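To make the definition concrete, the following Python sketch (ours; it uses a greedy cover, so it returns an upper bound on the minimum number of balls) computes the doubling constant λ of a small finite point set, from which the doubling dimension is log2 λ.

import numpy as np

def doubling_constant(P):
    # P: n x d array of points; returns (an upper bound on) the smallest lambda such
    # that every ball B(p, r) within P is covered by lambda balls of radius r/2
    n = len(P)
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)   # pairwise distances
    lam = 1
    for p in range(n):
        for r in np.unique(D[p]):
            if r == 0:
                continue
            ball = [i for i in range(n) if D[p, i] <= r]
            uncovered, count = set(ball), 0
            while uncovered:                                    # greedy r/2-cover
                c = uncovered.pop()
                uncovered -= {i for i in ball if D[c, i] <= r / 2}
                count += 1
            lam = max(lam, count)
    return lam   # doubling dimension of P is about log2(lam)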
Other notions of intrinsic dimension include Karger–Ruhl dimension [70], smooth
manifolds [23,34], and others [1,31,33,41,46,54].
7.7 Conclusion
As mentioned at the beginning of the chapter, the focus here is on techniques for high-dimensional geometry, in particular as applied to the NNS problem.
Many of the same techniques apply in other contexts. For example, some other computational problems in the realm of high-dimensional geometry include the closest pair problem, or variants of clustering. A number of such problems admit efficient solutions based on efficient NNS algorithms, or on the aforementioned techniques directly (see [57] for examples).
For instance, consider the closest pair problem: given a set of n points, we are to find the
closest pair inside it. We can solve the problem using NNS: Just construct an NNS data
structure, and then query each point to find its closest match. In fact, until recently, this has
been the fastest way to solve the (approximate) closest pair problem. Valiant [92] showed
how to obtain faster solutions for a random dataset, exploiting faster matrix multiplication
methods.
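As a simple illustration of this reduction (our code, using scikit-learn's exact nearest-neighbor index as a stand-in for an approximate NNS data structure):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def closest_pair(P):
    # build one NNS index, then query every point; neighbor 0 of each point is itself
    nn = NearestNeighbors(n_neighbors=2).fit(P)
    dist, idx = nn.kneighbors(P)
    i = int(np.argmin(dist[:, 1]))
    return i, int(idx[i, 1]), float(dist[i, 1])

P = np.random.default_rng(0).random((100, 20))
print(closest_pair(P))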
Still, there are other important high-dimensional geometry topics left uncovered here. The most prominent such topic is that of embedding finite metrics into ℓ_1 or other simple host spaces. Introduced to theoretical computer science in the seminal paper of [78], such embeddings have had a big impact on the design of approximation algorithms, for example, for the sparsest cut problem [20,21,89]. We refer to the survey [59] for more details.
References
1. Amirali Abdullah, Alexandr Andoni, Ravindran Kannan, and Robert Krauthgamer.
Spectral approaches to nearest neighbor search. In Proceedings of the Symposium on
Foundations of Computer Science, Philadelphia, PA, 2014.
2. Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss
with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003. (Special issue on PODS
2001 [Santa Barbara, CA].)
3. Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and
approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.
4. Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss
transform. ACM Trans. Algor., 9(3):21, 2013. (Previously in SODA’11.)
5. Noga Alon. Problems and results in extremal combinatorics I. Discrete Math., 273:
31–53, 2003.
6. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating
the frequency moments. J. Comput. Syst. Sci., 58:137–147, 1999. (Previously appeared
in STOC’96.)
7. Alexandr Andoni, Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni.
Locality-sensitive hashing scheme based on p-stable distributions. Nearest Neighbor
Methods for Learning and Vision: Theory and Practice, Neural Processing Information
Series. MIT Press, Cambridge, MA, 2006.
8. Alexandr Andoni, Michel Deza, Anupam Gupta, Piotr Indyk, and Sofya Raskhodnikova.
Lower bounds for embedding edit distance into normed spaces. In Proceedings of the
ACM-SIAM Symposium on Discrete Algorithms, pp. 523–526, 2003.
9. Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David Woodruff. Efficient sketches
for earth-mover distance, with applications. In Proceedings of the Symposium on
Foundations of Computer Science, 2009.
10. Alexandr Andoni and Piotr Indyk. Efficient algorithms for substring near neighbor
problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms,
pp. 1203–1212, 2006.
11. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate
nearest neighbor in high dimensions. In Proceedings of the Symposium on Foundations
of Computer Science, pp. 459–468, 2006.
12. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate
nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
13. Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Earth mover distance over
high-dimensional spaces. In Proceedings of the ACM-SIAM Symposium on Discrete
Algorithms, pp. 343–352, 2008. (Previously ECCC Report TR07-048.)
14. Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Overcoming the ℓ_1 non-embeddability barrier: Algorithms for product metrics. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 865–874, 2009.
15. Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. Beyond locality-
sensitive hashing. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms,
2014. Full version at http://arxiv.org/abs/1306.1547.
16. Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating
edit distance. SIAM J. Comput. (SICOMP), 39(6):2398–2429, 2010. (Previously in
FOCS'07.)
17. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms
from precision sampling. In Proceedings of the Symposium on Foundations of Computer
Science, 2011. Full version at http://arxiv.org/abs/1011.1263.
18. Alexandr Andoni, Robert Krauthgamer, and Ilya Razenshteyn. Sampling and embed-
ding are equivalent for norms. In Proceedings of the Symposium on Theory of Computing,
2015. Full version at http://arxiv.org/abs/1411.2577.
19. Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approxi-
mate near neighbors. In Proceedings of the Symposium on Theory of Computing, 2015.
Full version at http://arxiv.org/abs/1501.01062.
20. Sanjeev Arora, James R. Lee, and Assaf Naor. Euclidean distortion and the sparsest
cut. In Proceedings of the Symposium on Theory of Computing, pp. 553–562, 2005.
21. Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings,
and graph partitionings. In Proceedings of the Symposium on Theory of Computing,
pp. 222–231, 2004.
22. Patrice Assouad. Plongements lipschitziens dans R^n. Bull. Soc. Math. France, 111(4):429–448, 1983.
23. Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds.
In Foundations of Computational Mathematics, pp. 941–944, 2006.
24. Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor.
In Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104,
2006.
25. Bo Brinkman and Moses Charikar. On the impossibility of dimension reduction in L1.
J. ACM, 52(5):766–788, 2005.
26. Andrei Broder. On the resemblance and containment of documents. In Proceedings of
Compression and Complexity of Sequences, pp. 21–29, 1997.
27. Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. Syntactic cluster-
ing of the web. In Proceedings of the 6th International World Wide Web Conference,
pp. 391–404, 1997.
28. Moses Charikar. Similarity estimation techniques from rounding. In Proceedings of the
Symposium on Theory of Computing, pp. 380–388, 2002.