10
Something for (Almost) Nothing:
New Advances in Sublinear-Time Algorithms
Ronitt Rubinfeld and Eric Blais
CONTENTS
10.1 Introduction
    10.1.1 Clustering: An Illustrative Example
10.2 Testing Properties of Point Sets
    10.2.1 Clustering
    10.2.2 Surface Area
10.3 Testing Properties of Graphs
    10.3.1 Diameter
    10.3.2 Vertex Cover
10.4 Testing Properties of Functions
    10.4.1 Feature Selection
    10.4.2 Model Selection
    10.4.3 Data Quality
        10.4.3.1 Lipschitz Property
        10.4.3.2 Monotonicity
    10.4.4 Alternative Query and Sampling Models
10.5 Other Topics
References
10.1 Introduction
What computational problems can we solve when we only have time to look at a tiny
fraction of the data? This general question has been studied from many different angles
in statistics. More recently, with the proliferation of massive datasets, it has become a central question in computer science as well. Essentially, all of the research on
this question starts with a simple observation: Except for a very small number of special
cases, the only problems that can be solved in this very restrictive setting are those that
admit approximate solutions.
In this chapter, we focus on a formalization of approximate solutions that has been widely studied in an area of theoretical computer science known as property testing. Let X denote the underlying dataset, and consider the setting where this dataset represents a combinatorial object. Let P be any property of this type of combinatorial object. We say that X is ε-close to having property P if we can modify at most an ε fraction of X to obtain the description X′ of an object that does have property P; otherwise, we say that X
is ε-far from having the property. A randomized algorithm A is an ε-tester for P if it can distinguish with large constant probability* between datasets that represent objects with the property P and those that are ε-far from having the same property. (The algorithm A is free to output anything on inputs that do not have the property P but are also not ε-far from having this property; it is this leeway that will enable property testers to be so efficient.)
There is a close connection between property testing and the general problem of parameter estimation. Let X be a dataset and θ = θ(X) be any parameter of this dataset. For every threshold t, we can define the property of having θ ≤ t. If we have an efficient algorithm for testing this property, we can also use it to efficiently obtain an estimate θ̂ that is close to θ in the sense that θ̂ ≤ θ and the underlying object X is ε-close to another dataset X′ with θ(X′) = θ̂. Note that this notion of closeness is very different from the notions usually considered in parameter estimation; instead of measuring the quality of the estimate as a function L(θ, θ̂) of the true and estimated values of the parameter itself, here the quality of the estimate is a function of the underlying dataset.
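To make this reduction concrete, here is a minimal Python sketch of how a tester can be turned into such an estimator. It is a sketch under stated assumptions: tester(X, t, eps) is a hypothetical ε-tester for the property θ(X) ≤ t whose success probability has already been boosted enough to survive a union bound over all calls, and thresholds is an assumed finite grid of candidate values sorted in increasing order.

```python
def estimate_parameter(tester, X, thresholds, eps):
    """Turn a property tester for "theta(X) <= t" into an estimator.

    Scanning the candidate thresholds in increasing order, the smallest
    accepted t is (with high probability) no larger than the smallest
    candidate that is >= theta(X), since the tester accepts whenever
    theta(X) <= t.  Moreover, acceptance at t means (again with high
    probability) that X is not eps-far from the property, i.e., X is
    eps-close to some dataset X' with theta(X') <= t.
    """
    for t in thresholds:  # assumed sorted in increasing order
        if tester(X, t, eps):
            return t
    return None  # no candidate was accepted
```

A binary search over a sorted grid would reduce the number of tester invocations to the logarithm of the grid size; geometrically spaced grids of this kind are behind estimates such as Corollary 10.2 below.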
10.1.1 Clustering: An Illustrative Example
Let X be a set of n points in R^d. A fundamental question is how well the points in X can be clustered. This question can be formalized as follows. Let r > 0 be a fixed radius, and let θ_r be the minimum number of clusters of radius r that are required to capture all the points in X. The problem of determining θ_r exactly is NP-hard, as is the problem of estimating its value up to any constant factor [20,23,40]. But Alon et al. [1] show that the closely related problem of distinguishing datasets that can be clustered into at most k clusters of radius r from those that are far from having this property can be solved in time that is independent of the number of points in X.†
Theorem 10.1 [1] There is an algorithm that examines at most O((k · d)/ε) points in X and with large probability (i) accepts if the points in X can be grouped into at most k clusters of radius r and (ii) rejects if no subset of (1 − ε)n points in X can be grouped into k clusters.
The testing algorithm for clustering can be used to obtain an efficient estimator of the clustering parameter θ_r in the following sense.
Corollary 10.1. There is an algorithm that examines at most O((θ_r · d)/ε) points in X and returns a value θ̂_r such that with large probability θ̂_r ≤ θ_r and all but at most εn points in X can be partitioned into θ̂_r clusters of radius r.
We reiterate that, unlike in the standard parameter estimation setting, we have no guarantee on the closeness of θ_r and the estimate θ̂_r returned by the algorithm. Instead, we are guaranteed that most of the points can be clustered into θ̂_r clusters. This notion of approximation has a number of advantages.
* That is, with probability 1 − δ for some fixed constant δ < 1/2. With standard techniques, the success probability of a tester can easily be boosted to any arbitrary amount, so throughout this chapter, we will simply fix δ to be a small enough constant, say δ = 1/3, and write with large constant probability to mean with probability 1 − δ.
† Here and throughout the rest of the chapter, running times are in the model where samples or queries to X are completed in a constant amount of time. We will also use the tilde notation Õ(m) to hide polylog factors in m for clarity of presentation.
First, it is robust to noise. Adding a small number of outlier points can have a significant
impact on the clustering parameter θ
r
of a dataset. However, since the algorithm only
examines a constant number of points, with high probability, it does not observe any of the
outliers and so these noisy points do not affect the estimator. Furthermore, many of the
property testing algorithms can be made robust to higher noise rates, so that even if outliers
are observed, they do not affect the result by much. Similarly, the algorithm is appropriate when the underlying dataset is constantly changing; the same robustness means that the estimate changes smoothly as points are added to and removed from the dataset.
A second advantage is that it leads to extremely efficient algorithms. The algorithms are
so efficient that they can be used as a preprocessing step, even in situations where more
accurate estimates of the clustering parameter are required. This is the case, for example,
in settings where we are interested in learning how well the data can be clustered; here, the
testing algorithm can be used in a preprocessing step to quickly identify point sets that are
not even close to being clusterable into a reasonable number of clusters. When the preprocessing algorithm identifies such a point set, alternative learning algorithms can be applied to it instead.
The study of property testing was initiated in [6,26,55] and has been extremely active
ever since. In the rest of this chapter, we highlight some of the properties of point sets,
graphs, and functions that can be tested efficiently. Our goal is simply to provide a glimpse of
the results that can be obtained with the tools that have been developed in this research area.
More detailed discussions on these topics can be found in many other more comprehensive
surveys [12,24,34,50–54].
10.2 Testing Properties of Point Sets
A common machine learning problem involves extracting structural information from a set
X of points. As we have already seen in the introduction, sublinear-time algorithms can be
used to efficiently perform preprocessing tasks for these problems. In fact, these algorithms
have been shown to be useful in testing properties of both finite sets of points and (implicit
representations of) infinite sets of points. To illustrate the former, we revisit the clustering
problem in a bit more detail; for the latter, we show how sublinear-time algorithms can be
used to estimate the surface area of continuous sets.
10.2.1 Clustering
Let us return to the problem of clustering points. To illustrate the variety of formalizations
of this problem that are efficiently testable, let us now consider the setting where X is a
set of points in a general metric space and we bound the diameter of the clusters (i.e., the
maximum distance between any two points in the cluster) by some parameter d.
We can test if X can be clustered into at most k clusters of diameter d very efficiently
in the following way.
Theorem 10.2 [1] Let X be a set of points in a metric space. There is an algorithm that queries O((k log k)/ε) points in X and with large constant probability (i) accepts if X can be clustered into k clusters of diameter d and (ii) rejects if we cannot cluster the points into k clusters of diameter 2d even if we remove an ε fraction of the points in X.
The algorithm that achieves the bound in Theorem 10.2 maintains a set of cluster centers by repeatedly picking a sample point that is not covered by the previously chosen cluster centers. If more than k cluster centers are chosen by this process, then the algorithm rejects; otherwise, it accepts. The work of [1] shows that this simple algorithm satisfies the theorem’s requirements. Note that if the point set cannot be clustered into k clusters of diameter d, but can be clustered into k clusters of diameter 2d when less than an ε fraction of the points is thrown out, then it is OK for the algorithm to either accept or reject.
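To make the procedure concrete, here is a minimal Python sketch of this greedy sampling tester. It is a sketch under stated assumptions: sample_point is a hypothetical oracle that draws a uniformly random point of X, dist is the metric, and the leading constant in the sample size is an arbitrary placeholder rather than the constant from the analysis of [1].

```python
import math


def cluster_diameter_tester(sample_point, dist, k, d, eps):
    """Greedy sketch of a tester in the spirit of Theorem 10.2.

    Every sampled point that lies at distance > d from all current
    centers becomes a new center.  Any k + 1 centers are pairwise more
    than d apart, so by the pigeonhole principle they witness that X
    cannot be clustered into k clusters of diameter d.
    """
    num_samples = math.ceil(10 * k * math.log(k + 2) / eps)  # O((k log k)/eps)
    centers = []
    for _ in range(num_samples):
        p = sample_point()
        if all(dist(p, c) > d for c in centers):
            centers.append(p)
            if len(centers) > k:
                return False  # reject: k + 1 pairwise-far points found
    return True  # accept: samples are consistent with k clusters of diameter d
```

Note that the rejection witness is one-sided: k + 1 points that are pairwise more than d apart certify that no clustering into k clusters of diameter d exists, whereas acceptance is only a statistical guarantee about most of the points.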
In many situations, we might know that the points are not clusterable into k clusters, but
would still like to know if a very large fraction of the points can be clustered into k clusters.
Note that our standard definition of property testers is of no use in this tolerant version
of the clustering problem. The tolerant problem is in general more difficult, but several
tolerant clustering problems can also be solved with sublinear-time algorithms [12,48]. For
example, there is a tolerant analog of Theorem 10.2.
Theorem 10.3 [48] Given 0 < ε_1 < ε_2 < 1, let X be a set of points in a metric space. There is an algorithm that queries at most Õ(k/(ε_2 − ε_1)^2) points in X, and with large constant probability (i) accepts if at least a (1 − ε_1) fraction of X can be clustered into k clusters of diameter d, and (ii) rejects if we cannot cluster more than a (1 − ε_2) fraction of the points into k clusters of diameter 2d.
For more variants of the clustering problem considered in the property testing framework,
see, for example, [1,11,12,41,48].
10.2.2 Surface Area
The property testing framework can also be applied to settings where the dataset represents
a subset of a continuous space. Let S ⊆ [0, 1]^n be a subset of the unit n-dimensional hypercube, and consider any representation X of this subset for which algorithms can query an input x ∈ [0, 1]^n and observe whether x ∈ S or not.
A fundamental parameter of the set S is its surface area, defined by surf(S) = lim_{δ→0} |S^(δ/2)|/δ, where S^(δ/2) ⊆ [0, 1]^n is the set of points at distance at most δ/2 from the boundary of S. In the one-dimensional case, S ⊆ [0, 1] is a union of intervals and surf(S) corresponds to the number of disjoint intervals in S. And when n = 2, surf(S) corresponds to the perimeter of the set S.
Since the surface area of a set S can always be increased by an arbitrary amount while
modifying S itself on a set of measure 0, it is impossible to estimate the surface area of S
within any additive or multiplicative factor with any finite (or even countable) number of
queries. We can, however, test whether S has small surface area with a small number of
queries.
Theorem 10.4 [42] Fix any 0 < α < 1. There is an algorithm that queries at most O(1/ε) points and with large probability (i) accepts S when it has surface area at most α, and (ii) rejects S when every set S′ that differs from S on at most an ε measure has surface area surf(S′) > α.
Neeman’s theorem [42] was obtained by improving the original analysis of an algorithm due
to Kothari et al. [32], who were in turn building on previous work of Kearns and Ron [31]
and Balcan et al. [3]. The algorithm itself works in a way that is conceptually similar to
Buffon’s classic needle problem: It drops a (virtual) needle onto the space [0, 1]^n and queries the two endpoints of the needle. If exactly one of those endpoints is in S, then we know that the needle crossed the boundary of S. Estimating the probability of this event gives us
a measure of the noise sensitivity of S, and connections between the surface area and noise
sensitivity then lead to the proof of correctness of the algorithm.
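As an illustration, here is a minimal Monte Carlo sketch of this needle-dropping estimate in Python. It is a sketch under stated assumptions: member(x) is a hypothetical membership oracle for S, the needle length and the number of trials are arbitrary placeholders, and needles whose second endpoint leaves the cube are simply discarded instead of being handled carefully at the boundary.

```python
import math
import random


def estimate_needle_crossings(member, n, needle_len=0.01, trials=100_000):
    """Estimate the probability that a random needle of length
    `needle_len` dropped in [0, 1]^n crosses the boundary of S,
    i.e., that exactly one of its two endpoints lies in S."""
    crossings = 0
    kept = 0
    for _ in range(trials):
        x = [random.random() for _ in range(n)]
        # Pick a uniformly random direction by normalizing a Gaussian vector.
        g = [random.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in g))
        y = [xi + needle_len * gi / norm for xi, gi in zip(x, g)]
        if any(yi < 0.0 or yi > 1.0 for yi in y):
            continue  # simplification: discard needles leaving the cube
        kept += 1
        if member(x) != member(y):  # exactly one endpoint is in S
            crossings += 1
    return crossings / max(kept, 1)
```

The sampling itself is elementary; the substance of [42] lies in relating this crossing probability to the surface area of S, which is what turns the estimate into a tester.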
10.3 Testing Properties of Graphs
The setting in which the underlying dataset X represents a graph G is particularly well
suited for the design of sublinear-time algorithms. In this section, we explore two problems—
a connectivity problem and an optimization problem—to illustrate what sublinear-time
algorithms can do.
10.3.1 Diameter
The famous small world theory states that in any social group, every two members of the
group are connected to each other via a short chain of acquaintances [13]. In particular, the
six degrees of separation theory posits that everyone on Earth is connected by a chain of
acquaintances of length at most 6. (See [16] for an introduction to these and many other
interesting topics on graphs.) How efficiently can we test the small world theory on a given
social network?
The study of the small world theory can be formalized in terms of the diameter of graphs.
Let G be a graph where each person in the social group corresponds to a vertex, and an edge
is placed between vertices that correspond to people that know each other. The diameter
of G is the maximum distance between any two vertices in the graph. We can test whether
the small world theory holds for the social group represented by G by testing whether the
diameter of this graph is small. Parnas and Ron have shown that we can perform this task
in the property testing framework with a number of queries that is independent of the size
of the graph.
Theorem 10.5 [46] Let G be a graph with maximum degree d. There is an algorithm that queries Õ(1/ε^3) edges in G and with large probability (i) accepts if G has diameter at most D and (ii) rejects if every subgraph G′ obtained by removing at most an ε fraction of the vertices in G has diameter greater than 2D.
This result should look at least a little surprising at first glance: With a number of queries that is sublinear in n, we cannot even determine the distance between any two fixed vertices v, w in G. Therefore, any algorithm for testing the diameter of a graph must instead test G for some other global property that distinguishes graphs with diameter at most D from those that are far from having diameter at most 2D. That is what the algorithm in [46] does. Specifically, the algorithm estimates the expansion of G by sampling several start vertices, performing O(1/ε) steps of a breadth-first search from each of these vertices, and estimating the size of the neighborhoods around the vertices that have been sampled.
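The following Python sketch shows the neighborhood-exploration step. It is a sketch under stated assumptions: neighbors(u) is a hypothetical adjacency oracle for the bounded-degree graph, vertices is the list of vertices to sample from, and the number of start vertices and the search depth are arbitrary placeholders rather than the parameters from the analysis in [46].

```python
import random
from collections import deque


def bfs_ball_size(neighbors, v, depth):
    """Number of vertices within `depth` breadth-first-search steps of v."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, dist_u = frontier.popleft()
        if dist_u == depth:
            continue
        for w in neighbors(u):
            if w not in seen:
                seen.add(w)
                frontier.append((w, dist_u + 1))
    return len(seen)


def sample_ball_sizes(neighbors, vertices, depth, num_starts=50):
    """Sample the growth of neighborhoods around random start vertices.
    Consistently small balls suggest poor expansion, which the analysis
    of [46] connects to graphs that are far from having small diameter."""
    starts = [random.choice(vertices) for _ in range(num_starts)]
    return [bfs_ball_size(neighbors, s, depth) for s in starts]
```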
The algorithm for testing small-diameter graphs can also be used to estimate the diameter θ_diam of a graph G.
Corollary 10.2. Let G be a graph with n vertices and maximum degree d. For every γ > 1, there is an algorithm that queries at most Õ((1/ε^3) log_γ θ_diam) edges of G and outputs an estimate θ̂_diam such that with large probability θ̂_diam ≥ θ_diam and every subgraph G′ with at least (1 − ε)n nodes has diameter at least θ̂_diam/2γ.