10
Something for (Almost) Nothing:
New Advances in Sublinear-Time Algorithms
Ronitt Rubinfeld and Eric Blais
CONTENTS
10.1 Introduction
    10.1.1 Clustering: An Illustrative Example
10.2 Testing Properties of Point Sets
    10.2.1 Clustering
    10.2.2 Surface Area
10.3 Testing Properties of Graphs
    10.3.1 Diameter
    10.3.2 Vertex Cover
10.4 Testing Properties of Functions
    10.4.1 Feature Selection
    10.4.2 Model Selection
    10.4.3 Data Quality
        10.4.3.1 Lipschitz Property
        10.4.3.2 Monotonicity
    10.4.4 Alternative Query and Sampling Models
10.5 Other Topics
References
10.1 Introduction
What computational problems can we solve when we only have time to look at a tiny
fraction of the data? This general question has been studied from many different angles
in statistics. More recently, with the proliferation of massive datasets, it has become a central question in computer science as well. Essentially, all of the research on
this question starts with a simple observation: Except for a very small number of special
cases, the only problems that can be solved in this very restrictive setting are those that
admit approximate solutions.
In this chapter, we focus on a formalization of approximate solutions that has been widely studied in an area of theoretical computer science known as property testing. Let X denote the underlying dataset, and consider the setting where this dataset represents a combinatorial object. Let P be any property of this type of combinatorial object. We say that X is ε-close to having property P if we can modify at most an ε fraction of X to obtain the description X′ of an object that does have property P; otherwise, we say that X
is ε-far from having the property. A randomized algorithm A is an ε-tester for P if it can distinguish with large constant probability* between datasets that represent objects with the property P and those that are ε-far from having the same property. (The algorithm A is free to output anything on inputs that do not have the property P but are also not ε-far from having this property; it is this leeway that will enable property testers to be so efficient.)
There is a close connection between property testing and the general problem of parameter estimation. Let X be a dataset and θ = θ(X) be any parameter of this dataset. For every threshold t, we can define the property of having θ ≤ t. If we have an efficient algorithm for testing this property, we can also use it to efficiently obtain an estimate θ̂ that is close to θ in the sense that θ̂ ≤ θ and the underlying object X is ε-close to another dataset X′ with θ(X′) = θ̂. Note that this notion of closeness is very different from the notions usually considered in parameter estimation; instead of measuring the quality of the estimate as a function L(θ, θ̂) of the true and estimated values of the parameter itself, here the quality of the estimate is a function of the underlying dataset.
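To make this reduction concrete, here is a minimal Python sketch of how a tester can be turned into such an estimator. It is a sketch under stated assumptions: tester(X, t, eps) is a hypothetical ε-tester for the property θ(X) ≤ t whose success probability has already been boosted enough to survive a union bound over all calls, and thresholds is an assumed finite grid of candidate values sorted in increasing order.

```python
def estimate_parameter(tester, X, thresholds, eps):
    """Turn a property tester for "theta(X) <= t" into an estimator.

    Scanning the candidate thresholds in increasing order, the smallest
    accepted t is (with high probability) no larger than the smallest
    candidate that is >= theta(X), since the tester accepts whenever
    theta(X) <= t.  Moreover, acceptance at t means (again with high
    probability) that X is not eps-far from the property, i.e., X is
    eps-close to some dataset X' with theta(X') <= t.
    """
    for t in thresholds:  # assumed sorted in increasing order
        if tester(X, t, eps):
            return t
    return None  # no candidate was accepted
```

A binary search over a sorted grid would reduce the number of tester invocations to the logarithm of the grid size; geometrically spaced grids of this kind are behind estimates such as Corollary 10.2 below.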
10.1.1 Clustering: An Illustrative Example
Let X be a set of n points in R^d. A fundamental question is how well the points in X can be clustered. This question can be formalized as follows. Let r > 0 be a fixed radius, and let θ_r be the minimum number of clusters of radius r that are required to capture all the points in X. The problem of determining θ_r exactly is NP-hard, as is the problem of estimating its value up to any constant factor [20,23,40]. But Alon et al. [1] show that the closely related problem of distinguishing datasets that can be clustered into at most k clusters of radius r from those that are far from having this property can be solved in time that is independent of the number of points in X.†
Theorem 10.1 [1] There is an algorithm that examines at most O((k · d)/ε) points in X and with large probability (i) accepts if the points in X can be grouped into at most k clusters of radius r and (ii) rejects if no subset of (1 − ε)n points in X can be grouped into k clusters.
The testing algorithm for clustering can be used to obtain an efficient estimator of the clustering parameter θ_r in the following sense.
Corollary 10.1. There is an algorithm that examines at most O((θ_r · d)/ε) points in X and returns a value θ̂_r such that with large probability θ̂_r ≤ θ_r and all but at most εn points in X can be partitioned into θ̂_r clusters of radius r.
We reiterate that, unlike in the standard parameter estimation setting, we have no guarantee on the closeness of θ_r and the estimate θ̂_r returned by the algorithm. Instead, we are guaranteed that most of the points can be clustered into θ̂_r clusters. This notion of approximation has a number of advantages.
* That is, with probability 1 − δ for some fixed constant δ < 1/2. With standard techniques, the success probability of a tester can easily be boosted to any arbitrary amount, so throughout this chapter, we will simply fix δ to be a small enough constant, say δ = 1/3, and write with large constant probability to mean with probability 1 − δ.
† Here and throughout the rest of the chapter, running times are in the model where samples or queries to X are completed in a constant amount of time. We will also use the tilde notation Õ(m) to hide polylog factors in m for clarity of presentation.
First, it is robust to noise. Adding a small number of outlier points can have a significant
impact on the clustering parameter θ
r
of a dataset. However, since the algorithm only
examines a constant number of points, with high probability, it does not observe any of the
outliers and so these noisy points do not affect the estimator. Furthermore, many of the
property testing algorithms can be made robust to higher noise rates, so that even if outliers
are observed, they do not affect the result by much. Similarly, the algorithm is appropriate when the underlying dataset is constantly changing; the same robustness means that the estimate changes smoothly as points are added to and removed from the dataset.
A second advantage is that it leads to extremely efficient algorithms. The algorithms are
so efficient that they can be used as a preprocessing step, even in situations where more
accurate estimates of the clustering parameter are required. This is the case, for example,
in settings where we are interested in learning how well the data can be clustered; here, the
testing algorithm can be used in a preprocessing step to quickly identify point sets that are
not even close to being clusterable into a reasonable number of clusters. When the preprocessing algorithm identifies such a point set, alternative learning algorithms can be applied to it instead.
The study of property testing was initiated in [6,26,55] and has been extremely active
ever since. In the rest of this chapter, we highlight some of the properties of point sets,
graphs, and functions that can be tested efficiently. Our goal is simply to provide a glimpse of
the results that can be obtained with the tools that have been developed in this research area.
More detailed discussions on these topics can be found in many other more comprehensive
surveys [12,24,34,50–54].
10.2 Testing Properties of Point Sets
A common machine learning problem involves extracting structural information from a set
X of points. As we have already seen in the introduction, sublinear-time algorithms can be
used to efficiently perform preprocessing tasks for these problems. In fact, these algorithms
have been shown to be useful in testing properties of both finite sets of points and (implicit
representations of) infinite sets of points. To illustrate the former, we revisit the clustering
problem in a bit more detail; for the latter, we show how sublinear-time algorithms can be
used to estimate the surface area of continuous sets.
10.2.1 Clustering
Let us return to the problem of clustering points. To illustrate the variety of formalizations
of this problem that are efficiently testable, let us now consider the setting where X is a
set of points in a general metric space and we bound the diameter of the clusters (i.e., the
maximum distance between any two points in the cluster) by some parameter d.
We can test if X can be clustered into at most k clusters of diameter d very efficiently
in the following way.
Theorem 10.2 [1] Let X be a set of points in a metric space. There is an algorithm that queries O((k log k)/ε) points in X and with large constant probability (i) accepts if X can be clustered into k clusters of diameter d and (ii) rejects if we cannot cluster the points into k clusters of diameter 2d even if we remove an ε fraction of the points in X.
The algorithm that achieves the bound in Theorem 10.2 maintains a set of cluster centers by repeatedly picking a sample point that is not covered by the previously chosen cluster centers. If more than k cluster centers are chosen by this process, then the algorithm rejects; otherwise, it accepts. The work of [1] shows that this simple algorithm satisfies the theorem’s requirements. Note that if the point set cannot be clustered into k clusters of diameter d, but can be clustered into k clusters of diameter 2d when less than an ε fraction of the points is thrown out, then it is OK for the algorithm to either accept or reject.
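To make the procedure concrete, here is a minimal Python sketch of this greedy sampling tester. It is a sketch under stated assumptions: sample_point is a hypothetical oracle that draws a uniformly random point of X, dist is the metric, and the leading constant in the sample size is an arbitrary placeholder rather than the constant from the analysis of [1].

```python
import math


def cluster_diameter_tester(sample_point, dist, k, d, eps):
    """Greedy sketch of a tester in the spirit of Theorem 10.2.

    Every sampled point that lies at distance > d from all current
    centers becomes a new center.  Any k + 1 centers are pairwise more
    than d apart, so by the pigeonhole principle they witness that X
    cannot be clustered into k clusters of diameter d.
    """
    num_samples = math.ceil(10 * k * math.log(k + 2) / eps)  # O((k log k)/eps)
    centers = []
    for _ in range(num_samples):
        p = sample_point()
        if all(dist(p, c) > d for c in centers):
            centers.append(p)
            if len(centers) > k:
                return False  # reject: k + 1 pairwise-far points found
    return True  # accept: samples are consistent with k clusters of diameter d
```

Note that the rejection witness is one-sided: k + 1 points that are pairwise more than d apart certify that no clustering into k clusters of diameter d exists, whereas acceptance is only a statistical guarantee about most of the points.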
In many situations, we might know that the points are not clusterable into k clusters, but
would still like to know if a very large fraction of the points can be clustered into k clusters.
Note that our standard definition of property testers is of no use in this tolerant version
of the clustering problem. The tolerant problem is in general more difficult, but several
tolerant clustering problems can also be solved with sublinear-time algorithms [12,48]. For
example, there is a tolerant analog of Theorem 10.2.
Theorem 10.3 [48] Given 0 < ε_1 < ε_2 < 1, let X be a set of points in a metric space. There is an algorithm that queries at most Õ(k/(ε_2 − ε_1)^2) points in X, and with large constant probability (i) accepts if at least a (1 − ε_1) fraction of X can be clustered into k clusters of diameter d, and (ii) rejects if we cannot cluster more than a (1 − ε_2) fraction of the points into k clusters of diameter 2d.
For more variants of the clustering problem considered in the property testing framework,
see, for example, [1,11,12,41,48].
10.2.2 Surface Area
The property testing framework can also be applied to settings where the dataset represents
a subset of a continuous space. Let S ⊆ [0, 1]^n be a subset of the unit n-dimensional hypercube, and consider any representation X of this subset for which algorithms can query an input x ∈ [0, 1]^n and observe whether x ∈ S or not.
A fundamental parameter of the set S is its surface area, defined by surf(S) = lim_{δ→0} |S^(δ/2)|/δ, where S^(δ/2) ⊆ [0, 1]^n is the set of points at distance at most δ/2 from the boundary of S. In the one-dimensional case, S ⊆ [0, 1] is a union of intervals and surf(S) corresponds to the number of disjoint intervals in S. And when n = 2, surf(S) corresponds to the perimeter of the set S.
Since the surface area of a set S can always be increased by an arbitrary amount while
modifying S itself on a set of measure 0, it is impossible to estimate the surface area of S
within any additive or multiplicative factor with any finite (or even countable) number of
queries. We can, however, test whether S has small surface area with a small number of
queries.
Theorem 10.4 [42] Fix any 0 < α < 1. There is an algorithm that queries at most O(1/ε) points and with large probability (i) accepts S when it has surface area at most α, and (ii) rejects S when every set S′ that differs from S on at most an ε measure has surface area surf(S′) > α.
Neeman’s theorem [42] was obtained by improving the original analysis of an algorithm due
to Kothari et al. [32], who were in turn building on previous work of Kearns and Ron [31]
and Balcan et al. [3]. The algorithm itself works in a way that is conceptually similar to
Buffon’s classic needle problem: It drops a (virtual) needle onto the space [0, 1]^n and queries the two endpoints of the needle. If exactly one of those endpoints is in S, then we know that the needle crossed the boundary of S. Estimating the probability of this event gives us
a measure of the noise sensitivity of S, and connections between the surface area and noise
sensitivity then lead to the proof of correctness of the algorithm.
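As an illustration, here is a minimal Monte Carlo sketch of this needle-dropping estimate in Python. It is a sketch under stated assumptions: member(x) is a hypothetical membership oracle for S, the needle length and the number of trials are arbitrary placeholders, and needles whose second endpoint leaves the cube are simply discarded instead of being handled carefully at the boundary.

```python
import math
import random


def estimate_needle_crossings(member, n, needle_len=0.01, trials=100_000):
    """Estimate the probability that a random needle of length
    `needle_len` dropped in [0, 1]^n crosses the boundary of S,
    i.e., that exactly one of its two endpoints lies in S."""
    crossings = 0
    kept = 0
    for _ in range(trials):
        x = [random.random() for _ in range(n)]
        # Pick a uniformly random direction by normalizing a Gaussian vector.
        g = [random.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in g))
        y = [xi + needle_len * gi / norm for xi, gi in zip(x, g)]
        if any(yi < 0.0 or yi > 1.0 for yi in y):
            continue  # simplification: discard needles leaving the cube
        kept += 1
        if member(x) != member(y):  # exactly one endpoint is in S
            crossings += 1
    return crossings / max(kept, 1)
```

The sampling itself is elementary; the substance of [42] lies in relating this crossing probability to the surface area of S, which is what turns the estimate into a tester.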
10.3 Testing Properties of Graphs
The setting in which the underlying dataset X represents a graph G is particularly well
suited for the design of sublinear-time algorithms. In this section, we explore two problems—
a connectivity problem and an optimization problem—to illustrate what sublinear-time
algorithms can do.
10.3.1 Diameter
The famous small world theory states that in any social group, every two members of the
group are connected to each other via a short chain of acquaintances [13]. In particular, the
six degrees of separation theory posits that everyone on Earth is connected by a chain of
acquaintances of length at most 6. (See [16] for an introduction to these and many other
interesting topics on graphs.) How efficiently can we test the small world theory on a given
social network?
The study of the small world theory can be formalized in terms of the diameter of graphs.
Let G be a graph where each person in the social group corresponds to a vertex, and an edge
is placed between vertices that correspond to people that know each other. The diameter
of G is the maximum distance between any two vertices in the graph. We can test whether
the small world theory holds for the social group represented by G by testing whether the
diameter of this graph is small. Parnas and Ron have shown that we can perform this task
in the property testing framework with a number of queries that is independent of the size
of the graph.
Theorem 10.5 [46] Let G be a graph with maximum degree d. There is an algorithm that queries Õ(1/ε^3) edges in G and with large probability (i) accepts if G has diameter at most D and (ii) rejects if every subgraph G′ obtained by removing at most an ε fraction of the vertices in G has diameter greater than 2D.
This result should look at least a little surprising at first glance: With a number of queries that is sublinear in n, we cannot even determine the distance between any two fixed vertices v, w in G. Therefore, any algorithm for testing the diameter of a graph must instead test G for some other global property that distinguishes graphs with diameter at most D from those that are far from having diameter at most 2D. That is what the algorithm in [46] does. Specifically, the algorithm estimates the expansion of G by sampling several start vertices, performing O(1/ε) steps of a breadth-first search from each of these vertices, and estimating the size of the neighborhoods around the vertices that have been sampled.
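The following Python sketch shows the neighborhood-exploration step. It is a sketch under stated assumptions: neighbors(u) is a hypothetical adjacency oracle for the bounded-degree graph, vertices is the list of vertices to sample from, and the number of start vertices and the search depth are arbitrary placeholders rather than the parameters from the analysis in [46].

```python
import random
from collections import deque


def bfs_ball_size(neighbors, v, depth):
    """Number of vertices within `depth` breadth-first-search steps of v."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, dist_u = frontier.popleft()
        if dist_u == depth:
            continue
        for w in neighbors(u):
            if w not in seen:
                seen.add(w)
                frontier.append((w, dist_u + 1))
    return len(seen)


def sample_ball_sizes(neighbors, vertices, depth, num_starts=50):
    """Sample the growth of neighborhoods around random start vertices.
    Consistently small balls suggest poor expansion, which the analysis
    of [46] connects to graphs that are far from having small diameter."""
    starts = [random.choice(vertices) for _ in range(num_starts)]
    return [bfs_ball_size(neighbors, s, depth) for s in starts]
```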
The algorithm for testing small-diameter graphs can also be used to estimate the diameter θ_diam of a graph G.
Corollary 10.2. Let G be a graph with n vertices and maximum degree d. For every γ > 1, there is an algorithm that queries at most Õ((1/ε^3) log_γ θ_diam) edges of G and outputs an estimate θ̂_diam such that with large probability θ̂_diam ≥ θ_diam and every subgraph G′ with at least (1 − ε)n nodes has diameter at least θ̂_diam/2γ.