158 Handbook of Big Data
The algorithm that achieves the bound in Theorem 10.1 maintains a set of cluster centers
by repeatedly picking a sample point that is not covered by previously chosen cluster
centers. If more than k cluster centers are chosen by this process, then the algorithm rejects,
otherwise it accepts. The work of [1] shows that this simple algorithm satisfies the theorem’s
requirements. Note that if the point set cannot be clustered into k clusters of diameter d,
but can be clustered into k clusters of diameter 2d if less than fraction of the points are
thrown out, then it is OK for the algorithm to either accept or reject.
In many situations, we might know that the points are not clusterable into k clusters, but
would still like to know if a very large fraction of the points can be clustered into k clusters.
Note that our standard definition of property testers is of no use in this tolerant version
of the clustering problem. The tolerant problem is in general more difficult, but several
tolerant clustering problems can also be solved with sublinear-time algorithms [12,48]. For
example, there is a tolerant analog of Theorem 10.2.
Theorem 10.3 [48] Given 0 <
1
<
2
< 1.LetX be a set of points in a metric space.
There is an algorithm that queries at most
˜
O(k/(
2
−
1
)
2
) points in X,andwithlarge
constant probability (i) accepts if at least (1 −
1
) fraction of X can be clustered into k
clusters of diameter d, and (ii) rejects if we cannot cluster more than (1 −
2
) fraction of
the points into k clusters of diameter 2d.
For more variants of the clustering problem considered in the property testing framework,
see, for example, [1,11,12,41,48].
10.2.2 Surface Area
The property testing framework can also be applied to settings where the dataset represents
a subset of a continuous space. Let S ⊆ [0, 1]
n
be a subset of the unit n-dimensional
hypercube, and consider any representation X of this subset for which algorithms can query
an input x ∈ [0, 1]
n
and observe whether x ∈ S or not.
A fundamental parameter of the set S is its surface area, defined by surf(S)=
lim
δ→0
S
(δ/2)
/δ,whereS
(δ/2)
⊆ [0, 1]
n
is the set of points at distance at most δ/2
from the boundary of S. In the one-dimensional case, S ⊆ [0, 1] is a union of intervals and
surf(S) corresponds to the number of disjoint intervals in S.Andwhenn =2,surf(S)
corresponds to the perimeter of the set S.
Since the surface area of a set S can always be increased by an arbitrary amount while
modifying S itself on a set of measure 0, it is impossible to estimate the surface area of S
within any additive or multiplicative factor with any finite (or even countable) number of
queries. We can, however, test whether S has small surface area with a small number of
queries.
Theorem 10.4 [42] Fix any 0 < α < 1. There is an algorithm that queries at most
O(1/) points and with large probability (i) accepts S when it has surface area at most α,
and (ii) rejects S when every set S
that differs from S on at most an measure has surface
area surf(S
) > α.
Neeman’s theorem [42] was obtained by improving the original analysis of an algorithm due
to Kothari et al. [32], who were in turn building on previous work of Kearns and Ron [31]
and Balcan et al. [3]. The algorithm itself works in a way that is conceptually similar to
Buffon’s classic needle problem: It drops a (virtual) needle onto the space [0, 1]
n
and queries
the two endpoints of the needles. If exactly one of those endpoints is in S, then we know
that the needle crossed the boundary of S. Computing the probability of this event gives us