Theorem 15.1. Let $\mathcal{C}$ be an arbitrary family of distributions and $\epsilon > 0$. Let $\mathcal{C}_\epsilon \subseteq \mathcal{C}$ be an $\epsilon$-cover of $\mathcal{C}$ of cardinality $N$. Then there is an algorithm that uses $O(\epsilon^{-2} \log N)$ samples from an unknown distribution $p \in \mathcal{C}$ and, with probability at least $9/10$, outputs a distribution $h \in \mathcal{C}_\epsilon$ that satisfies $d_{\mathrm{TV}}(h, p) \le 6\epsilon$.
An equivalent version of Theorem 15.1 (with slightly different terminology) was given by
Yatracos [74] (see also Chapter 7 of [27] for a detailed discussion). The above statement
appears as Lemma C.1 in [23].
As we explain in detail below, the algorithm implicit in the above theorem is not
computationally efficient in general. Indeed, even assuming that we have an explicit
construction of a minimal-size $\epsilon$-cover, the algorithm takes time at least $\Omega(N/\epsilon^2)$, that is, exponential in its sample size.
We point out that the cover method can serve as a very useful tool in the design of
computationally efficient learning algorithms. Indeed, many algorithms in the literature
work by constructing a small set S of candidate hypotheses with the guarantee that at
least one of them is close to the target distribution. The cover method can be used as a
postprocessing step to efficiently select an appropriate candidate in the set S. This simple
idea has been used in the design of fast proper learning algorithms for various natural classes
of distributions, including sums of independent integer random variables [20,22], Gaussian
mixtures [24,63], and other high-dimensional distributions [25].
We now provide a brief intuitive explanation of the argument in [23] establishing Theorem 15.1. (The corresponding proofs in [27,74] are quite similar.) Given a description of the cover $\mathcal{C}_\epsilon$, the algorithm performs a tournament between the distributions in $\mathcal{C}_\epsilon$, by running a hypothesis testing routine for every pair of distributions in $\mathcal{C}_\epsilon$. The obvious implementation of this tournament takes time $\Omega(N^2/\epsilon^2)$. Recent algorithmic work [24,63] has improved this to nearly linear in $N$, namely $O(N \log N/\epsilon^2)$. However, this running time bound is still exponential in the sample complexity of the algorithm.
The hypothesis testing routine can be viewed as a simple competition between two
candidate hypothesis distributions. If at least one of the two candidate hypotheses is close
to the target distribution p, then with high probability over the samples drawn from p the
hypothesis testing routine selects as winner a candidate that is close to p. The algorithm
outputs a distribution in the cover $\mathcal{C}_\epsilon$ that was never a loser (that is, won or tied against all
other distributions in the cover). We remark that the analysis of the algorithm is elementary,
relying only on the Chernoff bound and the union bound.
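
To make the tournament concrete, the following is a minimal Python sketch for candidate distributions given explicitly as probability vectors over a finite domain. The names scheffe_test and tournament, the vector representation of the cover, and the exact threshold constants are our illustrative assumptions (the routine in [23] is based on the same Scheffé-set comparison, but its constants and interface differ); the sketch implements the naive $\Omega(N^2/\epsilon^2)$-time tournament discussed above, not the nearly linear-time variants of [24,63].

    import numpy as np

    def scheffe_test(p1, p2, samples, eps):
        # Pairwise competition between candidate distributions p1 and p2,
        # given as probability vectors over the domain {0, ..., n-1}.
        # samples: i.i.d. draws from the unknown target distribution p.
        # Returns +1 if p1 wins, -1 if p2 wins, 0 for a draw.
        A = p1 > p2                            # Scheffe set A = {x : p1(x) > p2(x)}
        tau1, tau2 = p1[A].sum(), p2[A].sum()  # tau1 - tau2 equals d_TV(p1, p2)
        if tau1 - tau2 <= 4 * eps:             # candidates nearly coincide: draw
            return 0
        mu_hat = np.mean(A[samples])           # empirical mass of A under the target
        if mu_hat >= tau1 - 1.5 * eps:         # target's mass on A matches p1's
            return +1
        if mu_hat <= tau2 + 1.5 * eps:         # target's mass on A matches p2's
            return -1
        return 0

    def tournament(cover, samples, eps):
        # Output a hypothesis in the cover that never loses a pairwise test.
        for i, h in enumerate(cover):
            if all(scheffe_test(h, g, samples, eps) != -1
                   for j, g in enumerate(cover) if j != i):
                return h
        return None  # with high probability this point is never reached

For instance, with cover = [np.array([1.0, 0.0]), np.array([0.5, 0.5])], samples drawn from the first distribution, and a small eps, the first candidate wins its pairwise test and the second loses, so tournament returns the first vector.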
Another important property of the cover method is its noise tolerance: it generalizes naturally, yielding an agnostic learning algorithm with the same sample complexity.
More specifically, for an arbitrary target distribution $p$ with $\mathrm{opt}_{\mathcal{C}}(p) = \inf_{q \in \mathcal{C}} d_{\mathrm{TV}}(q, p)$, the tournament-based algorithm makes $O(\epsilon^{-2} \log N)$ i.i.d. draws from $p$ and outputs a hypothesis $h \in \mathcal{C}_\epsilon$ satisfying $d_{\mathrm{TV}}(h, p) \le O(\mathrm{opt}_{\mathcal{C}}(p) + \epsilon)$. The reader is referred to Chapter 7.3 of [27] for an explicit proof of this fact.
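As a rough indication of where the $O(\epsilon^{-2} \log N)$ sample bound comes from (a back-of-the-envelope calculation, not the formal analysis of [23]): each pairwise test only requires the empirical mass of a single fixed set $A_{ij}$ to be accurate to within $\epsilon/2$, so the Chernoff--Hoeffding bound together with a union bound over the at most $\binom{N}{2}$ pairs gives
\[
\Pr\Big[\exists\,(i,j)\colon \big|\widehat{\mu}(A_{ij}) - p(A_{ij})\big| > \epsilon/2\Big] \;\le\; \binom{N}{2} \cdot 2e^{-m\epsilon^2/2},
\]
which is at most $1/10$ once the number of samples is $m = C\epsilon^{-2}\log N$ for a sufficiently large constant $C$.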
The sample upper bound of $O(\epsilon^{-2} \log N)$ cannot be improved in general, in the sense that there exist distribution families for which it is information-theoretically optimal up to constant factors. In fact, Yang and Barron [73] showed that for many smooth nonparametric classes the metric entropy number characterizes the sample complexity of learning. We note, however, that metric entropy does not provide a characterization in general: there exist distribution families for which the $O(\epsilon^{-2} \log N)$ sample upper bound is suboptimal.
As a simple example, consider the set of all singleton distributions over [n], that is,
the class contains n distinct distributions each supported on a single point of the domain.
It is easy to see that Theorem 15.1 gives a sample upper bound of $O(\epsilon^{-2} \log n)$ for this case, while one sample suffices to uniquely specify the target distribution. For a more natural example, consider the class of Poisson binomial distributions (PBDs), that is,