15.4.2 Aggregation of Structured Distributions
Aggregations of structured random variables are widely used because they can model a rich variety of phenomena. Two prominent examples are mixture models and sums of simple random variables. Mixtures of structured distributions have received much attention in statistics [51,57,64] and, more recently, in theoretical computer science [19,54].
We remark that early statistical work on mixture models focused on parameter learning. In practice, this problem is typically handled with nonconvex heuristics such as the expectation–maximization (EM) algorithm, sketched below. More recent algorithmic techniques rely on the method of moments and tensor decompositions; however, such algorithms lead to sample complexities that are inherently exponential in the number of components.
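For concreteness, the following is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture. It is illustrative only: the function name, initialization, and iteration count are choices made for this sketch, and, as noted above, EM is a nonconvex heuristic whose output depends on the initialization.

```python
import numpy as np

def em_two_gaussians(x, n_iters=100, seed=0):
    """Minimal EM sketch for a two-component 1-D Gaussian mixture.

    Illustrative only: a nonconvex local-search heuristic, so the
    result depends on the (random) initialization.
    """
    rng = np.random.default_rng(seed)
    w = np.array([0.5, 0.5])                   # mixing weights
    mu = rng.choice(x, size=2, replace=False)  # component means
    var = np.array([x.var(), x.var()])         # component variances
    for _ in range(n_iters):
        # E-step: posterior responsibility of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted sample.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-12
    return w, mu, var
```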
Learning sums of simple random variables has received recent attention in the computer
science literature [20,22]. Such distributions have various applications in areas such as survey
sampling, case-control studies, and survival analysis (see, e.g., [16] for the case of sums of
indicators).
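As a concrete instance, a sum of independent indicator variables is a Poisson binomial random variable. The sketch below (with function and parameter names of our choosing) samples from such a sum directly from its defining representation.

```python
import numpy as np

def sample_sum_of_indicators(p, n_samples, seed=0):
    """Draw samples of a sum of independent indicator variables.

    Entry p[i] is the success probability of the i-th Bernoulli
    indicator; the sum of the indicators is Poisson binomial.
    """
    rng = np.random.default_rng(seed)
    # Each row is one joint draw of all indicators; row sums are the samples.
    return (rng.random((n_samples, len(p))) < p).sum(axis=1)
```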
15.5 The Cover Method and Sample Bounds
The first fundamental question that arises in the context of learning an unknown probability
distribution is information-theoretic:
What is the minimum sample size that is necessary and sufficient to learn an unknown
p ∈ C up to total variation distance ε?
While this question has been extensively investigated in statistics, information theory,
and, more recently, computer science, the information-theoretically optimal sample size is
not yet understood, even for some relatively simple families of distributions. It turns out
that the optimal sample complexity depends on the structure of the underlying density class
in a subtle way.
In this section, we describe a general powerful method that yields nearly tight upper
bounds on the sample complexity of learning. The method, which we term the cover
method, is classical in statistics and information theory and has its roots in early work
of A. N. Kolmogorov. The high-level idea is to analyze the structure of the metric space
C under total variation distance. The method postulates that the structure of this metric
space characterizes the sample complexity of learning. To describe the method in detail, we introduce some basic terminology.
Let (X, d) be a metric space. Given δ > 0, a subset Y ⊆ X is said to be a δ-cover of X with respect to the metric d : X² → R⁺ if for every x ∈ X there exists some y ∈ Y such that d(x, y) ≤ δ. There may exist many δ-covers of X, but one is typically interested in those with minimum cardinality. The δ-covering number of (X, d) is the minimum cardinality of any δ-cover of X. Intuitively, the covering number captures the size of the metric space.
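To make the definition concrete, the following minimal sketch greedily constructs a δ-cover of a finite family of discrete distributions under total variation distance. The function name is ours, and the greedy procedure returns a valid δ-cover, not necessarily one of minimum cardinality.

```python
import numpy as np

def greedy_delta_cover(dists, delta):
    """Greedily select a delta-cover of a finite family of discrete
    distributions (rows of `dists`) under total variation distance.

    Returns indices of the chosen centers; the cover is valid but
    not necessarily of minimum cardinality.
    """
    tv = lambda p, q: 0.5 * np.abs(p - q).sum()  # total variation distance
    centers = []
    for i, p in enumerate(dists):
        # Keep p as a new center only if no existing center is delta-close.
        if all(tv(p, dists[j]) > delta for j in centers):
            centers.append(i)
    return centers
```

Because the centers returned are pairwise more than δ apart, the same routine simultaneously witnesses a δ-packing, reflecting the classical relationship between packing and covering numbers.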
Covering numbers—and their logarithms, known as metric entropy numbers—were first
defined by Kolmogorov in the 1950s and have since played a central role in a number of
areas, including approximation theory, geometric functional analysis (see, e.g., [8,31,53] and
[11,33,50,52]), information theory, statistics, and machine learning (see, e.g., [5,42,43,73,74]
and [27,65,68]).
In the context of distribution learning, the cover method is summarized in the following
theorem.