Theorem 15.1. Let C be an arbitrary family of distributions and ε > 0. Let C_ε ⊆ C be an ε-cover of C of cardinality N. Then there is an algorithm that uses O((1/ε²) log N) samples from an unknown distribution p ∈ C and, with probability at least 9/10, outputs a distribution h ∈ C_ε that satisfies d_TV(h, p) ≤ 6ε.
An equivalent version of Theorem 15.1 (with slightly different terminology) was given by
Yatracos [74] (see also Chapter 7 of [27] for a detailed discussion). The above statement
appears as Lemma C.1 in [23].
As we explain in detail below, the algorithm implicit in the above theorem is not
computationally efficient in general. Indeed, even assuming that we have an explicit
construction of a minimal-size ε-cover, the algorithm takes time at least Ω(N/ε²), that is, exponential in its sample size.
We point out that the cover method can serve as a very useful tool in the design of
computationally efficient learning algorithms. Indeed, many algorithms in the literature
work by constructing a small set S of candidate hypotheses with the guarantee that at
least one of them is close to the target distribution. The cover method can be used as a
postprocessing step to efficiently select an appropriate candidate in the set S. This simple
idea has been used in the design of fast proper learning algorithms for various natural classes
of distributions, including sums of independent integer random variables [20,22], Gaussian
mixtures [24,63], and other high-dimensional distributions [25].
We now provide a brief intuitive explanation of the argument in [23] establishing
Theorem 15.1. (The corresponding proof of [27,74] is quite similar.) Given a description of the cover C_ε, the algorithm performs a tournament between the distributions in C_ε, by running a hypothesis testing routine for every pair of distributions in C_ε. The obvious implementation of this tournament takes time Ω(N²/ε²). Recent algorithmic work [24,63] has improved this to nearly linear in N, namely O((N log N)/ε²). However, this running time bound is still exponential in the sample complexity of the algorithm.
The hypothesis testing routine can be viewed as a simple competition between two
candidate hypothesis distributions. If at least one of the two candidate hypotheses is close
to the target distribution p, then with high probability over the samples drawn from p the
hypothesis testing routine selects as winner a candidate that is close to p. The algorithm
outputs a distribution in the cover C_ε that was never a loser (that is, won or tied against all
other distributions in the cover). We remark that the analysis of the algorithm is elementary,
relying only on the Chernoff bound and the union bound.
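To make this concrete, the following is a minimal sketch of such a pairwise (Scheffé-type) tournament for candidate distributions over a finite domain, represented as probability mass functions. The function name, the array representation, and the fallback rule when every candidate loses at least once are our own illustrative choices, not the exact routine of [23].

```python
import numpy as np

def scheffe_tournament(candidate_pmfs, samples, domain_size):
    """Hypothesis selection via a pairwise (Scheffe) tournament.

    candidate_pmfs: list of length-domain_size arrays, each a pmf on {0, ..., domain_size - 1}.
    samples: i.i.d. draws (integers in the domain) from the unknown target p.
    Returns the index of a candidate that never lost a pairwise comparison,
    falling back to the candidate with the fewest losses.
    """
    p_hat = np.bincount(np.asarray(samples, dtype=int), minlength=domain_size) / len(samples)
    n_cand = len(candidate_pmfs)
    losses = np.zeros(n_cand, dtype=int)

    for i in range(n_cand):
        for j in range(i + 1, n_cand):
            qi, qj = candidate_pmfs[i], candidate_pmfs[j]
            A = qi > qj                       # Scheffe set of the pair (i, j)
            err_i = abs(qi[A].sum() - p_hat[A].sum())
            err_j = abs(qj[A].sum() - p_hat[A].sum())
            if err_i < err_j:
                losses[j] += 1
            elif err_j < err_i:
                losses[i] += 1
            # Ties count against neither candidate.

    never_lost = np.flatnonzero(losses == 0)
    return int(never_lost[0]) if len(never_lost) else int(np.argmin(losses))
```

With N candidates this sketch performs Θ(N²) pairwise comparisons, which is precisely the quadratic cost that the algorithmic improvements of [24,63] mentioned above reduce to nearly linear in N.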
Another important property of the cover method is its noise tolerance. It generalizes
naturally, yielding an agnostic learning algorithm with the same sample complexity.
More specifically, for an arbitrary target distribution p with opt_C(p) = inf_{q∈C} d_TV(q, p), the tournament-based algorithm makes O((1/ε²) log N) i.i.d. draws from p and outputs a hypothesis h in C_ε satisfying d_TV(h, p) ≤ O(opt_C(p) + ε). The reader is referred to
Chapter 7.3 of [27] for an explicit proof of this fact.
The sample upper bound of O((1/ε²) log N) cannot be improved in general, in the sense that there exist distribution families for which it is information-theoretically optimal up to constant factors. In fact, Yang and Barron [73] showed that for many smooth nonparametric classes the metric entropy number characterizes the sample complexity of learning. We note, however, that metric entropy does not provide a characterization in general: there exist distribution families for which the O((1/ε²) log N) sample upper bound is suboptimal.
As a simple example, consider the set of all singleton distributions over [n], that is,
the class contains n distinct distributions each supported on a single point of the domain.
It is easy to see that Theorem 15.1 gives a sample upper bound of O((1/ε²) log n) for this
case, while one sample suffices to uniquely specify the target distribution. For a more
natural example, consider the class of Poisson binomial distributions (PBDs), that is,
sums ∑_{i=1}^n X_i of n mutually independent Bernoulli random variables X_1, ..., X_n. It is not difficult to show that the covering number of the set of PBDs is Ω(n/ε). Hence, Theorem 15.1 cannot give an upper bound better than Ω(1/ε²) · log n. On the other hand, a sample upper bound of O(1/ε²) was recently obtained in [22]. These examples raise the following natural
question.
Open Problem 15.1. Is there a complexity measure of a distribution class C that
characterizes the sample complexity of learning C?
We recall that the Vapnik–Chervonenkis dimension of a class of Boolean functions plays
such a role in Valiant’s PAC model [66], that is, it tightly characterizes the number of
examples that are required to PAC learn an arbitrary function from the class.
15.6 Learning Univariate Structured Distributions
In this section, we consider the problem of nonproper learning of an unknown univariate probability distribution, that is, a distribution with a density function p : Ω → R_+, where the sample space Ω is a subset of the real line. We focus on two basic cases: (1) Ω = [n], where the set [n] is viewed as an ordered set, and (2) Ω = [a, b] with a < b ∈ R. Given a family C of univariate distributions, can we design a sample-optimal and computationally efficient learning algorithm for C? Can we achieve this goal in the more challenging agnostic setting? It turns out that the answer to both questions is yes for broad classes of structured families C.
If the target distribution is arbitrary, the learning problem is well understood. More
specifically, suppose that the class C of target distributions is the set of all distributions over
[n]. It is a folklore fact that Θ(n/ε²) samples are necessary and sufficient for learning within
the total variation distance in this case. The underlying algorithm is also straightforward,
that is, output the empirical distribution. For distributions over very large domains, a linear
dependence on n is of course impractical, from both a running-time and a sample-complexity perspective.
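For concreteness, a minimal sketch of this folklore estimator, with [n] identified with {0, ..., n − 1} (the function name is ours):

```python
import numpy as np

def empirical_pmf(samples, n):
    """Learn an arbitrary distribution over {0, ..., n-1} by outputting the
    empirical distribution of the samples; Theta(n / eps^2) samples suffice
    for total variation error eps."""
    counts = np.bincount(np.asarray(samples, dtype=int), minlength=n)
    return counts / len(samples)
```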
For continuous distributions the learning problem is not solvable without any assump-
tions. Indeed, learning an arbitrary distribution over [0, 1] to any constant accuracy ε < 1
requires infinitely many samples. This follows, for example, from the aforementioned discrete
lower bound for n →∞. Hence, it is important to focus our attention on structured
distribution families.
In the main part of this section, we describe recent work from theoretical computer
science that yields sample-optimal and computationally efficient algorithms for learning
broad classes of structured distributions. The main idea of the approach is that the existence
of good piecewise polynomial approximations for a family C can be leveraged for the design
of efficient learning algorithms for C. The approach is inspired and motivated by classical
results in statistics and combines a variety of techniques from algorithms, probability, and
approximation theory.
Piecewise polynomials (splines) have been extensively used in statistics as tools for
inference tasks, including density estimation; see, for example, [61,62,71,72]. We remark that splines in statistics have been used in the context of the MLE, which is very different from the aforementioned approach. Moreover, the degree of the splines used in the statistics literature is typically bounded by a small constant.
In Section 15.6.1, we describe classical work in statistics on learning monotone densities
that served as an inspiration for the piecewise polynomial approach. In Section 15.6.2, we
describe how to use piecewise constant approximations for learning and argue why this approach is insufficient for some cases. Finally, in Section 15.6.3, we describe the general approach in
detail.
15.6.1 Learning Monotone Distributions
Monotonicity is arguably one of the simplest shape constraints. Learning a monotone density
was one of the first problems studied in this context by Grenander [38]. We present a result by Birgé [6,7], who gave a sample-optimal and computationally efficient algorithm for this problem. More specifically, Birgé showed the following.
Theorem 15.2 [6,7] Fix L, H > 0; let M be the set of nonincreasing densities p : [0, L] → [0, H]. There is a computationally efficient algorithm that, given m = O((1/ε³) log(1 + H · L)) samples from an arbitrary p ∈ M, outputs a hypothesis h satisfying d_TV(h, p) ≤ ε with probability at least 9/10. Moreover, Ω((1/ε³) log(1 + H · L)) samples are information-theoretically necessary for this problem.
An adaptation of the above theorem holds for monotone distributions over [n], yielding
an efficient algorithm with optimal sample complexity of O((1/ε³) log n) for the discrete
setting as well.
To sketch the proof of this theorem, we will need a few definitions. Given m independent
samples s_1, ..., s_m drawn from a density p : Ω → R_+, the empirical distribution p_m is the discrete distribution supported on {s_1, ..., s_m} defined as follows: for all z ∈ Ω, p_m(z) = |{j ∈ [m] | s_j = z}|/m.
For a measurable function f : I → R_+ and A ⊆ I, we will denote f(A) = ∫_A f(x) dx.
Definition 15.1. A function f : I → R is called a t-histogram if it is piecewise constant with at most t interval pieces. For a function f : I → R and an interval partition {I_1, ..., I_t} of the domain, the flattened version f̄ of f is the t-histogram defined by f̄(x) = f(I_j)/|I_j| for all x ∈ I_j.
Birgé's algorithm works as follows [7]: it partitions the domain into a set of intervals and outputs the flattened empirical distribution on those intervals. Its correctness relies on the following approximation lemma.
Lemma 15.1 [7] Fix L, H > 0. There exists a partition of [0, L] into t = O((1/ε) log(1 + H · L)) intervals such that for any p ∈ M it holds that d_TV(p̄, p) ≤ ε.
An analog of the lemma holds for discrete monotone distributions over [n], establishing a bound of t = O((1/ε) log n) on the number of intervals.
Note that the interval decomposition of the lemma is oblivious, in the sense that it
does not depend on the underlying monotone density. This is a very strong guarantee that
facilitates the learning algorithm. Indeed, given the guarantee of the lemma, the algorithm
is straightforward. The monotone learning problem is reduced to the problem of learning a
distribution over a known finite support of cardinality t = O((1/ε) log(1 + H · L)).
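To illustrate the two ingredients, an oblivious interval partition and the flattened empirical distribution, here is a minimal sketch. The geometric growth factor and the starting interval length ε/H are illustrative choices consistent with the O((1/ε) log(1 + H · L)) bound, not Birgé's exact constants, and the function names are ours.

```python
import numpy as np

def oblivious_partition(L, H, eps):
    """Oblivious partition of [0, L] for nonincreasing densities bounded by H:
    interval lengths start at eps / H and grow by a factor (1 + eps), giving
    O((1 / eps) * log(1 + H * L)) intervals."""
    endpoints = [0.0]
    length = eps / H
    while endpoints[-1] < L:
        endpoints.append(min(endpoints[-1] + length, L))
        length *= 1.0 + eps
    return np.array(endpoints)

def flattened_empirical(samples, endpoints):
    """Empirical mass of each interval; spreading probs[j] uniformly over the
    j-th interval gives the piecewise constant (histogram) density estimate."""
    counts, _ = np.histogram(np.asarray(samples), bins=endpoints)
    probs = counts / len(samples)
    return probs
```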
In summary, one can break Birgé's approach into two conceptual steps:
1. Prove that any monotone distribution is ε-close in total variation distance to a t-histogram distribution, where the parameter t is small.
2. Agnostically learn the target distribution using the class of t-histogram distributions as a hypothesis class.
This scheme is quite general and can be applied to any structured distribution class as
long as there exists a good piecewise constant approximation. In general, such a histogram
approximation may not be fixed for all distributions in the family. Indeed, this is the case
for most natural families of distributions. To handle this case, we need an agnostic learning
algorithm for t-histogram distributions with an unknown partition.
15.6.2 Agnostically Learning Histograms
In this section, we study the problem of agnostically learning t-histogram distributions
with an unknown partition. Formally, given a bound t on the number of intervals, we
want to design a computationally efficient algorithm that uses an optimal sample size
and approximates the target distribution nearly as accurately as the best t-histogram.
As sketched in the previous section, such an algorithm would have several applications
in learning classes of shape-restricted densities.
Denote by H_t the family of t-histogram distributions over [0, 1]. The first step is to
The first step is to
determine the optimal sample complexity of the learning problem. It is easy to see that
Ω(t/ε²) is a lower bound, and simple arguments can be used to get an upper bound of Õ(t/ε²) using the cover method described in Section 15.5.
The problem of agnostically learning t-histogram distributions with Õ(t/ε²) samples and poly(t/ε) time is algorithmically nontrivial. If one is willing to relax the sample size to O(t/ε³), it is easy to obtain a computationally efficient algorithm [13,21]. The first efficient algorithm with near-optimal sample complexity was obtained in [14] and is based on dynamic programming.
To sketch the algorithm in [14], we will need a more general metric between distributions that generalizes the total variation distance. Fix a family of subsets A over [0, 1]. We define the A-distance between p and q by ‖p − q‖_A := max_{A∈A} |p(A) − q(A)|. (Note that if A is the set of all measurable subsets of the domain, the A-distance is identical to the total variation distance.) The Vapnik–Chervonenkis (VC) dimension of A is the maximum size of a subset X ⊆ [0, 1] that is shattered by A (a set X is shattered by A if for every Y ⊆ X some A ∈ A satisfies A ∩ X = Y).
The VC inequality. Fix a family of subsets A over [n] of VC dimension d. The VC inequality is the following result from empirical process theory:

Theorem 15.3 [27, p. 31] Let p_m be an empirical distribution of m samples from p. Let A be a family of subsets of VC dimension d. Then

E[‖p − p_m‖_A] ≤ O(√(d/m)).

In other words, for m = Ω(d/ε²), with probability 9/10, the empirical distribution p_m will be ε-close to p in A-distance. We remark that this sample bound is asymptotically optimal (up to a constant factor) for all values of d and ε.
Let A_k be the collection of all subsets of the domain that can be expressed as unions of at most k (disjoint) intervals. The intuition is that the collection A_{2t} characterizes t-histograms in a precise way. Consider the following algorithm for agnostically learning a distribution p.
1. Draw m = Ω(t/ε²) samples from p.
2. Output the distribution h ∈ H_t that minimizes the quantity ‖h − p_m‖_{A_{2t}} (up to an additive error γ = O(ε)).
(We choose the domain to be [0, 1] for simplicity; all the results that we will describe extend straightforwardly to distributions over any interval or over a discrete set. We use the notation poly(x), x ∈ R_+, to denote a function that is bounded from above by a fixed-degree polynomial in x.)
It is not difficult to show that this is an agnostic learning algorithm for H_t. The main observation needed for the proof is that the A_{2t} distance between two t-histograms is identical to their total variation distance.
The algorithm in [14] uses a dynamic programming approach to efficiently perform
step (2) above, and its analysis relies on the VC inequality. More recently, a near-linear
time algorithm, that is, an algorithm with running time Õ(t/ε²), was developed in [15].
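To give a flavor of the kind of dynamic programming involved, the sketch below computes the A_k distance between two probability mass functions on a finite domain, using the classical O(nk) recurrence for the maximum total sum of at most k disjoint subarrays. This is only the distance computation that such algorithms need as a subroutine, not the minimization over H_t performed in [14,15], and the function names are ours.

```python
import numpy as np

def ak_distance(p, q, k):
    """A_k distance between pmfs p and q on {0, ..., n-1}: the maximum of
    |p(A) - q(A)| over unions A of at most k intervals."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return max(_max_k_disjoint_subarrays(d, k), _max_k_disjoint_subarrays(-d, k))

def _max_k_disjoint_subarrays(d, k):
    """Maximum total sum of at most k disjoint contiguous subarrays of d
    (choosing no subarray is allowed, so the result is always >= 0)."""
    best = [0.0] * (k + 1)            # best[j]: optimum on the prefix seen so far, <= j subarrays
    end = [float("-inf")] * (k + 1)   # end[j]: same, but the last subarray ends at the current index
    for x in d:
        for j in range(k, 0, -1):     # descending j so best[j-1] still refers to the previous prefix
            end[j] = x + max(end[j], best[j - 1])
            best[j] = max(best[j], end[j])
    return best[k]
```

Evaluating ak_distance(h, p_m, 2 * t) for a candidate t-histogram h and the empirical distribution p_m is exactly the objective of step (2) above.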
15.6.2.1 Applications to Learning Structured Distributions
The aforementioned agnostic learning algorithm has been used as the key algorithmic
ingredient to learn various classes of structured distributions. An additional ingredient
needed is a structural approximation result stating that for the underlying distribution
family C there exists an ε-approximation by t-histograms for an appropriately small value
of the parameter t. For example, by using the structural approximation results of [13], one
obtains near-sample-optimal and near-linear time estimators for various well-studied classes
including multimodal densities, monotone hazard rate (MHR) distributions, and others.
However, there exist distribution families where the approach of approximating by
histograms provably leads to suboptimal sample complexity. A prominent such example
is the class of log-concave distributions. This motivates the more general approach of
approximating by piecewise polynomials.
15.6.3 Agnostically Learning Piecewise Polynomials
We say that a distribution q over [0, 1] is a t-piecewise degree-d distribution if there is a
partition of [0, 1] into t disjoint intervals I_1, ..., I_t such that q(x) = q_j(x) for all x ∈ I_j, where each of q_1, ..., q_t is a univariate polynomial of degree at most d. Let P_{t,d} denote the class of all t-piecewise degree-d pdfs over [0, 1]. We have the following theorem.
Theorem 15.4 [14] Let p be any pdf over [0, 1]. There is an algorithm that, given t, d, ε, and Õ(t(d + 1)/ε²) samples from p, runs in time poly(t, d + 1, 1/ε) and with high probability outputs an O(t)-piecewise degree-d hypothesis h such that d_TV(p, h) ≤ O(opt_{t,d}) + ε, where opt_{t,d} := inf_{r∈P_{t,d}} d_TV(p, r) is the error of the best t-piecewise degree-d distribution for p.
It is shown in [14] that the number of samples used by the aforementioned algorithm is
information-theoretically optimal in all three parameters up to logarithmic factors.
The high-level approach to prove this theorem is similar to the one described in the
previous paragraph for the case of histograms. Let A_k be the collection of all subsets of the domain that can be expressed as unions of at most k = 2t(d + 1) intervals. The intuition is that the collection A_k characterizes piecewise polynomials with t pieces and degree d.
Similarly, the following is an agnostic learning algorithm for p.
1. Draw m = Ω(t(d + 1)/ε²) samples from p.
2. Output h ∈ P_{t,d} that minimizes the quantity ‖h − p_m‖_{A_k} (up to an additive error γ = O(ε)).
We remark that the optimization problem in step 2 is nonconvex. However, it has sufficient
structure so that (an appropriately relaxed version of) it can be solved in polynomial time
by a combination of convex programming and dynamic programming.
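As a simplified illustration only, the sketch below fits each piece by least squares to the empirical density on a fixed interval partition of [0, 1]. This is a stand-in showing what a piecewise degree-d hypothesis looks like; it is not the A_k-distance minimization (or the convex and dynamic programming machinery) of [14], and all names, as well as the grid resolution, are our own choices.

```python
import numpy as np

def fit_piecewise_poly(samples, endpoints, degree, grid_per_piece=50):
    """Least-squares fit of a degree-`degree` polynomial to the empirical density
    on each interval [endpoints[j], endpoints[j+1]) of a fixed partition of [0, 1].
    Returns one coefficient vector (highest degree first) per interval."""
    samples = np.asarray(samples)
    m = len(samples)
    coeffs = []
    for a, b in zip(endpoints[:-1], endpoints[1:]):
        bins = np.linspace(a, b, grid_per_piece + 1)
        counts, _ = np.histogram(samples, bins=bins)
        density = counts / (m * np.diff(bins))        # empirical density per grid cell
        centers = (bins[:-1] + bins[1:]) / 2
        coeffs.append(np.polyfit(centers, density, degree))
    return coeffs

def eval_piecewise_poly(x, endpoints, coeffs):
    """Evaluate the fitted piecewise polynomial hypothesis at the points x."""
    x = np.asarray(x)
    idx = np.clip(np.searchsorted(endpoints, x, side="right") - 1, 0, len(coeffs) - 1)
    return np.array([np.polyval(coeffs[i], xi) for i, xi in zip(idx, x)])
```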
15.6.3.1 Applications to Learning Structured Distributions
Theorem 15.4 yields near-sample-optimal and computationally efficient estimators for a
very broad class of structured distribution families, including arbitrary mixtures of natural