Theorem 15.1. Let C be an arbitrary family of distributions and ε > 0. Let C_ε ⊆ C be an ε-cover of C of cardinality N. Then there is an algorithm that uses O((1/ε²) log N) samples from an unknown distribution p ∈ C and, with probability at least 9/10, outputs a distribution h ∈ C_ε that satisfies d_TV(h, p) ≤ 6ε.
An equivalent version of Theorem 15.1 (with slightly different terminology) was given by
Yatracos [74] (see also Chapter 7 of [27] for a detailed discussion). The above statement
appears as Lemma C.1 in [23].
As we explain in detail below, the algorithm implicit in the above theorem is not
computationally efficient in general. Indeed, even assuming that we have an explicit
construction of a minimal-size ε-cover, the algorithm takes time at least Ω(N/ε²), that is, exponential in its sample size.
We point out that the cover method can serve as a very useful tool in the design of
computationally efficient learning algorithms. Indeed, many algorithms in the literature
work by constructing a small set S of candidate hypotheses with the guarantee that at
least one of them is close to the target distribution. The cover method can be used as a
postprocessing step to efficiently select an appropriate candidate in the set S. This simple
idea has been used in the design of fast proper learning algorithms for various natural classes
of distributions, including sums of independent integer random variables [20,22], Gaussian
mixtures [24,63], and other high-dimensional distributions [25].
We now provide a brief intuitive explanation of the argument in [23] establishing
Theorem 15.1. (The corresponding proof of [27,74] is quite similar.) Given a description of the cover C_ε, the algorithm performs a tournament between the distributions in C_ε, by running a hypothesis testing routine for every pair of distributions in C_ε. The obvious implementation of this tournament takes time Ω(N²/ε²). Recent algorithmic work [24,63] has improved this to nearly linear in N, namely O((N log N)/ε²). However, this running time bound is still exponential in the sample complexity of the algorithm.
The hypothesis testing routine can be viewed as a simple competition between two
candidate hypothesis distributions. If at least one of the two candidate hypotheses is close
to the target distribution p, then with high probability over the samples drawn from p the
hypothesis testing routine selects as winner a candidate that is close to p. The algorithm
outputs a distribution in the cover C_ε that was never a loser (that is, won or tied against all
other distributions in the cover). We remark that the analysis of the algorithm is elementary,
relying only on the Chernoff bound and the union bound.
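To make this concrete, the following is a minimal sketch of such a pairwise (Scheffé-type) tournament for candidate distributions over a finite domain, represented as probability mass functions. The function name, the array representation, and the fallback rule when every candidate loses at least once are our own illustrative choices, not the exact routine of [23].

```python
import numpy as np

def scheffe_tournament(candidate_pmfs, samples, domain_size):
    """Hypothesis selection via a pairwise (Scheffe) tournament.

    candidate_pmfs: list of length-domain_size arrays, each a pmf on {0, ..., domain_size - 1}.
    samples: i.i.d. draws (integers in the domain) from the unknown target p.
    Returns the index of a candidate that never lost a pairwise comparison,
    falling back to the candidate with the fewest losses.
    """
    p_hat = np.bincount(np.asarray(samples, dtype=int), minlength=domain_size) / len(samples)
    n_cand = len(candidate_pmfs)
    losses = np.zeros(n_cand, dtype=int)

    for i in range(n_cand):
        for j in range(i + 1, n_cand):
            qi, qj = candidate_pmfs[i], candidate_pmfs[j]
            A = qi > qj                       # Scheffe set of the pair (i, j)
            err_i = abs(qi[A].sum() - p_hat[A].sum())
            err_j = abs(qj[A].sum() - p_hat[A].sum())
            if err_i < err_j:
                losses[j] += 1
            elif err_j < err_i:
                losses[i] += 1
            # Ties count against neither candidate.

    never_lost = np.flatnonzero(losses == 0)
    return int(never_lost[0]) if len(never_lost) else int(np.argmin(losses))
```

With N candidates this sketch performs Θ(N²) pairwise comparisons, which is precisely the quadratic cost that the algorithmic improvements of [24,63] mentioned above reduce to nearly linear in N.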
Another important property of the cover method is its noise tolerance. It generalizes
naturally, yielding an agnostic learning algorithm with the same sample complexity.
More specifically, for an arbitrary target distribution p with opt_C(p) = inf_{q∈C} d_TV(q, p), the tournament-based algorithm makes O((1/ε²) log N) i.i.d. draws from p and outputs a hypothesis h in C_ε satisfying d_TV(h, p) ≤ O(opt_C(p) + ε). The reader is referred to
Chapter 7.3 of [27] for an explicit proof of this fact.
The sample upper bound of O((1/ε²) log N) cannot be improved in general, in the sense that there exist distribution families for which it is information-theoretically optimal up to constant factors. In fact, Yang and Barron [73] showed that for many smooth nonparametric classes the metric entropy number characterizes the sample complexity of learning. We note, however, that metric entropy does not provide a characterization in general: there exist distribution families for which the O((1/ε²) log N) sample upper bound is suboptimal.
As a simple example, consider the set of all singleton distributions over [n], that is,
the class contains n distinct distributions each supported on a single point of the domain.
It is easy to see that Theorem 15.1 gives a sample upper bound of O((1/ε²) log n) for this
case, while one sample suffices to uniquely specify the target distribution. For a more
natural example, consider the class of Poisson binomial distributions (PBDs), that is,
sums ∑_{i=1}^n X_i of n mutually independent Bernoulli random variables X_1, ..., X_n. It is not difficult to show that the covering number of the set of PBDs is Ω(n/ε). Hence, Theorem 15.1 cannot give an upper bound better than Ω(1/ε²) · log n. On the other hand, a sample upper bound of O(1/ε²) was recently obtained in [22]. These examples raise the following natural
question.
Open Problem 15.1. Is there a complexity measure of a distribution class C that
characterizes the sample complexity of learning C?
We recall that the Vapnik–Chervonenkis dimension of a class of Boolean functions plays
such a role in Valiant’s PAC model [66], that is, it tightly characterizes the number of
examples that are required to PAC learn an arbitrary function from the class.
15.6 Learning Univariate Structured Distributions
In this section, we consider the problem of nonproper learning of an unknown univariate probability distribution, that is, a distribution with a density function p : Ω → R_+, where the sample space Ω is a subset of the real line. We focus on two basic cases: (1) Ω = [n], where the set [n] is viewed as an ordered set, and (2) Ω = [a, b] with a < b ∈ R. Given a family C of univariate distributions, can we design a sample-optimal and computationally efficient learning algorithm for C? Can we achieve this goal in the more challenging agnostic setting? It turns out that the answer to both questions is yes for broad classes of structured families C.
If the target distribution is arbitrary, the learning problem is well understood. More
specifically, suppose that the class C of target distributions is the set of all distributions over
[n]. It is a folklore fact that Θ(n/ε²) samples are necessary and sufficient for learning within
the total variation distance in this case. The underlying algorithm is also straightforward,
that is, output the empirical distribution. For distributions over very large domains, a linear
dependence on n is of course impractical, from both a running-time and a sample-complexity perspective.
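For concreteness, a minimal sketch of this folklore estimator, with [n] identified with {0, ..., n − 1} (the function name is ours):

```python
import numpy as np

def empirical_pmf(samples, n):
    """Learn an arbitrary distribution over {0, ..., n-1} by outputting the
    empirical distribution of the samples; Theta(n / eps^2) samples suffice
    for total variation error eps."""
    counts = np.bincount(np.asarray(samples, dtype=int), minlength=n)
    return counts / len(samples)
```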
For continuous distributions the learning problem is not solvable without any assump-
tions. Indeed, learning an arbitrary distribution over [0, 1] to any constant accuracy ε < 1
requires infinitely many samples. This follows, for example, from the aforementioned discrete
lower bound for n →∞. Hence, it is important to focus our attention on structured
distribution families.
In the main part of this section, we describe recent work from theoretical computer
science that yields sample-optimal and computationally efficient algorithms for learning
broad classes of structured distributions. The main idea of the approach is that the existence
of good piecewise polynomial approximations for a family C can be leveraged for the design
of efficient learning algorithms for C. The approach is inspired and motivated by classical
results in statistics and combines a variety of techniques from algorithms, probability, and
approximation theory.
Piecewise polynomials (splines) have been extensively used in statistics as tools for
inference tasks, including density estimation; see, for example, [61,62,71,72]. We remark that splines in statistics have been used in the context of the MLE, which is very different from the aforementioned approach. Moreover, the degree of the splines used in the statistics literature is typically bounded by a small constant.
In Section 15.6.1, we describe classical work in statistics on learning monotone densities
that served as an inspiration for the piecewise polynomial approach. In Section 15.6.2, we
describe how to use piecewise constant approximations for learning and argue why this approach is insufficient for some cases. Finally, in Section 15.6.3, we describe the general approach in
detail.
15.6.1 Learning Monotone Distributions
Monotonicity is arguably one of the simplest shape constraints. Learning a monotone density
was one of the first problems studied in this context by Grenander [38]. We present a result by Birgé [6,7], who gave a sample-optimal and computationally efficient algorithm for this problem. More specifically, Birgé showed the following.
Theorem 15.2 [6,7] Fix L, H > 0; let M be the set of nonincreasing densities p : [0, L] → [0, H]. There is a computationally efficient algorithm that, given m = O((1/ε³) log(1 + H · L)) samples from an arbitrary p ∈ M, outputs a hypothesis h satisfying d_TV(h, p) ≤ ε with probability at least 9/10. Moreover, Ω((1/ε³) log(1 + H · L)) samples are information-theoretically necessary for this problem.
An adaptation of the above theorem holds for monotone distributions over [n], yielding
an efficient algorithm with optimal sample complexity of O((1/ε³) log n) for the discrete
setting as well.
To sketch the proof of this theorem, we will need a few definitions. Given m independent
samples s_1, ..., s_m drawn from a density p : Ω → R_+, the empirical distribution p_m is the discrete distribution supported on {s_1, ..., s_m} defined as follows: for all z ∈ Ω, p_m(z) = |{j ∈ [m] | s_j = z}|/m.
For a measurable function f : I → R_+ and A ⊆ I, we will denote f(A) = ∫_A f(x) dx.
Definition 15.1. A function f : I → R is called a t-histogram if it is piecewise constant with at most t interval pieces. For a function f : I → R and an interval partition {I_1, ..., I_t} of the domain, the flattened version f̄ of f is the t-histogram defined by f̄(x) = f(I_j)/|I_j| for all x ∈ I_j.
Birgé's algorithm works as follows [7]: it partitions the domain into a set of intervals and outputs the flattened empirical distribution on those intervals. Its correctness relies on the following approximation lemma.
Lemma 15.1 [7] Fix L, H > 0. There exists a partition of [0, L] into t = O((1/ε) log(1 + H · L)) intervals such that for any p ∈ M it holds that d_TV(p̄, p) ≤ ε.
An analog of the lemma holds for discrete monotone distributions over [n], establishing a bound of t = O((1/ε) log n) on the number of intervals.
Note that the interval decomposition of the lemma is oblivious, in the sense that it
does not depend on the underlying monotone density. This is a very strong guarantee that
facilitates the learning algorithm. Indeed, given the guarantee of the lemma, the algorithm
is straightforward. The monotone learning problem is reduced to the problem of learning a
distribution over a known finite support of cardinality t = O((1/ε) log(1 + H · L)).
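To illustrate the two ingredients, an oblivious interval partition and the flattened empirical distribution, here is a minimal sketch. The geometric growth factor and the starting interval length ε/H are illustrative choices consistent with the O((1/ε) log(1 + H · L)) bound, not Birgé's exact constants, and the function names are ours.

```python
import numpy as np

def oblivious_partition(L, H, eps):
    """Oblivious partition of [0, L] for nonincreasing densities bounded by H:
    interval lengths start at eps / H and grow by a factor (1 + eps), giving
    O((1 / eps) * log(1 + H * L)) intervals."""
    endpoints = [0.0]
    length = eps / H
    while endpoints[-1] < L:
        endpoints.append(min(endpoints[-1] + length, L))
        length *= 1.0 + eps
    return np.array(endpoints)

def flattened_empirical(samples, endpoints):
    """Empirical mass of each interval; spreading probs[j] uniformly over the
    j-th interval gives the piecewise constant (histogram) density estimate."""
    counts, _ = np.histogram(np.asarray(samples), bins=endpoints)
    probs = counts / len(samples)
    return probs
```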
In summary, one can break Birgé's approach into two conceptual steps:
1. Prove that any monotone distribution is ε-close in total variation distance to a t-histogram distribution, where the parameter t is small.
2. Agnostically learn the target distribution using the class of t-histogram distributions as a hypothesis class.
This scheme is quite general and can be applied to any structured distribution class as
long as there exists a good piecewise constant approximation. In general, such a histogram
approximation may not be fixed for all distributions in the family. Indeed, this is the case
for most natural families of distributions. To handle this case, we need an agnostic learning
algorithm for t-histogram distributions with an unknown partition.
15.6.2 Agnostically Learning Histograms
In this section, we study the problem of agnostically learning t-histogram distributions
with an unknown partition. Formally, given a bound t on the number of intervals, we
want to design a computationally efficient algorithm that uses an optimal sample size
and approximates the target distribution nearly as accurately as the best t-histogram.
As sketched in the previous section, such an algorithm would have several applications
in learning classes of shape-restricted densities.
Denote by H_t the family of t-histogram distributions over [0, 1]. The first step is to
The first step is to
determine the optimal sample complexity of the learning problem. It is easy to see that
Ω(t/ε²) is a lower bound, and simple arguments can be used to get an upper bound of Õ(t/ε²) using the cover method described in Section 15.5.
The problem of agnostically learning t-histogram distributions with Õ(t/ε²) samples and poly(t/ε) time is algorithmically nontrivial. If one is willing to relax the sample size to O(t/ε³), it is easy to obtain a computationally efficient algorithm [13,21]. The first efficient algorithm with near-optimal sample complexity was obtained in [14] and is based on dynamic programming.
To sketch the algorithm in [14], we will need a more general metric between distributions that generalizes the total variation distance. Fix a family of subsets A over [0, 1]. We define the A-distance between p and q by ‖p − q‖_A := max_{A∈A} |p(A) − q(A)|. (Note that if A is the set of all measurable subsets of the domain, the A-distance is identical to the total variation distance.) The Vapnik–Chervonenkis (VC) dimension of A is the maximum size of a subset X ⊆ [0, 1] that is shattered by A (a set X is shattered by A if for every Y ⊆ X some A ∈ A satisfies A ∩ X = Y).
The VC inequality. Fix a family of subsets A over [n] of VC dimension d. The VC inequality is the following result from empirical process theory:

Theorem 15.3 [27, p. 31] Let p_m be an empirical distribution of m samples from p. Let A be a family of subsets of VC dimension d. Then

E[‖p − p_m‖_A] ≤ O(√(d/m)).

In other words, for m = Ω(d/ε²), with probability 9/10, the empirical distribution p_m will be ε-close to p in A-distance. We remark that this sample bound is asymptotically optimal (up to a constant factor) for all values of d and ε.
Let A_k be the collection of all subsets of the domain that can be expressed as unions of at most k (disjoint) intervals. The intuition is that the collection A_{2t} characterizes t-histograms in a precise way. Consider the following algorithm for agnostically learning a distribution p.
1. Draw m = Ω(t/ε²) samples from p.
2. Output the distribution h ∈ H_t that minimizes the quantity ‖h − p_m‖_{A_{2t}} (up to an additive error γ = O(ε)).
(We choose the domain to be [0, 1] for simplicity; all the results that we will describe extend straightforwardly to distributions over any interval or over a discrete set. We use the notation poly(x), x ∈ R_+, to denote a function that is bounded from above by a fixed-degree polynomial in x.)
It is not difficult to show that this is an agnostic learning algorithm for H_t. The main observation needed for the proof is that the A_{2t} distance between two t-histograms is identical to their total variation distance.
The algorithm in [14] uses a dynamic programming approach to efficiently perform
step (2) above, and its analysis relies on the VC inequality. More recently, a near-linear
time algorithm, that is, an algorithm with running time Õ(t/ε²), was developed in [15].
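To give a flavor of the kind of dynamic programming involved, the sketch below computes the A_k distance between two probability mass functions on a finite domain, using the classical O(nk) recurrence for the maximum total sum of at most k disjoint subarrays. This is only the distance computation that such algorithms need as a subroutine, not the minimization over H_t performed in [14,15], and the function names are ours.

```python
import numpy as np

def ak_distance(p, q, k):
    """A_k distance between pmfs p and q on {0, ..., n-1}: the maximum of
    |p(A) - q(A)| over unions A of at most k intervals."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return max(_max_k_disjoint_subarrays(d, k), _max_k_disjoint_subarrays(-d, k))

def _max_k_disjoint_subarrays(d, k):
    """Maximum total sum of at most k disjoint contiguous subarrays of d
    (choosing no subarray is allowed, so the result is always >= 0)."""
    best = [0.0] * (k + 1)            # best[j]: optimum on the prefix seen so far, <= j subarrays
    end = [float("-inf")] * (k + 1)   # end[j]: same, but the last subarray ends at the current index
    for x in d:
        for j in range(k, 0, -1):     # descending j so best[j-1] still refers to the previous prefix
            end[j] = x + max(end[j], best[j - 1])
            best[j] = max(best[j], end[j])
    return best[k]
```

Evaluating ak_distance(h, p_m, 2 * t) for a candidate t-histogram h and the empirical distribution p_m is exactly the objective of step (2) above.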
15.6.2.1 Applications to Learning Structured Distributions
The aforementioned agnostic learning algorithm has been used as the key algorithmic
ingredient to learn various classes of structured distributions. An additional ingredient
needed is a structural approximation result stating that for the underlying distribution
family C there exists an ε-approximation by t-histograms for an appropriately small value
of the parameter t. For example, by using the structural approximation results of [13], one
obtains near-sample-optimal and near-linear time estimators for various well-studied classes
including multimodal densities, monotone hazard rate (MHR) distributions, and others.
However, there exist distribution families where the approach of approximating by
histograms provably leads to suboptimal sample complexity. A prominent such example
is the class of log-concave distributions. This motivates the more general approach of
approximating by piecewise polynomials.
15.6.3 Agnostically Learning Piecewise Polynomials
We say that a distribution q over [0, 1] is a t-piecewise degree-d distribution if there is a
partition of [0, 1] into t disjoint intervals I_1, ..., I_t such that q(x) = q_j(x) for all x ∈ I_j, where each of q_1, ..., q_t is a univariate polynomial of degree at most d. Let P_{t,d} denote the class of all t-piecewise degree-d pdfs over [0, 1]. We have the following theorem.
Theorem 15.4 [14] Let p be any pdf over [0, 1]. There is an algorithm that, given t, d, ε, and Õ(t(d + 1)/ε²) samples from p, runs in time poly(t, d + 1, 1/ε) and with high probability outputs an O(t)-piecewise degree-d hypothesis h such that d_TV(p, h) ≤ O(opt_{t,d}) + ε, where opt_{t,d} := inf_{r∈P_{t,d}} d_TV(p, r) is the error of the best t-piecewise degree-d distribution for p.
It is shown in [14] that the number of samples used by the aforementioned algorithm is
information-theoretically optimal in all three parameters up to logarithmic factors.
The high-level approach to prove this theorem is similar to the one described in the
previous paragraph for the case of histograms. Let A_k be the collection of all subsets of the domain that can be expressed as unions of at most k = 2t(d + 1) intervals. The intuition is that the collection A_k characterizes piecewise polynomials with t pieces and degree d.
Similarly, the following is an agnostic learning algorithm for p.
1. Draw m = Ω(t(d + 1)/ε²) samples from p.
2. Output h ∈ P_{t,d} that minimizes the quantity ‖h − p_m‖_{A_k} (up to an additive error γ = O(ε)).
We remark that the optimization problem in step 2 is nonconvex. However, it has sufficient
structure so that (an appropriately relaxed version of) it can be solved in polynomial time
by a combination of convex programming and dynamic programming.
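As a simplified illustration only, the sketch below fits each piece by least squares to the empirical density on a fixed interval partition of [0, 1]. This is a stand-in showing what a piecewise degree-d hypothesis looks like; it is not the A_k-distance minimization (or the convex and dynamic programming machinery) of [14], and all names, as well as the grid resolution, are our own choices.

```python
import numpy as np

def fit_piecewise_poly(samples, endpoints, degree, grid_per_piece=50):
    """Least-squares fit of a degree-`degree` polynomial to the empirical density
    on each interval [endpoints[j], endpoints[j+1]) of a fixed partition of [0, 1].
    Returns one coefficient vector (highest degree first) per interval."""
    samples = np.asarray(samples)
    m = len(samples)
    coeffs = []
    for a, b in zip(endpoints[:-1], endpoints[1:]):
        bins = np.linspace(a, b, grid_per_piece + 1)
        counts, _ = np.histogram(samples, bins=bins)
        density = counts / (m * np.diff(bins))        # empirical density per grid cell
        centers = (bins[:-1] + bins[1:]) / 2
        coeffs.append(np.polyfit(centers, density, degree))
    return coeffs

def eval_piecewise_poly(x, endpoints, coeffs):
    """Evaluate the fitted piecewise polynomial hypothesis at the points x."""
    x = np.asarray(x)
    idx = np.clip(np.searchsorted(endpoints, x, side="right") - 1, 0, len(coeffs) - 1)
    return np.array([np.polyval(coeffs[i], xi) for i, xi in zip(idx, x)])
```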
15.6.3.1 Applications to Learning Structured Distributions
Theorem 15.4 yields near-sample-optimal and computationally efficient estimators for a
very broad class of structured distribution families, including arbitrary mixtures of natural