15
Learning Structured Distributions
Ilias Diakonikolas
CONTENTS
15.1 Introduction
15.2 Historical Background
15.3 Definitions and Preliminaries
15.4 Types of Structured Distributions
     15.4.1 Shape-Constrained Distributions
     15.4.2 Aggregation of Structured Distributions
15.5 The Cover Method and Sample Bounds
15.6 Learning Univariate Structured Distributions
     15.6.1 Learning Monotone Distributions
     15.6.2 Agnostically Learning Histograms
          15.6.2.1 Applications to Learning Structured Distributions
     15.6.3 Agnostically Learning Piecewise Polynomials
          15.6.3.1 Applications to Learning Structured Distributions
15.7 Learning Multivariate Structured Distributions
15.8 Conclusions and Future Directions
References
15.1 Introduction
Discovering hidden structure in data is one of the cornerstones of modern data analysis.
Because of the diversity and complexity of modern datasets, this is a very challenging task
and the role of efficient algorithms is of paramount importance in this context. The majority
of available datasets are in raw and unstructured form, consisting of example points without
corresponding labels. A large class of unlabeled datasets can be modeled as samples from a
probability distribution over a very large domain. An important goal in the exploration of
these datasets is understanding the underlying distributions.
Estimating distributions from samples is a paradigmatic and fundamental unsupervised
learning problem that has been studied in statistics since the late nineteenth century,
starting with the pioneering work of Pearson [55]. During the past couple of decades,
there has been a large body of work in computer science on this topic with a focus on
computational efficiency.
The area of distribution estimation is well motivated in its own right and has seen a
recent surge of research activity, in part due to the ubiquity of structured distributions in the
natural and social sciences. Such structural properties of distributions are sometimes direct
268 Handbook of Big Data
consequences of the underlying application problem, or they provide a plausible explanation of
the model under investigation.
In this chapter, we give a survey of both classical and modern techniques for distribution
estimation with a focus on recent algorithmic ideas developed in theoretical computer
science. These ideas have led to computationally and statistically efficient algorithms for
learning broad families of models. For the sake of concreteness, we illustrate these ideas
with specific examples. Finally, we highlight outstanding challenges and research directions
for future work.
15.2 Historical Background
The construction of an estimate of an unknown probability density function (pdf) based
on observed data is a classical problem in statistics with a rich history and extensive
literature (see, e.g., [4,26,27,58,60]). A number of generic methods have been proposed
in the mathematical statistics literature, including histograms, kernels, nearest neighbor
estimators, orthogonal series estimators, maximum likelihood, and more. The reader is
referred to [44] for a survey of these techniques.
The oldest and most natural estimator is the histogram, first introduced by Pearson [55].
Given a number of samples (observations) from a pdf, the method partitions the domain
into a number of bins and outputs the empirical density that is constant within each bin.
It should be emphasized that the number of bins to be used and the width and location of
each bin are unspecified by the method. The problem of finding the optimal number and
location of the bins to minimize the error is an inherently algorithmic question, because the
ultimate goal is to obtain learning algorithms that are computationally efficient.
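To make the estimator concrete, here is a minimal sketch in Python of an equal-width histogram estimate on the domain [0, 1) (the helper name `histogram_density` and the choice of domain are ours; choosing the number of bins k is exactly the quantity left unspecified by the method):

```python
def histogram_density(samples, k):
    """Empirical histogram density on [0, 1) with k equal-width bins.

    Returns a list of k density values; the estimate is constant on
    each bin and integrates to 1 over [0, 1).
    """
    counts = [0] * k
    for x in samples:
        counts[min(int(x * k), k - 1)] += 1  # bin index of x in [0, 1)
    n = len(samples)
    # Density on bin i: (fraction of samples in bin i) / (bin width 1/k).
    return [c / n * k for c in counts]

# Seven samples, four bins; the resulting step function integrates to 1.
print(histogram_density([0.05, 0.1, 0.12, 0.3, 0.55, 0.58, 0.9], 4))
```

With equal-width bins only k remains to be chosen; allowing bins of varying width and location enlarges the search space and is what makes the question algorithmically interesting.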
Suppose that we are given a number of samples from a density that we believe belongs to (or
is very close to) a given family C; for example, a mixture of a small number of Gaussian
distributions. Our goal is to estimate the target distribution in a precise, well-defined way.
There are three different goals in this context.
1. In nonproper learning (density estimation) the goal is to output an approximation
to the target density without any constraints on its representation. That is, the
output distribution is not necessarily a member of the family C.
2. In proper learning the goal is to output a density in C that is a good approximation
to the target density.
3. In parameter learning the goal is to identify the parameters of the target
distribution, for example, the mixing weights and the parameters of the components,
up to a desired accuracy. (The notion of parameter learning is well defined for
parametric classes C.)
Note that nonproper learning and proper learning are equivalent in terms of sample size:
given any (nonproper) hypothesis, we can do a brute-force search to find its closest density
in C. However, it is not clear whether this computation can be performed efficiently.
We remark that the task of parameter learning is possible only under certain separation
assumptions on the components. Even under such assumptions, it can be a more demanding
task than proper learning. In particular, it is possible that two distinct distributions in C
whose parameters are far from each other give rise to densities that are close to each other.
Moreover, parameter learning strongly relies on the assumption that there is no noise in the
data, and hence it may not be meaningful in many realistic settings. These facts motivate
the study of proper learning algorithms in the noisy setting.
The focus of this chapter is on general techniques and efficient algorithms for density
estimation and proper learning. Because of the space constraints, we do not elaborate on
the algorithmic methods used for the problem of parameter learning.
The structure of this chapter is as follows: after some basic definitions (Section 15.3),
in Section 15.4, we give a classification of the types of distribution families studied in
the literature. In Section 15.5, we describe a classical method from statistics to efficiently
select from a given set of candidate hypothesis distributions. Section 15.6 describes recent
algorithmic ideas from theoretical computer science to learn structured univariate densities.
Section 15.7 discusses the challenging case of high-dimensional distributions. We conclude
with some future directions in Section 15.8.
15.3 Definitions and Preliminaries
We consider a standard notion of learning an unknown probability distribution from
samples [46], which is a natural analog of Valiant’s well-known probably approximately
correct (PAC) model for learning Boolean functions [67] to the unsupervised setting of
learning an unknown probability distribution. We remark that our definition is essentially
equivalent to the notion of minimax rate of convergence in statistics [27].
Given access to independent draws from an unknown pdf p, the goal is to approximate
p in a certain well-defined sense. More specifically, the goal of a learning algorithm is to
output a hypothesis distribution h that is close to the target distribution p. One can choose
various metrics to measure the distance between distributions. Throughout this chapter, we
measure the closeness between distributions using the statistical distance or total variation
distance. The statistical distance between two densities $p, q : \Omega \to \mathbb{R}_+$ is defined as
$$d_{\mathrm{TV}}(p, q) = \frac{1}{2}\,\|p - q\|_1 = \frac{1}{2} \int_{\Omega} |p(x) - q(x)|\, dx.$$
(When $\Omega$ is discrete the above integral is replaced by a sum.)
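For distributions on a finite domain, this definition translates directly into code; a minimal sketch in Python (the helper name `total_variation` is ours):

```python
def total_variation(p, q):
    """Statistical (total variation) distance between two discrete
    distributions given as probability vectors over the same domain:
    d_TV(p, q) = (1/2) * sum over x of |p(x) - q(x)|."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

# A fair coin vs. a biased coin: d_TV = (1/2) * (0.3 + 0.3) = 0.3.
print(total_variation([0.5, 0.5], [0.8, 0.2]))
```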
A distribution learning problem is defined by a class C of probability distributions over a
domain Ω. The domain Ω may be discrete, for example, Ω = [n]:={1,...,n}, or continuous,
for example, $\Omega = \mathbb{R}$, one-dimensional or high-dimensional. In the noiseless setting, we are
promised that $p \in \mathcal{C}$, and the goal is to construct a hypothesis $h$ such that, with probability
at least 9/10, the total variation distance $d_{\mathrm{TV}}(h, p)$ between $h$ and $p$ is at most $\varepsilon$, where
$\varepsilon > 0$ is the accuracy parameter.
The noisy or agnostic model captures the situation of having adversarial noise in the
data. In this setting, we do not make any assumptions about the target density $p$, and the
goal is to find a hypothesis $h$ that is almost as accurate as the best approximation of $p$ by
any distribution in $\mathcal{C}$. Formally, given $\varepsilon > 0$ and sample access to a target distribution $p$,
the goal of an agnostic learning algorithm for $\mathcal{C}$ is to compute a hypothesis distribution $h$
such that, with probability at least 9/10, it holds that
$$d_{\mathrm{TV}}(h, p) \le \alpha \cdot \mathrm{opt}_{\mathcal{C}}(p) + \varepsilon,$$
where $\mathrm{opt}_{\mathcal{C}}(p) := \inf_{q \in \mathcal{C}} d_{\mathrm{TV}}(q, p)$, that is, $\mathrm{opt}_{\mathcal{C}}(p)$ is the statistical distance between $p$ and
the closest distribution to it in $\mathcal{C}$, and $\alpha \ge 1$ is a universal constant. (We note that, using
standard techniques, the confidence probability can be boosted to $1 - \delta$, for any $\delta > 0$, with a
multiplicative overhead of $O(\log(1/\delta))$ in the sample size.)
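When C is a finite set of candidate densities and p is known explicitly, opt_C(p) can be computed by brute force; a small idealized sketch (the function names are ours, and distributions are probability vectors over a common finite domain — in the actual learning setting p is accessible only through samples):

```python
def tv(p, q):
    # Total variation distance between discrete distributions.
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def opt_in_class(p, candidates):
    """Return (q*, opt_C(p)): the member of the finite class C closest
    to p in total variation distance, and that distance. An agnostic
    learner must output some h with d_TV(h, p) <= alpha * opt + eps."""
    best = min(candidates, key=lambda q: tv(q, p))
    return best, tv(best, p)

p = [0.62, 0.38]                          # hypothetical target over {0, 1}
C = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]  # hypothetical finite class
q_star, opt = opt_in_class(p, C)
print(q_star)  # → [0.7, 0.3]
```

The same scan implements the brute-force conversion from a nonproper hypothesis to a proper one mentioned in Section 15.2; its running time grows with |C|, which is typically prohibitively large, hence the emphasis on computationally efficient methods.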
We will use the following two standard metrics to measure the performance of a learning
algorithm: (1) the sample complexity, that is, the number of samples drawn by the algorithm,
and (2) the computational complexity, that is, the worst-case running time of the algorithm.
An algorithm is statistically efficient if its sample complexity is information-theoretically
optimal, and it is computationally efficient if its computational complexity is polynomial
in its sample complexity. The gold standard is a statistically efficient algorithm whose
computational complexity is linear in its sample size.
As mentioned in the introduction, proper and nonproper learning of any class C are
equivalent in terms of sample complexity, but not necessarily equivalent in terms of
computational complexity. We also remark that, for broad classes of distributions C, agnostic
learning and noiseless learning are equivalent in terms of sample complexity. However,
designing computationally efficient agnostic learning algorithms is, in general, a much more
challenging task.
15.4 Types of Structured Distributions
In this section, we provide a broad categorization of the most common types of structured
distributions that have been considered in the statistics and computer science literatures.
We also briefly summarize a few standard methods to learn such distributions in statistics.
In the following sections, we will describe a set of algorithmic techniques that lead to
provably efficient learning algorithms for most of these distribution families.
15.4.1 Shape-Constrained Distributions
For distributions over $\mathbb{R}^d$ (or a discrete $d$-dimensional subset, e.g., $[n]^d$), a very natural type
of structure to consider is some sort of shape constraint on the pdf defining the distribution.
Statistical research in this area started in the 1950s and the reader is referred to [4]
for a summary of the early work. Most of the literature has focused on one-dimensional
distributions, with a few exceptions during the past decade. Various structural restrictions
have been studied over the years, starting from monotonicity, unimodality, convexity,
and concavity [6,7,9,12,35,38,39,41,45,56,70], and more recently focusing on structural
restrictions such as log-concavity and k-monotonicity [1–3,32,37,49,69]. The reader is
referred to [40] for a recent book on the subject.
The most common method used in statistics to address shape-constrained inference
problems is the maximum likelihood estimator (MLE) and its variants. The challenge is to
analyze the performance of the MLE in this context. It turns out that for several univariate
learning problems of this sort the MLE performs quite well in terms of statistical efficiency.
While the MLE is very popular and quite natural, there exist natural inference problems
(see, e.g., [10]) where it performs poorly in terms of statistical and computational efficiency,
as well as noise tolerance.
A related line of work in mathematical statistics [28–30,47,48] uses nonlinear estimators
based on wavelet techniques to learn continuous distributions whose densities satisfy various
smoothness constraints, such as Triebel and Besov-type smoothness. We remark that the
focus of these works is on the statistical efficiency of the proposed estimators and not on
computational complexity.
15.4.2 Aggregation of Structured Distributions
Aggregations of structured random variables are very popular as they can model many
rich phenomena. Two prominent examples of this sort are mixture models and sums of
simple random variables. Mixtures of structured distributions have received much attention
in statistics [51,57,64] and, more recently, in theoretical computer science [19,54].
We remark that early statistical work on mixture models focuses on parameter learning.
In practice, this problem is typically handled with nonconvex heuristics such as the
expectation–maximization (EM) algorithm. Recent algorithmic techniques rely on the
moment problem and tensor decomposition. However, such algorithms lead to sample
complexities that are inherently exponential in the number of components.
Learning sums of simple random variables has received recent attention in the computer
science literature [20,22]. Such distributions have various applications in areas such as survey
sampling, case-control studies, and survival analysis (see, e.g., [16] for the case of sums of
indicators).
15.5 The Cover Method and Sample Bounds
The first fundamental question that arises in the context of learning an unknown probability
distribution is information-theoretic:
What is the minimum sample size that is necessary and sufficient to learn an unknown
$p \in \mathcal{C}$ up to total variation distance $\varepsilon$?
While this question has been extensively investigated in statistics, information theory,
and, more recently, computer science, the information-theoretically optimal sample size is
not yet understood, even for some relatively simple families of distributions. It turns out
that the optimal sample complexity depends on the structure of the underlying density class
in a subtle way.
In this section, we describe a general powerful method that yields nearly tight upper
bounds on the sample complexity of learning. The method, which we term the cover
method, is classical in statistics and information theory and has its roots in early work
of A. N. Kolmogorov. The high-level idea is to analyze the structure of the metric space
C under total variation distance. The method postulates that the structure of this metric
space characterizes the sample complexity of learning. To describe the method in detail, we
introduce some basic terminology.
Let $(\mathcal{X}, d)$ be a metric space. Given $\delta > 0$, a subset $\mathcal{Y} \subseteq \mathcal{X}$ is said to be a $\delta$-cover
of $\mathcal{X}$ with respect to the metric $d : \mathcal{X}^2 \to \mathbb{R}_+$ if for every $x \in \mathcal{X}$ there exists some
$y \in \mathcal{Y}$ such that $d(x, y) \le \delta$. There may exist many $\delta$-covers of $\mathcal{X}$, but one is typically
interested in those with minimum cardinality. The $\delta$-covering number of $(\mathcal{X}, d)$ is the
minimum cardinality of any $\delta$-cover of $\mathcal{X}$. Intuitively, the covering number captures the
size of the metric space.
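For a finite metric space, a δ-cover (not necessarily of minimum cardinality) can be constructed greedily; a sketch in Python (the function name is ours):

```python
def greedy_delta_cover(points, dist, delta):
    """Greedily build a delta-cover of a finite metric space: scan the
    points and keep any point that is not already within distance delta
    of a previously kept one. The result is a valid delta-cover, though
    not necessarily one of minimum cardinality."""
    cover = []
    for x in points:
        if all(dist(x, y) > delta for y in cover):
            cover.append(x)
    return cover

# Points on the line under |x - y|: two well-separated clusters.
pts = [0.0, 0.1, 0.2, 0.8, 0.9]
print(greedy_delta_cover(pts, lambda a, b: abs(a - b), 0.15))  # → [0.0, 0.2, 0.8]
```

Note that the kept centers are pairwise more than δ apart, so they also form a δ-packing; this is the standard route to certifying that the greedy cover is not far from minimal.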
Covering numbers—and their logarithms, known as metric entropy numbers—were first
defined by Kolmogorov in the 1950s and have since played a central role in a number of
areas, including approximation theory, geometric functional analysis (see, e.g., [8,31,53] and
[11,33,50,52]), information theory, statistics, and machine learning (see, e.g., [5,42,43,73,74]
and [27,65,68]).
In the context of distribution learning, the cover method is summarized in the following
theorem.