15
Learning Structured Distributions
Ilias Diakonikolas
CONTENTS
15.1 Introduction
15.2 Historical Background
15.3 Definitions and Preliminaries
15.4 Types of Structured Distributions
     15.4.1 Shape-Constrained Distributions
     15.4.2 Aggregation of Structured Distributions
15.5 The Cover Method and Sample Bounds
15.6 Learning Univariate Structured Distributions
     15.6.1 Learning Monotone Distributions
     15.6.2 Agnostically Learning Histograms
          15.6.2.1 Applications to Learning Structured Distributions
     15.6.3 Agnostically Learning Piecewise Polynomials
          15.6.3.1 Applications to Learning Structured Distributions
15.7 Learning Multivariate Structured Distributions
15.8 Conclusions and Future Directions
References
15.1 Introduction
Discovering hidden structure in data is one of the cornerstones of modern data analysis.
Because of the diversity and complexity of modern datasets, this is a very challenging task
and the role of efficient algorithms is of paramount importance in this context. The majority
of available datasets are in raw and unstructured form, consisting of example points without
corresponding labels. A large class of unlabeled datasets can be modeled as samples from a
probability distribution over a very large domain. An important goal in the exploration of
these datasets is understanding the underlying distributions.
Estimating distributions from samples is a paradigmatic and fundamental unsupervised
learning problem that has been studied in statistics since the late nineteenth century,
starting with the pioneering work of Pearson [55]. During the past couple of decades,
there has been a large body of work in computer science on this topic with a focus on
computational efficiency.
The area of distribution estimation is well motivated in its own right and has seen a
recent surge of research activity, in part due to the ubiquity of structured distributions in the
natural and social sciences. Such structural properties of distributions are sometimes direct
268 Handbook of Big Data
consequences of the underlying application problem, or they provide a plausible explanation of
the model under investigation.
In this chapter, we give a survey of both classical and modern techniques for distribution
estimation with a focus on recent algorithmic ideas developed in theoretical computer
science. These ideas have led to computationally and statistically efficient algorithms for
learning broad families of models. For the sake of concreteness, we illustrate these ideas
with specific examples. Finally, we highlight outstanding challenges and research directions
for future work.
15.2 Historical Background
The construction of an estimate of an unknown probability density function (pdf) based
on observed data is a classical problem in statistics with a rich history and extensive
literature (see, e.g., [4,26,27,58,60]). A number of generic methods have been proposed
in the mathematical statistics literature, including histograms, kernels, nearest neighbor
estimators, orthogonal series estimators, maximum likelihood, and more. The reader is
referred to [44] for a survey of these techniques.
The oldest and most natural estimator is the histogram, first introduced by Pearson [55].
Given a number of samples (observations) from a pdf, the method partitions the domain
into a number of bins and outputs the empirical density that is constant within each bin.
It should be emphasized that the number of bins to be used and the width and location of
each bin are unspecified by the method. The problem of finding the optimal number and
location of the bins to minimize the error is an inherently algorithmic question, because the
ultimate goal is to obtain learning algorithms that are computationally efficient.
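To make the estimator concrete, here is a minimal sketch in Python of an equal-width histogram estimate on the domain [0, 1) (the helper name `histogram_density` and the choice of domain are ours; choosing the number of bins k is exactly the quantity left unspecified by the method):

```python
def histogram_density(samples, k):
    """Empirical histogram density on [0, 1) with k equal-width bins.

    Returns a list of k density values; the estimate is constant on
    each bin and integrates to 1 over [0, 1).
    """
    counts = [0] * k
    for x in samples:
        counts[min(int(x * k), k - 1)] += 1  # bin index of x in [0, 1)
    n = len(samples)
    # Density on bin i: (fraction of samples in bin i) / (bin width 1/k).
    return [c / n * k for c in counts]

# Seven samples, four bins; the resulting step function integrates to 1.
print(histogram_density([0.05, 0.1, 0.12, 0.3, 0.55, 0.58, 0.9], 4))
```

With equal-width bins only k remains to be chosen; allowing bins of varying width and location enlarges the search space and is what makes the question algorithmically interesting.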
Suppose that we are given a number of samples from a density that we believe belongs to (or
is very close to) a given family C; for example, a mixture of a small number of Gaussian
distributions. Our goal is to estimate the target distribution in a precise, well-defined way.
There are three different goals in this context.
1. In nonproper learning (density estimation) the goal is to output an approximation
to the target density without any constraints on its representation. That is, the
output distribution is not necessarily a member of the family C.
2. In proper learning the goal is to output a density in C that is a good approximation
to the target density.
3. In parameter learning the goal is to identify the parameters of the target
distribution, for example, the mixing weights and the parameters of the components,
up to a desired accuracy. (The notion of parameter learning is well defined for
parametric classes C.)
Note that nonproper learning and proper learning are equivalent in terms of sample size:
given any (nonproper) hypothesis, we can do a brute-force search to find its closest density
in C. However, it is not clear whether this computation can be performed efficiently.
We remark that the task of parameter learning is possible only under certain separation
assumptions on the components. Even under such assumptions, it can be a more demanding
task than proper learning. In particular, it is possible that two distinct distributions in C
whose parameters are far from each other give rise to densities that are close to each other.
Moreover, parameter learning strongly relies on the assumption that there is no noise in the
data, and hence it may not be meaningful in many realistic settings. These facts motivate
the study of proper learning algorithms in the noisy setting.
The focus of this chapter is on general techniques and efficient algorithms for density
estimation and proper learning. Because of the space constraints, we do not elaborate on
the algorithmic methods used for the problem of parameter learning.
The structure of this chapter is as follows: after some basic definitions (Section 15.3),
in Section 15.4, we give a classification of the types of distribution families studied in
the literature. In Section 15.5, we describe a classical method from statistics to efficiently
select from a given set of candidate hypothesis distributions. Section 15.6 describes recent
algorithmic ideas from theoretical computer science to learn structured univariate densities.
Section 15.7 discusses the challenging case of high-dimensional distributions. We conclude
with some future directions in Section 15.8.
15.3 Definitions and Preliminaries
We consider a standard notion of learning an unknown probability distribution from
samples [46], which is a natural analog of Valiant’s well-known probably approximately
correct (PAC) model for learning Boolean functions [67] to the unsupervised setting of
learning an unknown probability distribution. We remark that our definition is essentially
equivalent to the notion of minimax rate of convergence in statistics [27].
Given access to independent draws from an unknown pdf p, the goal is to approximate
p in a certain well-defined sense. More specifically, the goal of a learning algorithm is to
output a hypothesis distribution h that is close to the target distribution p. One can choose
various metrics to measure the distance between distributions. Throughout this chapter, we
measure the closeness between distributions using the statistical distance or total variation
distance. The statistical distance between two densities $p, q : \Omega \to \mathbb{R}_+$ is defined as
$$d_{\mathrm{TV}}(p, q) = \frac{1}{2}\,\|p - q\|_1 = \frac{1}{2} \int_{\Omega} |p(x) - q(x)|\, dx.$$
(When $\Omega$ is discrete the above integral is replaced by a sum.)
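For distributions on a finite domain, this definition translates directly into code; a minimal sketch in Python (the helper name `total_variation` is ours):

```python
def total_variation(p, q):
    """Statistical (total variation) distance between two discrete
    distributions given as probability vectors over the same domain:
    d_TV(p, q) = (1/2) * sum over x of |p(x) - q(x)|."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

# A fair coin vs. a biased coin: d_TV = (1/2) * (0.3 + 0.3) = 0.3.
print(total_variation([0.5, 0.5], [0.8, 0.2]))
```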
A distribution learning problem is defined by a class C of probability distributions over a
domain Ω. The domain Ω may be discrete, for example, Ω = [n]:={1,...,n}, or continuous,
for example, $\Omega = \mathbb{R}$, one-dimensional or high-dimensional. In the noiseless setting, we are
promised that $p \in \mathcal{C}$, and the goal is to construct a hypothesis $h$ such that, with probability
at least 9/10, the total variation distance $d_{\mathrm{TV}}(h, p)$ between $h$ and $p$ is at most $\varepsilon$, where
$\varepsilon > 0$ is the accuracy parameter.
The noisy or agnostic model captures the situation of having adversarial noise in the
data. In this setting, we do not make any assumptions about the target density $p$, and the
goal is to find a hypothesis $h$ that is almost as accurate as the best approximation of $p$ by
any distribution in $\mathcal{C}$. Formally, given $\varepsilon > 0$ and sample access to a target distribution $p$,
the goal of an agnostic learning algorithm for $\mathcal{C}$ is to compute a hypothesis distribution $h$
such that, with probability at least 9/10, it holds that
$$d_{\mathrm{TV}}(h, p) \le \alpha \cdot \mathrm{opt}_{\mathcal{C}}(p) + \varepsilon,$$
where $\mathrm{opt}_{\mathcal{C}}(p) := \inf_{q \in \mathcal{C}} d_{\mathrm{TV}}(q, p)$, that is, $\mathrm{opt}_{\mathcal{C}}(p)$ is the statistical distance between $p$ and
the closest distribution to it in $\mathcal{C}$, and $\alpha \ge 1$ is a universal constant. (We note that, using
standard techniques, the confidence probability can be boosted to $1 - \delta$, for any $\delta > 0$, with a
multiplicative overhead of $O(\log(1/\delta))$ in the sample size.)
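When C is a finite set of candidate densities and p is known explicitly, opt_C(p) can be computed by brute force; a small idealized sketch (the function names are ours, and distributions are probability vectors over a common finite domain — in the actual learning setting p is accessible only through samples):

```python
def tv(p, q):
    # Total variation distance between discrete distributions.
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def opt_in_class(p, candidates):
    """Return (q*, opt_C(p)): the member of the finite class C closest
    to p in total variation distance, and that distance. An agnostic
    learner must output some h with d_TV(h, p) <= alpha * opt + eps."""
    best = min(candidates, key=lambda q: tv(q, p))
    return best, tv(best, p)

p = [0.62, 0.38]                          # hypothetical target over {0, 1}
C = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]  # hypothetical finite class
q_star, opt = opt_in_class(p, C)
print(q_star)  # → [0.7, 0.3]
```

The same scan implements the brute-force conversion from a nonproper hypothesis to a proper one mentioned in Section 15.2; its running time grows with |C|, which is typically prohibitively large, hence the emphasis on computationally efficient methods.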
We will use the following two standard metrics to measure the performance of a learning
algorithm: (1) the sample complexity, that is, the number of samples drawn by the algorithm,
and (2) the computational complexity, that is, the worst-case running time of the algorithm.
An algorithm is statistically efficient if its sample complexity is information-theoretically
optimal, and it is computationally efficient if its computational complexity is polynomial
in its sample complexity. The gold standard is a statistically efficient algorithm whose
computational complexity is linear in its sample size.
As mentioned in the introduction, proper and nonproper learning of any class C are
equivalent in terms of sample complexity, but not necessarily equivalent in terms of
computational complexity. We also remark that, for broad classes of distributions C, agnostic
learning and noiseless learning are equivalent in terms of sample complexity. However,
designing computationally efficient agnostic learning algorithms is, in general, a much more
challenging task.
15.4 Types of Structured Distributions
In this section, we provide a broad categorization of the most common types of structured
distributions that have been considered in the statistics and computer science literatures.
We also briefly summarize a few standard methods to learn such distributions in statistics.
In the following sections, we will describe a set of algorithmic techniques that lead to
provably efficient learning algorithms for most of these distribution families.
15.4.1 Shape-Constrained Distributions
For distributions over $\mathbb{R}^d$ (or a discrete $d$-dimensional subset, e.g., $[n]^d$), a very natural type
of structure to consider is some sort of shape constraint on the pdf defining the distribution.
Statistical research in this area started in the 1950s and the reader is referred to [4]
for a summary of the early work. Most of the literature has focused on one-dimensional
distributions, with a few exceptions during the past decade. Various structural restrictions
have been studied over the years, starting from monotonicity, unimodality, convexity,
and concavity [6,7,9,12,35,38,39,41,45,56,70], and more recently focusing on structural
restrictions such as log-concavity and k-monotonicity [1–3,32,37,49,69]. The reader is
referred to [40] for a recent book on the subject.
The most common method used in statistics to address shape-constrained inference
problems is the maximum likelihood estimator (MLE) and its variants. The challenge is to
analyze the performance of the MLE in this context. It turns out that for several univariate
learning problems of this sort the MLE performs quite well in terms of statistical efficiency.
While the MLE is very popular and quite natural, there exist natural inference problems
(see, e.g., [10]) where it performs poorly in terms of statistical and computational efficiency,
as well as noise tolerance.
A related line of work in mathematical statistics [28–30,47,48] uses nonlinear estimators
based on wavelet techniques to learn continuous distributions whose densities satisfy various
smoothness constraints, such as Triebel and Besov-type smoothness. We remark that the
focus of these works is on the statistical efficiency of the proposed estimators and not on
computational complexity.
15.4.2 Aggregation of Structured Distributions
Aggregations of structured random variables are very popular as they can model many
rich phenomena. Two prominent examples of this sort are mixture models and sums of
simple random variables. Mixtures of structured distributions have received much attention
in statistics [51,57,64] and, more recently, in theoretical computer science [19,54].
We remark that early statistical work on mixture models focuses on parameter learning.
In practice, this problem is typically handled with nonconvex heuristics such as the
expectation–maximization (EM) algorithm. Recent algorithmic techniques rely on the
moment problem and tensor decomposition. However, such algorithms lead to sample
complexities that are inherently exponential in the number of components.
Learning sums of simple random variables has received recent attention in the computer
science literature [20,22]. Such distributions have various applications in areas such as survey
sampling, case-control studies, and survival analysis (see, e.g., [16] for the case of sums of
indicators).
15.5 The Cover Method and Sample Bounds
The first fundamental question that arises in the context of learning an unknown probability
distribution is information-theoretic:
What is the minimum sample size that is necessary and sufficient to learn an unknown
$p \in \mathcal{C}$ up to total variation distance $\varepsilon$?
While this question has been extensively investigated in statistics, information theory,
and, more recently, computer science, the information-theoretically optimal sample size is
not yet understood, even for some relatively simple families of distributions. It turns out
that the optimal sample complexity depends on the structure of the underlying density class
in a subtle way.
In this section, we describe a general powerful method that yields nearly tight upper
bounds on the sample complexity of learning. The method, which we term the cover
method, is classical in statistics and information theory and has its roots in early work
of A. N. Kolmogorov. The high-level idea is to analyze the structure of the metric space
C under total variation distance. The method postulates that the structure of this metric
space characterizes the sample complexity of learning. To describe the method in detail, we
introduce some basic terminology.
Let $(\mathcal{X}, d)$ be a metric space. Given $\delta > 0$, a subset $\mathcal{Y} \subseteq \mathcal{X}$ is said to be a $\delta$-cover
of $\mathcal{X}$ with respect to the metric $d : \mathcal{X}^2 \to \mathbb{R}_+$ if for every $x \in \mathcal{X}$ there exists some
$y \in \mathcal{Y}$ such that $d(x, y) \le \delta$. There may exist many $\delta$-covers of $\mathcal{X}$, but one is typically
interested in those with minimum cardinality. The $\delta$-covering number of $(\mathcal{X}, d)$ is the
minimum cardinality of any $\delta$-cover of $\mathcal{X}$. Intuitively, the covering number captures the
size of the metric space.
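For a finite metric space, a δ-cover (not necessarily of minimum cardinality) can be constructed greedily; a sketch in Python (the function name is ours):

```python
def greedy_delta_cover(points, dist, delta):
    """Greedily build a delta-cover of a finite metric space: scan the
    points and keep any point that is not already within distance delta
    of a previously kept one. The result is a valid delta-cover, though
    not necessarily one of minimum cardinality."""
    cover = []
    for x in points:
        if all(dist(x, y) > delta for y in cover):
            cover.append(x)
    return cover

# Points on the line under |x - y|: two well-separated clusters.
pts = [0.0, 0.1, 0.2, 0.8, 0.9]
print(greedy_delta_cover(pts, lambda a, b: abs(a - b), 0.15))  # → [0.0, 0.2, 0.8]
```

Note that the kept centers are pairwise more than δ apart, so they also form a δ-packing; this is the standard route to certifying that the greedy cover is not far from minimal.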
Covering numbers—and their logarithms, known as metric entropy numbers—were first
defined by Kolmogorov in the 1950s and have since played a central role in a number of
areas, including approximation theory, geometric functional analysis (see, e.g., [8,31,53] and
[11,33,50,52]), information theory, statistics, and machine learning (see, e.g., [5,42,43,73,74]
and [27,65,68]).
In the context of distribution learning, the cover method is summarized in the following
theorem.