16
Penalized Estimation in Complex Models
Jacob Bien and Daniela Witten
CONTENTS
16.1 Introduction
16.2 Loss Functions
    16.2.1 Linear Regression
    16.2.2 Matrix Completion
    16.2.3 Precision Matrix Estimation
16.3 Structure-Inducing Convex Penalties
    16.3.1 The Lasso
    16.3.2 The Group Lasso
    16.3.3 The Nuclear Norm
16.4 Algorithms
    16.4.1 Blockwise Coordinate Descent
    16.4.2 Proximal Gradient Method
    16.4.3 Alternating Direction Method of Multipliers
16.5 Applications
    16.5.1 Multivariate Regression with the Group Lasso
    16.5.2 Matrix Completion with the Nuclear Norm
    16.5.3 Graphical Models with the Lasso
16.6 Discussion
Acknowledgments
References
16.1 Introduction
In recent years, both the popular press and the scientific community have been focused on
the potential of Big Data. The phrase has been used to refer to the very large amounts
of data that are now being routinely collected by e-commerce sites, molecular biologists,
sociologists, credit card companies, astrophysicists, and more. As computing becomes less
expensive (and, in some cases, as experimental technologies make it possible to measure a
growing number of features), the scale of data being collected across a broad range of fields
will continue to increase at a rapid clip.
From the perspective of a classical statistician, more data are typically better data. After
all, most classical statistical methods and results rely on the assumption that the sample
size is extremely large. If Big Data were only characterized by enormous sample sizes, then
its challenges would be purely computational in nature. However, two aspects inherent to
much of today’s Big Data lead to statistical challenges:
1. While the sample size might be large, the number of features for which measure-
ments are available is often much larger. For instance, a web search company
might track every single query made by each of its users. In this setting, the
number of users (the sample size) is huge, but the number of possible queries
(the set of features) is orders of magnitude larger. Similarly, in biology, new
technologies have made it possible to obtain a detailed molecular snapshot of the
activity of a tissue sample or even a single cell; however, this snapshot is expensive
and so typically the sample size is quite small relative to the number of molecular
measurements that are obtained. These examples are high dimensional: there
are many more features than observations.
2. When faced with old-fashioned small data, statisticians often seek to answer very
simple questions, such as "Is this parameter nonzero?" and "Is X correlated with
Y?" However, on the basis of Big Data, we often ask far more complex questions,
such as "Which of these 1,000,000 parameters are nonzero?" and "What are the
conditional dependence relationships among these 30,000 features?"
In essence, people ask very complex questions on the basis of Big Data; answering these
questions leads to statistical challenges. Because so much data are available and so many
features have been measured, we become ambitious and seek to fit very complex models.
However, this level of complexity often cannot be supported by the available sample size,
leading to a host of problems, such as overfitting. Even seemingly simple tasks—such as
fitting a linear model to predict a response on the basis of a set of features—can be quite
challenging with Big Data.
In the classical statistical setting, we often estimate a parameter vector θ as the
minimizer of some loss function L(θ), which might be the negative log-likelihood of the data
under some model. In the context of Big Data, reducing the complexity of the parameter
space is of paramount importance. We can do this by making an assumption about the
structure of θ—for instance, that it is sparse or low-rank—and selecting a penalty function
P that encourages this structure in θ. We can then estimate θ as
θ̂, a minimizer of

    minimize_θ { L(θ) + λ P(θ) },    (16.1)

where λ is a nonnegative tuning parameter that controls the tradeoff between the loss
function (which encourages the parameter estimate to fit the data) and the penalty function
(which encourages the parameter estimate to conform to our structural assumptions). For
a suitably chosen penalty function P, Equation 16.1 can overcome some of the problems
associated with the complexity of the parameter space, thereby yielding a good estimate
of θ. This makes Equation 16.1 a valuable framework when working with Big Data.
Typically, the loss function L is convex. If the penalty function P is also chosen to be
convex, then Equation 16.1 is a convex optimization problem. This means that every local
optimum is a global optimum and that we have access to a sophisticated set of tools for
finding the minimizer of Equation 16.1 (Boyd and Vandenberghe, 2004). This is particularly
important in the context of Big Data, for which efficient algorithms and implementations
are critical.
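To make the loss + penalty framework of Equation 16.1 concrete, here is a minimal Python sketch that assembles a generic penalized objective. The particular loss (squared error, as in Section 16.2.1), penalty (the ℓ₁ norm, as in Section 16.3.1), data, and value of λ are placeholders chosen purely for illustration; actually minimizing such an objective is the subject of Section 16.4.

    import numpy as np

    def penalized_objective(theta, loss, penalty, lam):
        # Generic "loss + penalty" criterion from Equation 16.1.
        return loss(theta) + lam * penalty(theta)

    # Illustrative ingredients: squared-error loss and an l1 (lasso) penalty,
    # evaluated on simulated data with n = 50 observations and p = 100 features.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 100))
    y = rng.standard_normal(50)
    loss = lambda beta: 0.5 * np.sum((y - X @ beta) ** 2)
    penalty = lambda beta: np.sum(np.abs(beta))

    value = penalized_objective(np.zeros(100), loss, penalty, lam=1.0)
    print(value)  # objective value at beta = 0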
In this chapter, we will explore the loss + penalty framework given in Equation 16.1 in
the setting where both L and P are convex functions. We will consider three motivating
examples: linear regression, matrix completion, and Gaussian graphical modeling. In
Section 16.2, we will explore the convex loss functions associated with these motivating
examples. In Section 16.3, we will introduce some convex penalty functions that can be used
to induce particular types of structure in the parameters. We will explore three algorithms
for solving convex optimization problems in Section 16.4 and will apply them to our three
motivating examples in Section 16.5. Finally, we close with a discussion in Section 16.6.
In this chapter, we will not consider the theoretical aspects of the estimators obtained
via the loss + penalty framework discussed here. For a very thorough exposition of these the-
oretical considerations, we refer the interested reader to Bühlmann and van de Geer (2011).
We use the following notational conventions. Random variables and random vectors are
capitalized (e.g., A), fixed scalars are in lowercase (e.g., a), fixed vectors are in lowercase
bold (e.g., a), and matrices are in capital bold (e.g., A). For a matrix A, A_j denotes its
jth column and A_ij denotes its (i, j)th element. The notation diag(a_i) indicates a diagonal
matrix with elements a_1, ..., a_p on the diagonal.
We typically will use Latin letters to represent the data and Greek letters to represent
parameters and optimization variables. We will use θ as the optimization variable for a
generic optimization problem, as in Equation 16.1. The solution to an optimization problem
with optimization variable θ will be denoted using a hat, that is, θ̂. In the special cases of
linear regression, matrix completion, and the Gaussian graphical model, we will represent
the parameters using β, Γ, and Ω, respectively.
16.2 Loss Functions
In this chapter, we consider convex loss functions, which arise quite often in the context
of complex statistical models. Often the loss function is motivated as the negative log-
likelihood for the data, under some distributional assumptions. Here, we describe three
convex loss functions that arise in Big Data applications. These will serve as illustrative
examples throughout this chapter.
16.2.1 Linear Regression
Consider the model
    Y = X^T β + ε,

where Y is a response variable, X = (X_1, ..., X_p)^T is a p-vector of features,
β = (β_1, ..., β_p)^T is a p-dimensional parameter vector, and ε is a noise term. Our goal is to
estimate the elements of β on the basis of n independent observations of X and Y.
Let X denote the n × p data matrix and y denote the response vector of length n. The
squared-error loss function takes the form

    L(β) = (1/2) ‖y − Xβ‖₂².    (16.2)

Minimizing Equation 16.2 leads to the standard least-squares estimator for β.
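As a minimal illustration, the least-squares estimate can be computed directly when n > p; the data below are simulated and the variable names are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5
    X = rng.standard_normal((n, p))
    beta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
    y = X @ beta_true + rng.standard_normal(n)

    # Minimizer of L(beta) = (1/2) * ||y - X beta||_2^2.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)

When p exceeds n, the least-squares estimator is no longer unique, which is one motivation for adding a structure-inducing penalty as in Equation 16.1.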
16.2.2 Matrix Completion
Suppose that we have an m×n data matrix X, which is a noisy observation of some unknown
matrix Γ. That is,
X = Γ + E,
where E is an m×n matrix of independent noise terms. Consider the task of estimating Γ in
the setting where only a subset of the elements of X are observed. We let O ⊆ {1, ..., m} ×
{1, ..., n} denote the set of indices of the elements of X that are observed. This is known
as the matrix completion problem and has been extensively investigated in the context of
user recommendation systems (Cai et al., 2010; Mazumder et al., 2010).
We could try to estimate Γ in this setting by minimizing the loss function
    L(Γ) = (1/2) ∑_{(i,j)∈O} (X_ij − Γ_ij)²    (16.3)

with respect to Γ. Unfortunately, the solution Γ̂ is trivial and not useful: Γ̂_ij equals X_ij
for (i, j) ∈ O and can take on any value for (i, j) ∉ O. In effect, the problem is that the
parameter space is too complex given the available data (the elements of the matrix X that
are in the set O).
We will see in Section 16.5 that if we are willing to make certain assumptions about the
structure of Γ—and if we encode those structural assumptions via an appropriate convex
penalty—then the loss function (Equation 16.3) can be successfully used to estimate the
matrix Γ.
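As a concrete sketch, the observed-entry loss in Equation 16.3 can be evaluated by recording O as a boolean mask. The simulated data, the 30% observation rate, and the function name below are illustrative, and the code only evaluates the loss; estimating Γ requires the nuclear norm penalty of Sections 16.3.3 and 16.5.2.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 20, 15
    Gamma = np.outer(rng.standard_normal(m), rng.standard_normal(n))  # rank-one "truth"
    X = Gamma + 0.1 * rng.standard_normal((m, n))                     # noisy observation
    observed = rng.random((m, n)) < 0.3                               # boolean mask encoding the set O

    def completion_loss(Gamma_hat):
        # L(Gamma) = (1/2) * sum over observed (i, j) of (X_ij - Gamma_ij)^2.
        resid = (X - Gamma_hat)[observed]
        return 0.5 * np.sum(resid ** 2)

    print(completion_loss(np.zeros((m, n))))  # loss at the all-zero matrix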
16.2.3 Precision Matrix Estimation
Consider a p-dimensional multivariate normal random vector,
    X = (X_1, ..., X_p)^T ∼ N_p(μ, Σ),

where μ_j = E[X_j] and σ_jk = Cov(X_j, X_k). Classical statistical theory tells us that a zero
element of the precision matrix, Ω = Σ⁻¹, corresponds to a pair of random variables that
are conditionally independent—that is, independent given the other p − 2 random variables
(see, e.g., Mardia et al. 1979). Therefore, the set of conditional dependence relationships,
referred to as a graphical model, is given by the sparsity pattern of Ω (see, e.g., Koller and
Friedman 2009).
Let X denote an n × p data matrix, for which the rows represent n independent draws
from the random vector X. Then a natural loss function takes the form
    L(Ω) = −log det Ω + trace(SΩ),    (16.4)
where S is the empirical covariance matrix of X. The loss function (Equation 16.4) is convex
in Ω; notably, it is not convex in Σ.
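The loss in Equation 16.4 is straightforward to compute from data. Below is a minimal sketch with simulated observations; the function name is illustrative, np.cov uses the unbiased (n − 1) scaling for the empirical covariance matrix, and slogdet is used for numerical stability.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 10
    X = rng.standard_normal((n, p))   # n independent draws (true covariance = identity)
    S = np.cov(X, rowvar=False)       # empirical covariance matrix (p x p)

    def gaussian_loss(Omega):
        # L(Omega) = -log det(Omega) + trace(S Omega); Omega must be positive definite.
        sign, logdet = np.linalg.slogdet(Omega)
        if sign <= 0:
            return np.inf
        return -logdet + np.trace(S @ Omega)

    print(gaussian_loss(np.eye(p)))   # loss at Omega = identity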
16.3 Structure-Inducing Convex Penalties
In this section, we introduce three convex penalties that induce three very simple types of
structure. The lasso induces sparsity on the elements of a vector; the group lasso induces
sparsity on groups of elements of a vector; and the nuclear norm induces sparsity on the
singular values of a matrix. Although we discuss only these three penalties in detail, in
Table 16.1, we demonstrate the breadth of structures attainable through convex penalties.
16.3.1 The Lasso
In the setting of Section 16.2.1, suppose that we wish to perform linear regression in such a
way that no more than k of the elements of β are estimated to be nonzero. That is, rather
than minimizing the loss function (16.2) with respect to β, we could consider the problem
    minimize_β  (1/2) ‖y − Xβ‖₂²  subject to  ‖β‖₀ ≤ k.    (16.5)
TABLE 16.1
Examples of convex penalties from the recent literature.
Name | Structure | Example | References
Lasso | Sparsity | β̂_j = 0 | Tibshirani (1996)
Group lasso (GL) | Group sparsity | β̂_g = 0 | Yuan and Lin (2007a)
Hierarchical GL | Hierarchical sparsity | β̂_g = 0 ⇒ β̂_h = 0 | Zhao et al. (2009)
Overlapping GL | Zero unions | β̂_{g∪h} = 0 | Jenatton et al. (2011)
Latent overlapping GL | Nonzero unions | β̂_{g∪h} ≠ 0 | Obozinski et al. (2011a)
Fused lasso | Sparse differences | β̂_j = β̂_{j+1} | Tibshirani et al. (2005)
ℓ₁ trend filtering | Sparse discrete derivatives | β̂_j − β̂_{j−1} = β̂_{j+1} − β̂_j | Kim et al. (2009); Tibshirani (2014)
Isotonic lasso | Monotonicity | β̂_j ≤ β̂_{j+1} | Tibshirani et al. (2011)
OSCAR | Clustered coefficients | β̂_j = β̂_k | Bondell and Reich (2008)
Nuclear norm | Low rank | σ_j(Ω̂) = 0 | Fazel (2002)
Group fusion | Identical vectors | β̂ = γ̂ | Hocking et al. (2011), among others
Generalized lasso | Sparse linear combinations | d^T β̂ = 0 | Tibshirani and Taylor (2011); She (2010)
Note: For each penalty, a description of the structure induced and a reference are provided.
Here ‖β‖₀ denotes the ℓ₀ norm, or the cardinality (number of nonzero elements) of β.
Equation 16.5 is known as best-subsets regression.
Unfortunately, the ℓ₀ norm is nonconvex, and Equation 16.5 cannot be efficiently solved
when p is large. To address this problem, we can replace the ℓ₀ norm with an ℓ₁ norm to
get a convex relaxation of the above problem:
    minimize_β  (1/2) ‖y − Xβ‖₂²  subject to  ∑_{j=1}^p |β_j| ≤ c.    (16.6)
For small values of c, the value of β that minimizes Equation 16.6 will tend to be sparse—
that is, to have elements exactly equal to zero. Equation 16.6 is equivalent to (Boyd and
Vandenberghe, 2004)
    minimize_β  { (1/2) ‖y − Xβ‖₂² + λ ∑_{j=1}^p |β_j| },    (16.7)
where λ is some nonnegative tuning parameter. Roughly speaking, as λ increases, the
solution β̂ to Equation 16.7 will tend to become sparser.
The penalty function in Equation 16.7, P(β) = ∑_{j=1}^p |β_j|, is known as a lasso (or ℓ₁)
penalty (Tibshirani, 1996). It has been extensively explored in combination with a number
of loss functions for a variety of problems, such as linear regression, generalized linear
modeling (Friedman et al., 2010), survival analysis (Tibshirani, 1997), graphical modeling
(Yuan and Lin, 2007b), and more (Tibshirani, 2011). We will apply it to graphical modeling
in Section 16.5.3.
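To illustrate, one simple way to compute an approximate solution to Equation 16.7 is proximal gradient descent (ISTA), which alternates a gradient step on the squared-error loss with elementwise soft-thresholding. This is a minimal sketch with simulated data, an arbitrary value of λ, and no convergence checks, not a production solver; proximal gradient methods are discussed in Section 16.4.2.

    import numpy as np

    def soft_threshold(z, t):
        # Elementwise soft-thresholding: shrink z toward zero by t.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        # Approximately minimize (1/2)||y - X beta||_2^2 + lam * ||beta||_1.
        p = X.shape[1]
        beta = np.zeros(p)
        step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest eigenvalue of X^T X)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y)          # gradient of the squared-error loss
            beta = soft_threshold(beta - step * grad, step * lam)
        return beta

    rng = np.random.default_rng(0)
    n, p = 100, 200                              # high dimensional: p > n
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0                          # only 5 truly nonzero coefficients
    y = X @ beta_true + rng.standard_normal(n)

    beta_hat = lasso_ista(X, y, lam=10.0)
    print(np.sum(beta_hat != 0))                 # number of nonzero estimated coefficients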