16
Penalized Estimation in Complex Models
Jacob Bien and Daniela Witten
CONTENTS
16.1 Introduction
16.2 Loss Functions
    16.2.1 Linear Regression
    16.2.2 Matrix Completion
    16.2.3 Precision Matrix Estimation
16.3 Structure-Inducing Convex Penalties
    16.3.1 The Lasso
    16.3.2 The Group Lasso
    16.3.3 The Nuclear Norm
16.4 Algorithms
    16.4.1 Blockwise Coordinate Descent
    16.4.2 Proximal Gradient Method
    16.4.3 Alternating Direction Method of Multipliers
16.5 Applications
    16.5.1 Multivariate Regression with the Group Lasso
    16.5.2 Matrix Completion with the Nuclear Norm
    16.5.3 Graphical Models with the Lasso
16.6 Discussion
Acknowledgments
References
16.1 Introduction
In recent years, both the popular press and the scientific community have been focused on
the potential of Big Data. The phrase has been used to refer to the very large amounts
of data that are now being routinely collected by e-commerce sites, molecular biologists,
sociologists, credit card companies, astrophysicists, and more. As computing becomes less
expensive (and, in some cases, as experimental technologies make it possible to measure a
growing number of features), the scale of data being collected across a broad range of fields
will continue to increase at a rapid clip.
From the perspective of a classical statistician, more data are typically better data. After
all, most classical statistical methods and results rely on the assumption that the sample
size is extremely large. If Big Data were only characterized by enormous sample sizes, then
its challenges would be purely computational in nature. However, two aspects inherent to
much of today’s Big Data lead to statistical challenges:
1. While the sample size might be large, the number of features for which measure-
ments are available is often much larger. For instance, a web search company
might track every single query made by each of its users. In this setting, the
number of users (the sample size) is huge, but the number of possible queries
(the set of features) is orders of magnitude larger. Similarly, in biology, new
technologies have made it possible to obtain a detailed molecular snapshot of the
activity of a tissue sample or even a single cell; however, this snapshot is expensive
and so typically the sample size is quite small relative to the number of molecular
measurements that are obtained. These examples are high dimensional: there
are many more features than observations.
2. When faced with old-fashioned small data, statisticians often seek to answer very
simple questions, such as "Is this parameter nonzero?" and "Is X correlated with
Y?" However, on the basis of Big Data, we often ask far more complex questions,
such as "Which of these 1,000,000 parameters are nonzero?" and "What are the
conditional dependence relationships among these 30,000 features?"
In essence, people ask very complex questions on the basis of Big Data; answering these
questions leads to statistical challenges. Because so much data are available and so many
features have been measured, we become ambitious and seek to fit very complex models.
However, this level of complexity often cannot be supported by the available sample size,
leading to a host of problems, such as overfitting. Even seemingly simple tasks—such as
fitting a linear model to predict a response on the basis of a set of features—can be quite
challenging with Big Data.
In the classical statistical setting, we often estimate a parameter vector θ as the
minimizer of some loss function L(θ), which might be the negative log-likelihood of the data
under some model. In the context of Big Data, reducing the complexity of the parameter
space is of paramount importance. We can do this by making an assumption about the
structure of θ—for instance, that it is sparse or low-rank—and selecting a penalty function
P that encourages this structure in θ. We can then estimate θ as
θ̂, a minimizer of

    minimize_θ { L(θ) + λ P(θ) },    (16.1)

where λ is a nonnegative tuning parameter that controls the tradeoff between the loss
function (which encourages the parameter estimate to fit the data) and the penalty function
(which encourages the parameter estimate to conform to our structural assumptions). For
a suitably chosen penalty function P, Equation 16.1 can overcome some of the problems
associated with the complexity of the parameter space, thereby yielding a good estimate
of θ. This makes Equation 16.1 a valuable framework when working with Big Data.
Typically, the loss function L is convex. If the penalty function P is also chosen to be
convex, then Equation 16.1 is a convex optimization problem. This means that every local
optimum is a global optimum and that we have access to a sophisticated set of tools for
finding the minimizer of Equation 16.1 (Boyd and Vandenberghe, 2004). This is particularly
important in the context of Big Data, for which efficient algorithms and implementations
are critical.
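To make the loss + penalty framework of Equation 16.1 concrete, here is a minimal Python sketch that assembles a generic penalized objective. The particular loss (squared error, as in Section 16.2.1), penalty (the ℓ₁ norm, as in Section 16.3.1), data, and value of λ are placeholders chosen purely for illustration; actually minimizing such an objective is the subject of Section 16.4.

    import numpy as np

    def penalized_objective(theta, loss, penalty, lam):
        # Generic "loss + penalty" criterion from Equation 16.1.
        return loss(theta) + lam * penalty(theta)

    # Illustrative ingredients: squared-error loss and an l1 (lasso) penalty,
    # evaluated on simulated data with n = 50 observations and p = 100 features.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 100))
    y = rng.standard_normal(50)
    loss = lambda beta: 0.5 * np.sum((y - X @ beta) ** 2)
    penalty = lambda beta: np.sum(np.abs(beta))

    value = penalized_objective(np.zeros(100), loss, penalty, lam=1.0)
    print(value)  # objective value at beta = 0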
In this chapter, we will explore the loss + penalty framework given in Equation 16.1 in
the setting where both L and P are convex functions. We will consider three motivating
examples: linear regression, matrix completion, and Gaussian graphical modeling. In
Section 16.2, we will explore the convex loss functions associated with these motivating
examples. In Section 16.3, we will introduce some convex penalty functions that can be used
to induce particular types of structure in the parameters. We will explore three algorithms
for solving convex optimization problems in Section 16.4 and will apply them to our three
motivating examples in Section 16.5. Finally, we close with a discussion in Section 16.6.
In this chapter, we will not consider the theoretical aspects of the estimators obtained
via the loss + penalty framework discussed here. For a very thorough exposition of these the-
oretical considerations, we refer the interested reader to Bühlmann and van de Geer (2011).
We use the following notational conventions. Random variables and random vectors are
capitalized (e.g., A), fixed scalars are in lowercase (e.g., a), fixed vectors are in lowercase
bold (e.g., a), and matrices are in capital bold (e.g., A). For a matrix A, A_j denotes its
jth column and A_ij denotes its (i, j)th element. The notation diag(a_i) indicates a diagonal
matrix with elements a_1, ..., a_p on the diagonal.
We typically will use Latin letters to represent the data and Greek letters to represent
parameters and optimization variables. We will use θ as the optimization variable for a
generic optimization problem, as in Equation 16.1. The solution to an optimization problem
with optimization variable θ will be denoted using a hat, that is, θ̂. In the special cases of
linear regression, matrix completion, and the Gaussian graphical model, we will represent
the parameters using β, Γ, and Ω, respectively.
16.2 Loss Functions
In this chapter, we consider convex loss functions, which arise quite often in the context
of complex statistical models. Often the loss function is motivated as the negative log-
likelihood for the data, under some distributional assumptions. Here, we describe three
convex loss functions that arise in Big Data applications. These will serve as illustrative
examples throughout this chapter.
16.2.1 Linear Regression
Consider the model
    Y = X^T β + ε,

where Y is a response variable, X = (X_1, ..., X_p)^T is a p-vector of features,
β = (β_1, ..., β_p)^T is a p-dimensional parameter vector, and ε is a noise term. Our goal is to
estimate the elements of β on the basis of n independent observations of X and Y.
Let X denote the n × p data matrix and y denote the response vector of length n. The
squared-error loss function takes the form

    L(β) = (1/2) ‖y − Xβ‖₂².    (16.2)

Minimizing Equation 16.2 leads to the standard least-squares estimator for β.
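As a minimal illustration, the least-squares estimate can be computed directly when n > p; the data below are simulated and the variable names are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5
    X = rng.standard_normal((n, p))
    beta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
    y = X @ beta_true + rng.standard_normal(n)

    # Minimizer of L(beta) = (1/2) * ||y - X beta||_2^2.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)

When p exceeds n, the least-squares estimator is no longer unique, which is one motivation for adding a structure-inducing penalty as in Equation 16.1.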
16.2.2 Matrix Completion
Suppose that we have an m×n data matrix X, which is a noisy observation of some unknown
matrix Γ. That is,
X = Γ + E,
where E is an m×n matrix of independent noise terms. Consider the task of estimating Γ in
the setting where only a subset of the elements of X are observed. We let O ⊆ {1, ..., m} ×
{1, ..., n} denote the set of indices of the elements of X that are observed. This is known
as the matrix completion problem and has been extensively investigated in the context of
user recommendation systems (Cai et al., 2010; Mazumder et al., 2010).
We could try to estimate Γ in this setting by minimizing the loss function
    L(Γ) = (1/2) ∑_{(i,j)∈O} (X_ij − Γ_ij)²    (16.3)

with respect to Γ. Unfortunately, the solution Γ̂ is trivial and not useful: Γ̂_ij equals X_ij
for (i, j) ∈ O and can take on any value for (i, j) ∉ O. In effect, the problem is that the
parameter space is too complex given the available data (the elements of the matrix X that
are in the set O).
We will see in Section 16.5 that if we are willing to make certain assumptions about the
structure of Γ—and if we encode those structural assumptions via an appropriate convex
penalty—then the loss function (Equation 16.3) can be successfully used to estimate the
matrix Γ.
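As a concrete sketch, the observed-entry loss in Equation 16.3 can be evaluated by recording O as a boolean mask. The simulated data, the 30% observation rate, and the function name below are illustrative, and the code only evaluates the loss; estimating Γ requires the nuclear norm penalty of Sections 16.3.3 and 16.5.2.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 20, 15
    Gamma = np.outer(rng.standard_normal(m), rng.standard_normal(n))  # rank-one "truth"
    X = Gamma + 0.1 * rng.standard_normal((m, n))                     # noisy observation
    observed = rng.random((m, n)) < 0.3                               # boolean mask encoding the set O

    def completion_loss(Gamma_hat):
        # L(Gamma) = (1/2) * sum over observed (i, j) of (X_ij - Gamma_ij)^2.
        resid = (X - Gamma_hat)[observed]
        return 0.5 * np.sum(resid ** 2)

    print(completion_loss(np.zeros((m, n))))  # loss at the all-zero matrix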
16.2.3 Precision Matrix Estimation
Consider a p-dimensional multivariate normal random vector,
    X = (X_1, ..., X_p)^T ∼ N_p(μ, Σ),

where μ_j = E[X_j] and σ_jk = Cov(X_j, X_k). Classical statistical theory tells us that a zero
element of the precision matrix, Ω = Σ⁻¹, corresponds to a pair of random variables that
are conditionally independent—that is, independent given the other p − 2 random variables
(see, e.g., Mardia et al. 1979). Therefore, the set of conditional dependence relationships,
referred to as a graphical model, is given by the sparsity pattern of Ω (see, e.g., Koller and
Friedman 2009).
Let X denote an n × p data matrix, for which the rows represent n independent draws
from the random vector X. Then a natural loss function takes the form
    L(Ω) = −log det Ω + trace(SΩ),    (16.4)
where S is the empirical covariance matrix of X. The loss function (Equation 16.4) is convex
in Ω; notably, it is not convex in Σ.
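The loss in Equation 16.4 is straightforward to compute from data. Below is a minimal sketch with simulated observations; the function name is illustrative, np.cov uses the unbiased (n − 1) scaling for the empirical covariance matrix, and slogdet is used for numerical stability.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 10
    X = rng.standard_normal((n, p))   # n independent draws (true covariance = identity)
    S = np.cov(X, rowvar=False)       # empirical covariance matrix (p x p)

    def gaussian_loss(Omega):
        # L(Omega) = -log det(Omega) + trace(S Omega); Omega must be positive definite.
        sign, logdet = np.linalg.slogdet(Omega)
        if sign <= 0:
            return np.inf
        return -logdet + np.trace(S @ Omega)

    print(gaussian_loss(np.eye(p)))   # loss at Omega = identity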
16.3 Structure-Inducing Convex Penalties
In this section, we introduce three convex penalties that induce three very simple types of
structure. The lasso induces sparsity on the elements of a vector; the group lasso induces
sparsity on groups of elements of a vector; and the nuclear norm induces sparsity on the
singular values of a matrix. Although we discuss only these three penalties in detail, in
Table 16.1, we demonstrate the breadth of structures attainable through convex penalties.
16.3.1 The Lasso
In the setting of Section 16.2.1, suppose that we wish to perform linear regression in such a
way that no more than k of the elements of β are estimated to be nonzero. That is, rather
than minimizing the loss function (16.2) with respect to β, we could consider the problem
    minimize_β  (1/2) ‖y − Xβ‖₂²  subject to  ‖β‖₀ ≤ k.    (16.5)
TABLE 16.1
Examples of convex penalties from the recent literature.
Name | Structure | Example | References
Lasso | Sparsity | β̂_j = 0 | Tibshirani (1996)
Group lasso (GL) | Group sparsity | β̂_g = 0 | Yuan and Lin (2007a)
Hierarchical GL | Hierarchical sparsity | β̂_g = 0 ⇒ β̂_h = 0 | Zhao et al. (2009)
Overlapping GL | Zero unions | β̂_{g∪h} = 0 | Jenatton et al. (2011)
Latent overlapping GL | Nonzero unions | β̂_{g∪h} ≠ 0 | Obozinski et al. (2011a)
Fused lasso | Sparse differences | β̂_j = β̂_{j+1} | Tibshirani et al. (2005)
ℓ₁ trend filtering | Sparse discrete derivatives | β̂_j − β̂_{j−1} = β̂_{j+1} − β̂_j | Kim et al. (2009); Tibshirani (2014)
Isotonic lasso | Monotonicity | β̂_j ≤ β̂_{j+1} | Tibshirani et al. (2011)
OSCAR | Clustered coefficients | β̂_j = β̂_k | Bondell and Reich (2008)
Nuclear norm | Low rank | σ_j(Ω̂) = 0 | Fazel (2002)
Group fusion | Identical vectors | β̂ = γ̂ | Hocking et al. (2011), among others
Generalized lasso | Sparse linear combinations | d^T β̂ = 0 | Tibshirani and Taylor (2011); She (2010)
Note: For each penalty, a description of the structure induced and a reference are provided.
Here ‖β‖₀ denotes the ℓ₀ norm, or the cardinality (number of nonzero elements) of β.
Equation 16.5 is known as best-subsets regression.
Unfortunately, the ℓ₀ norm is nonconvex, and Equation 16.5 cannot be efficiently solved
when p is large. To address this problem, we can replace the ℓ₀ norm with an ℓ₁ norm to
get a convex relaxation of the above problem:
    minimize_β  (1/2) ‖y − Xβ‖₂²  subject to  ∑_{j=1}^p |β_j| ≤ c.    (16.6)
For small values of c, the value of β that minimizes Equation 16.6 will tend to be sparse—
that is, to have elements exactly equal to zero. Equation 16.6 is equivalent to (Boyd and
Vandenberghe, 2004)
    minimize_β  { (1/2) ‖y − Xβ‖₂² + λ ∑_{j=1}^p |β_j| },    (16.7)
where λ is some nonnegative tuning parameter. Roughly speaking, as λ increases, the
solution β̂ to Equation 16.7 will tend to become sparser.
The penalty function in Equation 16.7, P(β) = ∑_{j=1}^p |β_j|, is known as a lasso (or ℓ₁)
penalty (Tibshirani, 1996). It has been extensively explored in combination with a number
of loss functions for a variety of problems, such as linear regression, generalized linear
modeling (Friedman et al., 2010), survival analysis (Tibshirani, 1997), graphical modeling
(Yuan and Lin, 2007b), and more (Tibshirani, 2011). We will apply it to graphical modeling
in Section 16.5.3.
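To illustrate, one simple way to compute an approximate solution to Equation 16.7 is proximal gradient descent (ISTA), which alternates a gradient step on the squared-error loss with elementwise soft-thresholding. This is a minimal sketch with simulated data, an arbitrary value of λ, and no convergence checks, not a production solver; proximal gradient methods are discussed in Section 16.4.2.

    import numpy as np

    def soft_threshold(z, t):
        # Elementwise soft-thresholding: shrink z toward zero by t.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        # Approximately minimize (1/2)||y - X beta||_2^2 + lam * ||beta||_1.
        p = X.shape[1]
        beta = np.zeros(p)
        step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest eigenvalue of X^T X)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y)          # gradient of the squared-error loss
            beta = soft_threshold(beta - step * grad, step * lam)
        return beta

    rng = np.random.default_rng(0)
    n, p = 100, 200                              # high dimensional: p > n
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0                          # only 5 truly nonzero coefficients
    y = X @ beta_true + rng.standard_normal(n)

    beta_hat = lasso_ista(X, y, lam=10.0)
    print(np.sum(beta_hat != 0))                 # number of nonzero estimated coefficients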