Convergence of the ADMM algorithm is guaranteed under some mild conditions
(Eckstein and Bertsekas, 1992). However, ADMM can be slow in practice, in that many
iterations of the updates for θ, α, and u in Algorithm 16.3 may be required for convergence
to a reasonable level of accuracy. Strategies for varying the value of the ρ parameter
in Equation 16.19 have been proposed to reduce the number of iterations required for
convergence (Boyd et al., 2010).
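A commonly used heuristic of this kind is residual balancing: increase ρ when the primal residual is much larger than the dual residual, and decrease it in the opposite case. Below is a minimal Python sketch of such a rule, with the conventional choices μ = 10 and τ = 2; the function name and default values are ours, not taken from the source.

import numpy as np

def update_rho(rho, r_primal, s_dual, mu=10.0, tau=2.0):
    # Residual-balancing update for the ADMM penalty parameter rho.
    # r_primal and s_dual are the current primal and dual residuals.
    r_norm = np.linalg.norm(r_primal)
    s_norm = np.linalg.norm(s_dual)
    if r_norm > mu * s_norm:
        return rho * tau   # primal residual dominates: increase the penalty
    if s_norm > mu * r_norm:
        return rho / tau   # dual residual dominates: relax the penalty
    return rho

If the scaled form of the dual variable is used, it must also be rescaled whenever ρ changes.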
16.5 Applications
In this section, we synthesize the ideas presented thus far by considering three specific
applications. In each case, we combine a loss function from Section 16.2, a penalty from
Section 16.3, and an algorithm from Section 16.4.
16.5.1 Multivariate Regression with the Group Lasso
Consider a generalization of the linear regression model introduced in Section 16.2.1 to a multivariate response, Y ∈ ℝ^m:

    Y = BX + ε.

Here B is an m × p matrix of coefficients, with each row corresponding to a response variable, and ε ∈ ℝ^m is a vector of independent noise terms.
Suppose that we observe n independent draws from this model, giving Y ∈ ℝ^{n×m} and X ∈ ℝ^{n×p}. Then a natural loss function is

    L(B) = (1/2) ‖Y − X B^T‖_F^2.    (16.21)
By writing y = vec(Y) and β = vec(B^T), we see that this problem can be reexpressed as in Equation 16.2 of Section 16.2.1, with design matrix I_m ⊗ X.
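To see this concretely, the following small Python check (an illustration we add here, using the column-stacking convention for vec and arbitrary dimensions) verifies the identity vec(X B^T) = (I_m ⊗ X) vec(B^T) that underlies this reexpression:

import numpy as np

# Small random instance; vec() below stacks columns (Fortran order).
rng = np.random.default_rng(0)
n, p, m = 6, 4, 3
X = rng.standard_normal((n, p))
B = rng.standard_normal((m, p))

lhs = (X @ B.T).flatten(order="F")                    # vec(X B^T)
rhs = np.kron(np.eye(m), X) @ B.T.flatten(order="F")  # (I_m kron X) vec(B^T)
assert np.allclose(lhs, rhs)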
Instead of simply estimating B as the minimizer of Equation 16.21, which would be
equivalent to fitting m separate linear regression models, we can apply a penalty that
allows us to borrow strength across the m problems. In machine learning, this sharing of
information across related problems is known as multitask learning. For example, if we
believe that certain predictors are completely irrelevant to all m responses, then we could
impose a group lasso penalty on the columns of B (Argyriou et al., 2008; Obozinski et al.,
2011b),
    minimize_B  (1/2) ‖Y − X B^T‖_F^2 + λ Σ_{j=1}^p ‖B_j‖_2,    (16.22)

where B_j ∈ ℝ^m denotes the jth column of B.
As a result of this penalty, the coefficient estimates for the jth feature will equal zero in all
m regression problems or else will be nonzero in all m regression problems.
Because the penalty in Equation 16.22 is separable across the columns of B, we can apply BCD to solve this problem. In the tth iteration, the block coordinate update for B_j is given by

    B_j^{(t)} ← argmin_{B_j}  (1/2) ‖R_j − X_j B_j^T‖_F^2 + λ ‖B_j‖_2,    (16.23)

where

    R_j = Y − Σ_{k<j} X_k (B_k^{(t)})^T − Σ_{k>j} X_k (B_k^{(t−1)})^T
is the jth partial residual matrix. Assume, for ease of notation, that ‖X_j‖_2 = 1. After some algebra, one finds that the minimizer of Equation 16.23 is unique and that

    B_j^{(t)} = argmin_{B_j}  (1/2) ‖B_j − (R_j)^T X_j‖_2^2 + λ ‖B_j‖_2.
From Equations 16.11 and 16.12, the solution to the above is given by

    B_j^{(t)} = max( 1 − λ / ‖(R_j)^T X_j‖_2 , 0 ) (R_j)^T X_j.
Algorithm 16.4 displays the BCD approach to solving Equation 16.22. Further details are
provided in Simon et al. (2013a).
Algorithm 16.4 BCD for Multivariate Regression with Group Lasso
Input: X, Y, λ ≥ 0, B^{(0)}, t = 1.
repeat
    for j = 1, . . . , p do
        R_j ← Y − Σ_{k<j} X_k (B_k^{(t)})^T − Σ_{k>j} X_k (B_k^{(t−1)})^T
        B_j^{(t)} ← max( 1 − λ / ‖(R_j)^T X_j‖_2 , 0 ) (R_j)^T X_j
    end
    t ← t + 1
until convergence;
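As a concrete illustration, here is a minimal Python sketch of Algorithm 16.4, assuming (as above) that each column of X has been scaled to have unit norm; the function name, stopping rule, and default arguments are ours rather than part of the source.

import numpy as np

def group_lasso_multitask_bcd(X, Y, lam, n_iter=100, tol=1e-6):
    # BCD for multivariate regression with a group lasso penalty on the
    # columns of B (Equation 16.22). Assumes each column of X has unit norm.
    # X: (n, p) design matrix; Y: (n, m) response matrix.
    n, p = X.shape
    m = Y.shape[1]
    B = np.zeros((m, p))  # B[:, j] holds the coefficients of feature j across the m responses
    for _ in range(n_iter):
        B_old = B.copy()
        for j in range(p):
            # Partial residual R_j: remove the fitted contribution of every feature except j.
            R_j = Y - X @ B.T + np.outer(X[:, j], B[:, j])
            z = R_j.T @ X[:, j]          # (R_j)^T X_j, an m-vector
            znorm = np.linalg.norm(z)
            # Group soft-thresholding update (Equations 16.11 and 16.12).
            B[:, j] = max(1.0 - lam / znorm, 0.0) * z if znorm > 0 else 0.0
        if np.linalg.norm(B - B_old) <= tol * (1.0 + np.linalg.norm(B_old)):
            break
    return B

Each inner step recomputes the full residual for simplicity; maintaining the residual incrementally across the sweep over j would reduce the cost per iteration.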
16.5.2 Matrix Completion with the Nuclear Norm
Consider the matrix completion problem introduced in Section 16.2.2. As noted in that
section, without additional assumptions on the structure of Γ, the observed entries of X do
not provide any information about the unobserved entries of Γ.
A natural structural assumption is that Γ has low rank—that is, a small number of
factors explains the variability in Γ. Ji and Ye (2009), Cai et al. (2010), Mazumder et al.
(2010), Ma et al. (2011), and others combine Equation 16.3 with the nuclear norm penalty
introduced in Section 16.3.3 to yield the optimization problem
    minimize_Γ  (1/2) Σ_{(i,j)∈O} (X_ij − Γ_ij)^2 + λ ‖Γ‖_*.    (16.24)
The nuclear norm penalty is not separable in the sense discussed in Section 16.4.1, and
therefore BCD is not a natural choice for solving Equation 16.24. Instead, we apply PGM,
as in Ji and Ye (2009). Because
    ∂L/∂Γ_ij =  Γ_ij − X_ij   if (i, j) ∈ O
                0             otherwise,
it follows that ∇L has Lipschitz constant 1. Thus we can take the step size, s = 1. Therefore,
in the tth iteration, the gradient step takes the form
    [Γ^{(t−1)} − ∇L(Γ^{(t−1)})]_ij =  X_ij            if (i, j) ∈ O
                                      Γ_ij^{(t−1)}    if (i, j) ∉ O.    (16.25)
Combining this with the nuclear norm proximal operator given in Section 16.3.3 leads to Algorithm 16.5 for solving Equation 16.24.

Algorithm 16.5 PGM for Matrix Completion
Input: X, λ ≥ 0, Γ^{(0)}, t = 1.
repeat
    Θ_ij ← X_ij  if (i, j) ∈ O,  and  Θ_ij ← Γ_ij^{(t−1)}  if (i, j) ∉ O
    [U, diag(σ_j), V] ← SingularValueDecomposition(Θ)
    Γ^{(t)} ← U diag(S(σ_j)) V^T
    t ← t + 1
until convergence;
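The following is a minimal Python sketch of Algorithm 16.5, assuming the observed set O is supplied as a boolean mask and that the soft-thresholding of the singular values is at level λ (step size s = 1); the function name and stopping rule are ours, and the dense SVD below does not exploit the structure discussed next.

import numpy as np

def matrix_completion_pgm(X, mask, lam, n_iter=100, tol=1e-6):
    # Proximal gradient method for nuclear-norm-penalized matrix completion.
    # X:    (n, m) data matrix; entries outside the observed set may be arbitrary.
    # mask: (n, m) boolean array, True for observed entries (the set O).
    Gamma = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        # Gradient step (Equation 16.25): observed entries come from X,
        # unobserved entries are carried over from the current iterate.
        Theta = np.where(mask, X, Gamma)
        # Proximal step: soft-threshold the singular values at level lam.
        U, sigma, Vt = np.linalg.svd(Theta, full_matrices=False)
        sigma_thr = np.maximum(sigma - lam, 0.0)
        Gamma_new = (U * sigma_thr) @ Vt
        if np.linalg.norm(Gamma_new - Gamma) <= tol * (1.0 + np.linalg.norm(Gamma)):
            return Gamma_new
        Gamma = Gamma_new
    return Gamma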
The repeated singular value decompositions in Algorithm 16.5 are computationally
burdensome. Mazumder et al. (2010) observe that Γ^{(t−1)} − ∇L(Γ^{(t−1)}) in Equation 16.25 is the sum of a sparse matrix and a low-rank matrix, and they exploit this structure to perform efficient computations.
16.5.3 Graphical Models with the Lasso
Consider the problem of estimating the conditional dependence relationships among
elements of a multivariate normal random vector. As discussed in Section 16.2.3, this
amounts to identifying the zero elements of the corresponding precision matrix. We can
combine the loss function in Equation 16.4 with an ℓ_1 penalty to arrive at the convex optimization problem

    minimize_Ω  −log det Ω + trace(SΩ) + λ Σ_{i,j} |Ω_ij|.    (16.26)
The solution to Equation 16.26 serves as an estimate for Σ^{−1}. Equation 16.26 is known as the graphical lasso and has been studied extensively from a computational (Friedman et al.,
the graphical lasso and has been studied extensively from a computational (Friedman et al.,
2007; d’Aspremont et al., 2008; Yuan, 2008; Boyd et al., 2010; Scheinberg et al., 2010; Hsieh
et al., 2011; Witten et al., 2011; Mazumder and Hastie, 2012a; Tran-Dinh et al., 2015) and
theoretical (Yuan and Lin, 2007b; Rothman et al., 2008; Ravikumar et al., 2011) perspective.
When λ is sufficiently large, the resulting estimate for Σ^{−1} will be sparse, leading to a sparse estimate for the conditional independence graph.
We now consider the problem of solving Equation 16.26. BCD has been used to solve
Equation 16.26 (Friedman et al., 2007; d’Aspremont et al., 2008; Mazumder and Hastie,
2012b); however, the resulting algorithm is somewhat complex as each block coordinate
update requires solving a lasso problem. By contrast, an ADMM algorithm for Equation 16.26 takes a particularly simple and elegant form (Boyd et al., 2010).
To apply ADMM to solve Equation 16.26, we rewrite the problem as
    minimize_{Ω, Ω̃}  −log det Ω + trace(SΩ) + λ Σ_{i,j} |Ω̃_ij|   subject to   Ω − Ω̃ = 0.    (16.27)
The augmented Lagrangian (Equation 16.19) then takes the form
    L_ρ(Ω, Ω̃, U) = −log det Ω + trace(SΩ) + λ Σ_{i,j} |Ω̃_ij| + trace(U^T(Ω − Ω̃)) + (ρ/2) ‖Ω − Ω̃‖_F^2.
To apply Algorithm 16.3, we need to find the value of Ω that minimizes L_ρ(Ω, Ω̃, U) with Ω̃ and U held fixed, and to find the value of Ω̃ that minimizes L_ρ(Ω, Ω̃, U) with Ω and U held fixed. Completing the square reveals that the second of these problems entails a simple update involving the soft-thresholding operator (Equation 16.9). To minimize L_ρ(Ω, Ω̃, U) with respect to Ω with Ω̃ and U held fixed, we note that the solution will satisfy the optimality condition

    Ω^{−1} − ρΩ = S + U − ρΩ̃.

This means that the eigenvectors of Ω are the same as the eigenvectors of S + U − ρΩ̃, and that the eigenvalues of Ω are a simple function of the eigenvalues of S + U − ρΩ̃. This function, which can be derived via the quadratic formula, takes the form

    H(Z) = V diag( (−e_i + √(e_i^2 + 4ρ)) / (2ρ) ) V^T,    (16.28)

where V diag(e_i) V^T is the eigendecomposition of (a symmetric matrix) Z. (Indeed, each eigenvalue ω_i of Ω must satisfy ω_i^{−1} − ρω_i = e_i, a quadratic in ω_i whose unique positive root is the expression appearing in Equation 16.28.) The function H will be used in Algorithm 16.6.
Details of the ADMM algorithm for solving Equation 16.26 are given in Algorithm 16.6. The updates for Ω̃ and U can be performed in O(p^2) operations, for S a p × p matrix. By contrast, the update for Ω requires that we compute the eigendecomposition of a p × p matrix; this requires O(p^3) operations.
Algorithm 16.6 ADMM for the Graphical Lasso
Input: S ⪰ 0, λ ≥ 0, ρ > 0, Ω̃^{(0)}, U^{(0)}, t = 1.
repeat
    Ω^{(t)} ← H(S + U^{(t−1)} − ρΩ̃^{(t−1)})    [H defined in (16.28)]
    Ω̃^{(t)} ← S(U^{(t−1)} + ρΩ^{(t)})    [S defined in (16.9)]
    U^{(t)} ← U^{(t−1)} + ρ(Ω^{(t)} − Ω̃^{(t)})
    t ← t + 1
until convergence;
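For concreteness, here is a minimal Python sketch of Algorithm 16.6. It assumes that the soft-thresholding operator in Equation 16.9 is the usual elementwise rule sign(x) max(|x| − λ, 0); accordingly, the Ω̃ update is written below in the equivalent form S_{λ/ρ}(Ω + U/ρ). The function name, initialization, and stopping rule are ours.

import numpy as np

def graphical_lasso_admm(S, lam, rho=1.0, n_iter=200, tol=1e-6):
    # ADMM for the graphical lasso (Equation 16.26).
    # S: (p, p) sample covariance matrix; lam: l1 penalty parameter.
    p = S.shape[0]
    Omega_tilde = np.eye(p)
    U = np.zeros((p, p))
    for _ in range(n_iter):
        # Omega update: Omega = H(S + U - rho * Omega_tilde), computed via
        # an eigendecomposition as in Equation 16.28.
        e, V = np.linalg.eigh(S + U - rho * Omega_tilde)
        omega_eigs = (-e + np.sqrt(e ** 2 + 4.0 * rho)) / (2.0 * rho)
        Omega = (V * omega_eigs) @ V.T
        # Omega_tilde update: elementwise soft-thresholding of Omega + U/rho
        # at level lam/rho (equivalently, (1/rho) * S_lam(U + rho * Omega)).
        A = Omega + U / rho
        Omega_tilde_new = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
        # Dual update.
        U = U + rho * (Omega - Omega_tilde_new)
        # Stop when the primal residual and the change in Omega_tilde are small.
        if (np.linalg.norm(Omega - Omega_tilde_new) < tol
                and np.linalg.norm(Omega_tilde_new - Omega_tilde) < tol):
            Omega_tilde = Omega_tilde_new
            break
        Omega_tilde = Omega_tilde_new
    return Omega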
16.6 Discussion
The growing scale of modern datasets in research and in industry has opened the door to
increasingly ambitious demands of data. It has become routine to fit models that are far
more complex than would ever have been imagined previously. In many areas, the increase
in the complexity of the models considered is in fact outpacing the growth of the sample
sizes. This can lead to severe overfitting, unless problem-specific structural assumptions are
built into the estimation procedure.
In this chapter, we have reviewed a powerful framework for making tractable what
would otherwise be hopelessly underspecified statistical problems. The key idea is to strike a
balance between good model fit to the data (through the loss function) and model simplicity
(through the penalty function). This tension between flexibility and simplicity is not a new
concept to statisticians, who refer to it as the bias-variance tradeoff (see, e.g., Hastie et al.
2009).
Within the loss + penalty framework of Equation 16.1, the value of λ is used to control
where our estimator θ̂ lies in this tradeoff. When λ is small, we have a highly flexible model,
which in the case of insufficient observations could be prone to overfitting. At the other
extreme, when λ is large, we get a model that is less sensitive to the data observed, which
is beneficial when there are not enough observations to support a more complex model.
The optimal choice of the parameter λ depends on many factors that are unknowable to a
data analyst. Conceptually, it depends on how much information the data contain about
the unknown parameter θ, on the choice of penalty, and on the structure possessed by θ.
In practice, cross-validation, AIC, and BIC are often used to select a value of λ (see, e.g.,
Hastie et al. 2009).
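As an illustration of the cross-validation approach, here is a schematic Python sketch of K-fold selection of λ over a grid; the fit and predict arguments are placeholders for any of the penalized estimators discussed in this chapter, and the squared-error criterion is only one possible choice.

import numpy as np

def select_lambda_cv(X, Y, fit, predict, lambdas, n_folds=5, seed=0):
    # Schematic K-fold cross-validation for choosing the tuning parameter lambda.
    # fit(X_train, Y_train, lam) returns a fitted model; predict(model, X_val)
    # returns predictions for the held-out data.
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    cv_error = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for k in range(n_folds):
            train, val = folds != k, folds == k
            model = fit(X[train], Y[train], lam)
            cv_error[i] += np.mean((Y[val] - predict(model, X[val])) ** 2)
    return lambdas[int(np.argmin(cv_error))]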
In recent years, the theoretical properties of the solution to Equation 16.1 have been
studied extensively. A common thread in this work is that, under suitable assumptions, penalized estimators can perform reliably even when the complexity of the model grows faster than the sample size, a regime in which their classical unpenalized counterparts fail completely. Many authors have analyzed Equation 16.1 in the context of specific loss functions and specific penalties (see Bühlmann and van de Geer 2011 and references
therein for a comprehensive overview of theoretical results). Negahban et al. (2012) consider
Equation 16.1 in considerable generality.
Throughout this chapter, we have assumed that the loss and penalty functions are
convex. As mentioned earlier, convex formulations are attractive from a computational
standpoint—we do not have to worry about suboptimality of local minima, and we can
apply standard optimization methods, such as those presented in Section 16.4. Beyond this,
the convex framework has an advantage in that a solution θ̂ to Equation 16.1 has a simple
characterization in terms of a set of optimality conditions (see, e.g., Boyd and Vandenberghe
2004); this facilitates the study of its theoretical properties.
In this chapter, we have ignored a large literature of penalized estimation methods
that make use of nonconvex penalties (Fan and Li, 2001; Zhang, 2010). These methods
yield estimators that closely match the desired problem structure and that have attractive
theoretical properties. However, they pay a price in terms of local minima and difficulties
in computation.
In recent years, the success of penalized estimation techniques has led to a rich interplay
between the fields of statistics and optimization. For example, Agarwal et al. (2012) revisit
the theoretical convergence rate of PGM and show that a faster rate is obtained up to the
level of accuracy that is of interest in a statistical context. Another recent line of work
studies the tradeoffs between computational efficiency and statistical optimality (Berthet
and Rigollet, 2013; Chandrasekaran and Jordan, 2013; Ma and Wu, 2015).
In this chapter, we have presented a flexible toolkit for fitting complex models to large-
scale data. As the data continue to grow in scale, so too will the complexity of the questions
being asked, and consequently the need for penalized estimation techniques to answer those
questions.
Acknowledgments
JB was supported in part by NSF Award DMS-1405746 and DW was supported in part by
NSF CAREER Award DMS-1252624.
References
Agarwal, A., Negahban, S. and Wainwright, M. J. (2012), Fast global convergence of
gradient methods for high-dimensional statistical recovery, The Annals of Statistics 40(5),
2452–2482.