Convergence of the ADMM algorithm is guaranteed under some mild conditions
(Eckstein and Bertsekas, 1992). However, ADMM can be slow in practice, in that many
iterations of the updates for θ, α, and u in Algorithm 16.3 may be required for convergence
to a reasonable level of accuracy. Strategies for varying the value of the ρ parameter
in Equation 16.19 have been proposed to reduce the number of iterations required for
convergence (Boyd et al., 2010).
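A commonly used heuristic of this kind is residual balancing: increase ρ when the primal residual is much larger than the dual residual, and decrease it in the opposite case. Below is a minimal Python sketch of such a rule, with the conventional choices μ = 10 and τ = 2; the function name and default values are ours, not taken from the source.

import numpy as np

def update_rho(rho, r_primal, s_dual, mu=10.0, tau=2.0):
    # Residual-balancing update for the ADMM penalty parameter rho.
    # r_primal and s_dual are the current primal and dual residuals.
    r_norm = np.linalg.norm(r_primal)
    s_norm = np.linalg.norm(s_dual)
    if r_norm > mu * s_norm:
        return rho * tau   # primal residual dominates: increase the penalty
    if s_norm > mu * r_norm:
        return rho / tau   # dual residual dominates: relax the penalty
    return rho

If the scaled form of the dual variable is used, it must also be rescaled whenever ρ changes.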
16.5 Applications
In this section, we synthesize the ideas presented thus far by considering three specific
applications. In each case, we combine a loss function from Section 16.2, a penalty from
Section 16.3, and an algorithm from Section 16.4.
16.5.1 Multivariate Regression with the Group Lasso
Consider a generalization of the linear regression model introduced in Section 16.2.1 to a multivariate response, Y ∈ ℝ^m:

    Y = BX + ε.

Here B is an m × p matrix of coefficients, with each row corresponding to a response variable, and ε ∈ ℝ^m is a vector of independent noise terms.
Suppose that we observe n independent draws from this model, giving Y ∈ ℝ^{n×m} and X ∈ ℝ^{n×p}. Then a natural loss function is

    L(B) = (1/2) ‖Y − X B^T‖_F^2.    (16.21)
By writing y = vec(Y) and β = vec(B^T), we see that this problem can be reexpressed as in Equation 16.2 of Section 16.2.1, with design matrix I_m ⊗ X.
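To see this concretely, the following small Python check (an illustration we add here, using the column-stacking convention for vec and arbitrary dimensions) verifies the identity vec(X B^T) = (I_m ⊗ X) vec(B^T) that underlies this reexpression:

import numpy as np

# Small random instance; vec() below stacks columns (Fortran order).
rng = np.random.default_rng(0)
n, p, m = 6, 4, 3
X = rng.standard_normal((n, p))
B = rng.standard_normal((m, p))

lhs = (X @ B.T).flatten(order="F")                    # vec(X B^T)
rhs = np.kron(np.eye(m), X) @ B.T.flatten(order="F")  # (I_m kron X) vec(B^T)
assert np.allclose(lhs, rhs)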
Instead of simply estimating B as the minimizer of Equation 16.21, which would be
equivalent to fitting m separate linear regression models, we can apply a penalty that
allows us to borrow strength across the m problems. In machine learning, this sharing of
information across related problems is known as multitask learning. For example, if we
believe that certain predictors are completely irrelevant to all m responses, then we could
impose a group lasso penalty on the columns of B (Argyriou et al., 2008; Obozinski et al.,
2011b),
    minimize_B  (1/2) ‖Y − X B^T‖_F^2 + λ Σ_{j=1}^p ‖B_j‖_2,    (16.22)

where B_j ∈ ℝ^m denotes the jth column of B.
As a result of this penalty, the coefficient estimates for the jth feature will equal zero in all
m regression problems or else will be nonzero in all m regression problems.
Because the penalty in Equation 16.22 is separable across the columns of B, we can apply BCD to solve this problem. In the tth iteration, the block coordinate update for B_j is given by

    B_j^{(t)} ← argmin_{B_j}  (1/2) ‖R_j − X_j B_j^T‖_F^2 + λ ‖B_j‖_2,    (16.23)

where

    R_j = Y − Σ_{k<j} X_k (B_k^{(t)})^T − Σ_{k>j} X_k (B_k^{(t−1)})^T
is the jth partial residual matrix. Assume, for ease of notation, that ‖X_j‖_2 = 1. After some algebra, one finds that the minimizer of Equation 16.23 is unique and that

    B_j^{(t)} = argmin_{B_j}  (1/2) ‖B_j − (R_j)^T X_j‖_2^2 + λ ‖B_j‖_2.
From Equations 16.11 and 16.12, the solution to the above is given by

    B_j^{(t)} = max( 1 − λ / ‖(R_j)^T X_j‖_2 , 0 ) (R_j)^T X_j.
Algorithm 16.4 displays the BCD approach to solving Equation 16.22. Further details are
provided in Simon et al. (2013a).
Algorithm 16.4 BCD for Multivariate Regression with Group Lasso
Input: X, Y, λ ≥ 0, B^{(0)}, t = 1.
repeat
    for j = 1, . . . , p do
        R_j ← Y − Σ_{k<j} X_k (B_k^{(t)})^T − Σ_{k>j} X_k (B_k^{(t−1)})^T
        B_j^{(t)} ← max( 1 − λ / ‖(R_j)^T X_j‖_2 , 0 ) (R_j)^T X_j
    end
    t ← t + 1
until convergence;
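As a concrete illustration, here is a minimal Python sketch of Algorithm 16.4, assuming (as above) that each column of X has been scaled to have unit norm; the function name, stopping rule, and default arguments are ours rather than part of the source.

import numpy as np

def group_lasso_multitask_bcd(X, Y, lam, n_iter=100, tol=1e-6):
    # BCD for multivariate regression with a group lasso penalty on the
    # columns of B (Equation 16.22). Assumes each column of X has unit norm.
    # X: (n, p) design matrix; Y: (n, m) response matrix.
    n, p = X.shape
    m = Y.shape[1]
    B = np.zeros((m, p))  # B[:, j] holds the coefficients of feature j across the m responses
    for _ in range(n_iter):
        B_old = B.copy()
        for j in range(p):
            # Partial residual R_j: remove the fitted contribution of every feature except j.
            R_j = Y - X @ B.T + np.outer(X[:, j], B[:, j])
            z = R_j.T @ X[:, j]          # (R_j)^T X_j, an m-vector
            znorm = np.linalg.norm(z)
            # Group soft-thresholding update (Equations 16.11 and 16.12).
            B[:, j] = max(1.0 - lam / znorm, 0.0) * z if znorm > 0 else 0.0
        if np.linalg.norm(B - B_old) <= tol * (1.0 + np.linalg.norm(B_old)):
            break
    return B

Each inner step recomputes the full residual for simplicity; maintaining the residual incrementally across the sweep over j would reduce the cost per iteration.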
16.5.2 Matrix Completion with the Nuclear Norm
Consider the matrix completion problem introduced in Section 16.2.2. As noted in that
section, without additional assumptions on the structure of Γ, the observed entries of X do
not provide any information about the unobserved entries of Γ.
A natural structural assumption is that Γ has low rank—that is, a small number of
factors explains the variability in Γ. Ji and Ye (2009), Cai et al. (2010), Mazumder et al.
(2010), Ma et al. (2011), and others combine Equation 16.3 with the nuclear norm penalty
introduced in Section 16.3.3 to yield the optimization problem
    minimize_Γ  (1/2) Σ_{(i,j)∈O} (X_ij − Γ_ij)^2 + λ ‖Γ‖_*.    (16.24)
The nuclear norm penalty is not separable in the sense discussed in Section 16.4.1, and
therefore BCD is not a natural choice for solving Equation 16.24. Instead, we apply PGM,
as in Ji and Ye (2009). Because
    ∂L/∂Γ_ij =  Γ_ij − X_ij   if (i, j) ∈ O
                0             otherwise,
it follows that ∇L has Lipschitz constant 1. Thus we can take the step size, s = 1. Therefore,
in the tth iteration, the gradient step takes the form
    [Γ^{(t−1)} − ∇L(Γ^{(t−1)})]_ij =  X_ij            if (i, j) ∈ O
                                      Γ_ij^{(t−1)}    if (i, j) ∉ O.    (16.25)
Combining this with the nuclear norm proximal operator given in Section 16.3.3 leads to Algorithm 16.5 for solving Equation 16.24.

Algorithm 16.5 PGM for Matrix Completion
Input: X, λ ≥ 0, Γ^{(0)}, t = 1.
repeat
    Θ_ij ← X_ij  if (i, j) ∈ O,  and  Θ_ij ← Γ_ij^{(t−1)}  if (i, j) ∉ O
    [U, diag(σ_j), V] ← SingularValueDecomposition(Θ)
    Γ^{(t)} ← U diag(S(σ_j)) V^T
    t ← t + 1
until convergence;
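The following is a minimal Python sketch of Algorithm 16.5, assuming the observed set O is supplied as a boolean mask and that the soft-thresholding of the singular values is at level λ (step size s = 1); the function name and stopping rule are ours, and the dense SVD below does not exploit the structure discussed next.

import numpy as np

def matrix_completion_pgm(X, mask, lam, n_iter=100, tol=1e-6):
    # Proximal gradient method for nuclear-norm-penalized matrix completion.
    # X:    (n, m) data matrix; entries outside the observed set may be arbitrary.
    # mask: (n, m) boolean array, True for observed entries (the set O).
    Gamma = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        # Gradient step (Equation 16.25): observed entries come from X,
        # unobserved entries are carried over from the current iterate.
        Theta = np.where(mask, X, Gamma)
        # Proximal step: soft-threshold the singular values at level lam.
        U, sigma, Vt = np.linalg.svd(Theta, full_matrices=False)
        sigma_thr = np.maximum(sigma - lam, 0.0)
        Gamma_new = (U * sigma_thr) @ Vt
        if np.linalg.norm(Gamma_new - Gamma) <= tol * (1.0 + np.linalg.norm(Gamma)):
            return Gamma_new
        Gamma = Gamma_new
    return Gamma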
The repeated singular value decompositions in Algorithm 16.5 are computationally
burdensome. Mazumder et al. (2010) observe that Γ^{(t−1)} − ∇L(Γ^{(t−1)}) in Equation 16.25 is the sum of a sparse matrix and a low-rank matrix, and they exploit this structure to perform efficient computations.
16.5.3 Graphical Models with the Lasso
Consider the problem of estimating the conditional dependence relationships among
elements of a multivariate normal random vector. As discussed in Section 16.2.3, this
amounts to identifying the zero elements of the corresponding precision matrix. We can
combine the loss function in Equation 16.4 with an ℓ_1 penalty to arrive at the convex optimization problem

    minimize_Ω  −log det Ω + trace(SΩ) + λ Σ_{i,j} |Ω_ij|.    (16.26)
The solution to Equation 16.26 serves as an estimate for Σ^{−1}. Equation 16.26 is known as the graphical lasso and has been studied extensively from a computational (Friedman et al.,
the graphical lasso and has been studied extensively from a computational (Friedman et al.,
2007; d’Aspremont et al., 2008; Yuan, 2008; Boyd et al., 2010; Scheinberg et al., 2010; Hsieh
et al., 2011; Witten et al., 2011; Mazumder and Hastie, 2012a; Tran-Dinh et al., 2015) and
theoretical (Yuan and Lin, 2007b; Rothman et al., 2008; Ravikumar et al., 2011) perspective.
When λ is sufficiently large, the resulting estimate for Σ^{−1} will be sparse, leading to a sparse estimate for the conditional independence graph.
We now consider the problem of solving Equation 16.26. BCD has been used to solve
Equation 16.26 (Friedman et al., 2007; d’Aspremont et al., 2008; Mazumder and Hastie,
2012b); however, the resulting algorithm is somewhat complex as each block coordinate
update requires solving a lasso problem. By contrast, an ADMM algorithm for Equation 16.26 takes a particularly simple and elegant form (Boyd et al., 2010).
To apply ADMM to solve Equation 16.26, we rewrite the problem as
    minimize_{Ω, Ω̃}  −log det Ω + trace(SΩ) + λ Σ_{i,j} |Ω̃_ij|   subject to   Ω − Ω̃ = 0.    (16.27)
The augmented Lagrangian (Equation 16.19) then takes the form
    L_ρ(Ω, Ω̃, U) = −log det Ω + trace(SΩ) + λ Σ_{i,j} |Ω̃_ij| + trace(U^T(Ω − Ω̃)) + (ρ/2) ‖Ω − Ω̃‖_F^2.
To apply Algorithm 16.3, we need to find the value of Ω that minimizes L_ρ(Ω, Ω̃, U) with Ω̃ and U held fixed, and to find the value of Ω̃ that minimizes L_ρ(Ω, Ω̃, U) with Ω and U held fixed. Completing the square reveals that the second of these problems entails a simple update involving the soft-thresholding operator (Equation 16.9). To minimize L_ρ(Ω, Ω̃, U) with respect to Ω with Ω̃ and U held fixed, we note that the solution will satisfy the optimality condition

    Ω^{−1} − ρΩ = S + U − ρΩ̃.

This means that the eigenvectors of Ω are the same as the eigenvectors of S + U − ρΩ̃, and that the eigenvalues of Ω are a simple function of the eigenvalues of S + U − ρΩ̃. This function, which can be derived via the quadratic formula, takes the form

    H(Z) = V diag( (−e_i + √(e_i^2 + 4ρ)) / (2ρ) ) V^T,    (16.28)

where V diag(e_i) V^T is the eigendecomposition of (a symmetric matrix) Z. (Indeed, each eigenvalue ω_i of Ω must satisfy ω_i^{−1} − ρω_i = e_i, a quadratic in ω_i whose unique positive root is the expression appearing in Equation 16.28.) The function H will be used in Algorithm 16.6.
Details of the ADMM algorithm for solving Equation 16.26 are given in Algorithm 16.6. The updates for Ω̃ and U can be performed in O(p^2) operations, for S a p × p matrix. By contrast, the update for Ω requires that we compute the eigendecomposition of a p × p matrix; this requires O(p^3) operations.
Algorithm 16.6 ADMM for the Graphical Lasso
Input: S ⪰ 0, λ ≥ 0, ρ > 0, Ω̃^{(0)}, U^{(0)}, t = 1.
repeat
    Ω^{(t)} ← H(S + U^{(t−1)} − ρΩ̃^{(t−1)})    [H defined in (16.28)]
    Ω̃^{(t)} ← S(U^{(t−1)} + ρΩ^{(t)})    [S defined in (16.9)]
    U^{(t)} ← U^{(t−1)} + ρ(Ω^{(t)} − Ω̃^{(t)})
    t ← t + 1
until convergence;
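For concreteness, here is a minimal Python sketch of Algorithm 16.6. It assumes that the soft-thresholding operator in Equation 16.9 is the usual elementwise rule sign(x) max(|x| − λ, 0); accordingly, the Ω̃ update is written below in the equivalent form S_{λ/ρ}(Ω + U/ρ). The function name, initialization, and stopping rule are ours.

import numpy as np

def graphical_lasso_admm(S, lam, rho=1.0, n_iter=200, tol=1e-6):
    # ADMM for the graphical lasso (Equation 16.26).
    # S: (p, p) sample covariance matrix; lam: l1 penalty parameter.
    p = S.shape[0]
    Omega_tilde = np.eye(p)
    U = np.zeros((p, p))
    for _ in range(n_iter):
        # Omega update: Omega = H(S + U - rho * Omega_tilde), computed via
        # an eigendecomposition as in Equation 16.28.
        e, V = np.linalg.eigh(S + U - rho * Omega_tilde)
        omega_eigs = (-e + np.sqrt(e ** 2 + 4.0 * rho)) / (2.0 * rho)
        Omega = (V * omega_eigs) @ V.T
        # Omega_tilde update: elementwise soft-thresholding of Omega + U/rho
        # at level lam/rho (equivalently, (1/rho) * S_lam(U + rho * Omega)).
        A = Omega + U / rho
        Omega_tilde_new = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
        # Dual update.
        U = U + rho * (Omega - Omega_tilde_new)
        # Stop when the primal residual and the change in Omega_tilde are small.
        if (np.linalg.norm(Omega - Omega_tilde_new) < tol
                and np.linalg.norm(Omega_tilde_new - Omega_tilde) < tol):
            Omega_tilde = Omega_tilde_new
            break
        Omega_tilde = Omega_tilde_new
    return Omega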
16.6 Discussion
The growing scale of modern datasets in research and in industry has opened the door to
increasingly ambitious demands of data. It has become routine to fit models that are far
more complex than would ever have been imagined previously. In many areas, the increase
in the complexity of the models considered is in fact outpacing the growth of the sample
sizes. This can lead to severe overfitting, unless problem-specific structural assumptions are
built into the estimation procedure.
In this chapter, we have reviewed a powerful framework for making tractable what
would otherwise be hopelessly underspecified statistical problems. The key idea is to strike a
balance between good model fit to the data (through the loss function) and model simplicity
(through the penalty function). This tension between flexibility and simplicity is not a new
concept to statisticians, who refer to it as the bias-variance tradeoff (see, e.g., Hastie et al.
2009).
Within the loss + penalty framework of Equation 16.1, the value of λ is used to control
where our estimator θ̂ lies in this tradeoff. When λ is small, we have a highly flexible model,
which in the case of insufficient observations could be prone to overfitting. At the other
extreme, when λ is large, we get a model that is less sensitive to the data observed, which
is beneficial when there are not enough observations to support a more complex model.
The optimal choice of the parameter λ depends on many factors that are unknowable to a
data analyst. Conceptually, it depends on how much information the data contain about
the unknown parameter θ, on the choice of penalty, and on the structure possessed by θ.
In practice, cross-validation, AIC, and BIC are often used to select a value of λ (see, e.g.,
Hastie et al. 2009).
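As an illustration of the cross-validation approach, here is a schematic Python sketch of K-fold selection of λ over a grid; the fit and predict arguments are placeholders for any of the penalized estimators discussed in this chapter, and the squared-error criterion is only one possible choice.

import numpy as np

def select_lambda_cv(X, Y, fit, predict, lambdas, n_folds=5, seed=0):
    # Schematic K-fold cross-validation for choosing the tuning parameter lambda.
    # fit(X_train, Y_train, lam) returns a fitted model; predict(model, X_val)
    # returns predictions for the held-out data.
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    cv_error = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for k in range(n_folds):
            train, val = folds != k, folds == k
            model = fit(X[train], Y[train], lam)
            cv_error[i] += np.mean((Y[val] - predict(model, X[val])) ** 2)
    return lambdas[int(np.argmin(cv_error))]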
In recent years, the theoretical properties of the solution to Equation 16.1 have been
studied extensively. A common thread in this work is that, under suitable assumptions, penalized estimators can perform reliably even when the complexity of the model grows faster than the sample size, a regime in which their classical unpenalized counterparts fail completely. Many authors have analyzed Equation 16.1 in the context of specific loss functions and specific penalties (see Bühlmann and van de Geer 2011 and references
therein for a comprehensive overview of theoretical results). Negahban et al. (2012) consider
Equation 16.1 in considerable generality.
Throughout this chapter, we have assumed that the loss and penalty functions are
convex. As mentioned earlier, convex formulations are attractive from a computational
standpoint—we do not have to worry about suboptimality of local minima, and we can
apply standard optimization methods, such as those presented in Section 16.4. Beyond this,
the convex framework has an advantage in that a solution θ̂ to Equation 16.1 has a simple
characterization in terms of a set of optimality conditions (see, e.g., Boyd and Vandenberghe
2004); this facilitates the study of its theoretical properties.
In this chapter, we have ignored a large literature of penalized estimation methods
that make use of nonconvex penalties (Fan and Li, 2001; Zhang, 2010). These methods
yield estimators that closely match the desired problem structure and that have attractive
theoretical properties. However, they pay a price in terms of local minima and difficulties
in computation.
In recent years, the success of penalized estimation techniques has led to a rich interplay
between the fields of statistics and optimization. For example, Agarwal et al. (2012) revisit
the theoretical convergence rate of PGM and show that a faster rate is obtained up to the
level of accuracy that is of interest in a statistical context. Another recent line of work
studies the tradeoffs between computational efficiency and statistical optimality (Berthet
and Rigollet, 2013; Chandrasekaran and Jordan, 2013; Ma and Wu, 2015).
In this chapter, we have presented a flexible toolkit for fitting complex models to large-
scale data. As the data continue to grow in scale, so too will the complexity of the questions
being asked, and consequently the need for penalized estimation techniques to answer those
questions.
Acknowledgments
JB was supported in part by NSF Award DMS-1405746 and DW was supported in part by
NSF CAREER Award DMS-1252624.
References
Agarwal, A., Negahban, S. and Wainwright, M. J. (2012), Fast global convergence of
gradient methods for high-dimensional statistical recovery, The Annals of Statistics 40(5),
2452–2482.