In stark contrast, the implicit procedure is unconditionally stable. The eigenvalues of $Q_1^n$ are $\prod_{j=1}^{n} 1/(1 + \alpha\lambda_i/j) = O(n^{-\alpha\lambda_i})$, and thus are guaranteed to be less than 1 for any choice of the learning rate parameter $\alpha$, because $(1 + \alpha\lambda_i/j)^{-1} < 1$ for every $i$ and $\alpha > 0$. The critical difference from explicit SGD is that a small $\alpha$ is no longer required for stability, because the eigenvalues of $Q_1^n$ are always less than 1.
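As a quick numerical check of the claimed $O(n^{-\alpha\lambda_i})$ decay, the following sketch evaluates the eigenvalue product of $Q_1^n$ directly; the values of $\alpha$ and the single eigenvalue $\lambda$ are arbitrary choices for illustration:

```python
import numpy as np

# Evaluate one eigenvalue of Q_1^n, prod_{j=1}^n 1/(1 + alpha*lam/j),
# for an arbitrary (deliberately large) alpha and one eigenvalue lam.
alpha, lam = 10.0, 0.5
for n in (10, 100, 1000, 10_000):
    j = np.arange(1, n + 1)
    q = np.prod(1.0 / (1.0 + alpha * lam / j))  # always < 1, for any alpha > 0
    print(n, q, q * n ** (alpha * lam))  # last column approaches a constant,
                                         # consistent with the O(n^{-alpha*lam}) rate
```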
Based on this analysis, the magnitude of $P_1^n$ can become arbitrarily large, and thus explicit SGD is liable to diverge numerically. In contrast, $Q_1^n$ is guaranteed to be bounded, and so implicit SGD remains numerically stable under any misspecification of the learning rate parameter. The instability of explicit SGD is well known and requires careful work to avoid in practice.
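The contrast is easy to see numerically. The sketch below compares one eigenvalue of $P_1^n$ and $Q_1^n$ under a badly misspecified learning rate; the values $\alpha = 100$ and $\lambda = 1$ are hypothetical, chosen only to exaggerate the effect:

```python
import numpy as np

# Compare one eigenvalue of P_1^n (explicit) and Q_1^n (implicit) after n steps,
# under a badly misspecified learning rate parameter alpha (hypothetical values).
alpha, lam, n = 100.0, 1.0, 50
j = np.arange(1, n + 1)
p = np.prod(1.0 - alpha * lam / j)           # explicit: early factors have magnitude >> 1
q = np.prod(1.0 / (1.0 + alpha * lam / j))   # implicit: every factor is < 1
print(f"explicit |P|: {abs(p):.3e}, implicit Q: {q:.3e}")  # differ by ~70 orders of magnitude
```

In the following section, we focus on the related task of selecting the learning rate sequence.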
14.3.3 Choice of Learning Rate Sequence
An interesting observation about the asymptotic variance results (Equation 14.17) is that, for any choice of the learning rate parameter $\alpha$, it holds that
$$\alpha^2 \left(2\alpha C\, I(\theta_\star) - I\right)^{-1} C\, I(\theta_\star)\, C \;\geq\; I(\theta_\star)^{-1} \tag{14.19}$$
where $A \geq B$ indicates that $A - B$ is nonnegative definite for two matrices $A$ and $B$. Hence, both SGD methods incur an information loss when compared to the maximum-likelihood estimator, and the loss can be quantified exactly through Equation 14.17. Such information loss can be avoided if we set $C = I(\theta_\star)^{-1}$ and $\alpha = 1$.* However, this requires knowledge of the Fisher information matrix at the true parameter value $\theta_\star$, which is unknown. The Venter process (Venter, 1967) was the first method to follow an adaptive approach to estimating the Fisher matrix, and it was later analyzed and extended by several other authors (Fabian, 1973, Lai and Robbins, 1979, Amari et al., 2000, Bottou and Le Cun, 2005). Adaptive methods that approximate the matrix $I(\theta_\star)$ (e.g., through a quasi-Newton scheme) have recently been applied with considerable success (Schraudolph et al., 2007, Bordes et al., 2009); we review such methods in Section 14.3.5.1.
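As a sanity check on both claims, the following sketch evaluates the asymptotic variance $\alpha^2(2\alpha C I(\theta_\star) - I)^{-1} C I(\theta_\star) C$ of Equation 14.17 for an assumed $2\times 2$ Fisher matrix; the matrix entries are purely illustrative:

```python
import numpy as np

# Illustrative 2x2 Fisher information matrix (assumed values).
fisher = np.array([[2.0, 0.3],
                   [0.3, 0.5]])
Id = np.eye(2)

def avar(alpha, C):
    # Asymptotic variance from Equation 14.17.
    return alpha**2 * np.linalg.inv(2 * alpha * C @ fisher - Id) @ C @ fisher @ C

# With C = I(theta)^{-1} and alpha = 1, the information loss vanishes.
print(np.allclose(avar(1.0, np.linalg.inv(fisher)), np.linalg.inv(fisher)))  # True

# For first-order SGD (C = Id), the gap in inequality 14.19 is nonnegative definite;
# alpha must exceed 1/(2*lambda_min), here about 1.13.
gap = avar(1.2, Id) - np.linalg.inv(fisher)
print(np.all(np.linalg.eigvalsh(gap) >= -1e-10))  # True
```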
However, an efficiency loss is generally unavoidable in first-order SGD, that is, when $C = I$. In such cases, there is no loss only when the eigenvalues $\lambda_i$ of the Fisher information matrix are identical. When those eigenvalues are distinct, one reasonable way to set the learning rate parameter $\alpha$ is to minimize the trace of the asymptotic variance matrix in Equation 14.17, that is, solve
$$\hat{\alpha} = \arg\min_{\alpha} \sum_i \frac{\alpha^2 \lambda_i}{2\alpha\lambda_i - 1} \tag{14.20}$$
under the constraint that $\alpha > 1/(2\lambda_{\min})$, thus making an undesirable but necessary compromise for convergence in all parameter components. However, the eigenvalues $\lambda_i$ are unknown in practice and need to be estimated from the data. This problem has received significant attention recently, and several methods exist (see Karoui, 2008, and references therein).
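Once estimates of the eigenvalues are available, Equation 14.20 can be solved numerically. A minimal sketch, with hypothetical eigenvalues standing in for estimates:

```python
import numpy as np

# Hypothetical (in practice, estimated) eigenvalues of the Fisher information matrix.
lam = np.array([2.06, 0.90, 0.44])
lo = 1.0 / (2.0 * lam.min())  # constraint: alpha > 1/(2*lambda_min)

# Grid search for the alpha minimizing the trace in Equation 14.20.
grid = np.linspace(1.001 * lo, 10.0, 100_000)
trace = (grid[:, None] ** 2 * lam / (2.0 * grid[:, None] * lam - 1.0)).sum(axis=1)
alpha_hat = grid[np.argmin(trace)]
print(alpha_hat)
```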
Several more options for setting the learning rate are available, owing to the voluminous research literature on learning rate sequences for stochastic approximation and SGD. In general, the learning rate for explicit SGD should be of the form $a_n = \alpha(\alpha\beta + n)^{-1}$. The parameter $\alpha$ controls the asymptotic variance (see Equation 14.17), and a reasonable choice
* Equivalently, we could have a sequence of matrices $C_n$ that converges to $I(\theta_\star)^{-1}$, as in Sakrison's procedure (Sakrison, 1965).