Extensions to multiple dimensions were soon given by Blum (1954). The necessary conditions for convergence in such cases are the negative definiteness of the Jacobian of $h$, or that $H$ is the stochastic gradient of a function with a unique zero (Wei, 1987, Ruppert, 1988b, section 4).
The original proof of Robbins and Monro (1951) is technical, but the main idea is straightforward. Let $b_n \triangleq \mathbb{E}\left[(\theta_n - \theta_\star)^2\right]$ denote the squared error of the iterates in Equation 14.9; then from iteration (Equation 14.9), it follows that
$$b_n = b_{n-1} + 2a_n\, \mathbb{E}\left[(\theta_{n-1} - \theta_\star)\, h(\theta_{n-1})\right] + a_n^2\, \mathbb{E}\left[H(\theta_{n-1})^2\right]$$
In the neighborhood of $\theta_\star$, we assume that $h(\theta_{n-1}) \approx h'(\theta_\star)(\theta_{n-1} - \theta_\star)$, and thus
$$b_n = \left(1 + 2a_n h'(\theta_\star)\right) b_{n-1} + a_n^2\, \mathbb{E}\left[H(\theta_{n-1})^2\right] \tag{14.10}$$
For a learning rate $a_n = \alpha/n$, using typical techniques in stochastic approximation (Chung, 1954), we can derive from Equation 14.10 that $b_n \to 0$. Furthermore, $n b_n \to \alpha^2 \sigma^2 \left(2\alpha |h'(\theta_\star)| - 1\right)^{-1}$, where $\sigma^2 \triangleq \mathbb{E}\left[H(\theta_\star)^2\right]$, as shown by several authors (Chung, 1954, Sacks, 1958, Fabian, 1968). Clearly, the learning rate parameter $\alpha$ is critical for the performance of the Robbins–Monro procedure. Its optimal value is $\alpha_\star = 1/|h'(\theta_\star)|$, which requires knowledge of the true parameter value $\theta_\star$ and the slope of $h$ at that point. This optimality result inspired an important line of research on adaptive stochastic approximation methods, such as the Venter process (Venter, 1967), in which quantities that are important for the convergence and efficiency of the iterates $\theta_n$ (e.g., the quantity $h'(\theta_\star)$) are estimated as the stochastic approximation proceeds.
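To make this concrete, below is a minimal simulation of the Robbins–Monro procedure, assuming a linear regression function $h(\theta) = -2(\theta - \theta_\star)$ observed with Gaussian noise; the model and all constants are illustrative assumptions, not part of the original development.

```python
import numpy as np

# Minimal Robbins-Monro sketch: find the root theta_star of h(theta) = E[H(theta)]
# from noisy evaluations H(theta) = h(theta) + noise. The linear h and all
# constants below are illustrative assumptions.
rng = np.random.default_rng(0)
theta_star, sigma = 2.0, 1.0        # true root and noise level (hypothetical)
slope = -2.0                        # h(theta) = slope * (theta - theta_star), so h'(theta_star) = -2
alpha = 1.0 / abs(slope)            # optimal rate parameter alpha_star = 1/|h'(theta_star)|

theta = 0.0
for n in range(1, 100_001):
    H = slope * (theta - theta_star) + sigma * rng.standard_normal()
    theta += (alpha / n) * H        # iteration (Equation 14.9) with a_n = alpha/n

print(theta)  # close to theta_star = 2.0
# Theory predicts n * b_n -> alpha^2 sigma^2 / (2 alpha |h'| - 1) = 0.25 here.
```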
14.2.1 Sakrison’s Recursive Estimation Method
Although initially motivated by sequential experiment design, the Robbins–Monro procedure was soon modified for statistical estimation. Similar to the estimation setup in Section 14.1, Sakrison (1965) was interested in estimating the parameters $\theta_\star$ of a model that generated i.i.d. observations $(X_n, Y_n)$, in a way that is computationally and statistically efficient. Sakrison first recognized that one could set $H(\theta) \triangleq \nabla \log f(\theta; X_n, Y_n) = \nabla\ell(\theta; X_n, Y_n)$ in the Robbins–Monro procedure (Equation 14.9), and use the identity $\mathbb{E}\left[\nabla\ell(\theta_\star; X_n, Y_n)\right] = 0$ to show why the procedure converges to the true parameter value $\theta_\star$. Sakrison's recursive estimation method was essentially the first explicit SGD method proposed in the literature:
$$\theta_n = \theta_{n-1} + a_n\, \mathcal{I}(\theta_{n-1})^{-1}\, \nabla\ell(\theta_{n-1}; X_n, Y_n) \tag{14.11}$$
where $a_n$ is a learning rate sequence that satisfies the Robbins–Monro conditions in Section 14.2. The SGD procedure (Equation 14.11) is second order, because it uses the Fisher information matrix in addition to the log-likelihood gradient. By the theory of stochastic approximation, $\theta_n \to \theta_\star$, and thus $\mathcal{I}(\theta_n) \to \mathcal{I}(\theta_\star)$. Sakrison (1965) proved that $n\, \mathbb{E}\left[\|\theta_n - \theta_\star\|^2\right] \to \mathrm{trace}\left(\mathcal{I}(\theta_\star)^{-1}\right)$, which indicates that estimation of $\theta_\star$ is asymptotically optimal, that is, it achieves the minimum variance of the maximum-likelihood estimator. However, Sakrison's method is not computationally efficient, as it requires an expensive matrix inversion at every iteration. Still, it reveals that estimation of the Fisher information matrix is essential for optimal SGD. Adaptive second-order methods leverage this insight to approximate the Fisher information matrix and improve upon first-order SGD methods.
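As an illustration, the sketch below applies Sakrison's update (Equation 14.11) to a scalar exponential model, where both the score and the Fisher information are available in closed form; the model choice and constants are ours, not Sakrison's.

```python
import numpy as np

# Sketch of Sakrison's second-order update (Equation 14.11) for an exponential
# model chosen for illustration: Y_n ~ Exponential(rate theta), with
# score(theta; y) = 1/theta - y and Fisher information I(theta) = 1/theta^2.
rng = np.random.default_rng(1)
theta_star = 3.0                    # hypothetical true rate
theta = 1.0

for n in range(1, 50_001):
    y = rng.exponential(scale=1.0 / theta_star)
    score = 1.0 / theta - y
    theta += (1.0 / n) * theta**2 * score   # a_n I(theta)^{-1} score, with a_n = 1/n
    theta = max(theta, 1e-3)        # practical guard against early overshoot;
                                    # not part of Equation 14.11

print(theta)  # close to theta_star = 3.0; here trace(I(theta_star)^{-1}) = theta_star^2
```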
14.3 Estimation with Stochastic Gradient Methods
We slightly generalize the SGD methods in Section 14.1 through the definitions
$$\theta_n^{\mathrm{sgd}} = \theta_{n-1}^{\mathrm{sgd}} + a_n\, C\, \nabla\ell(\theta_{n-1}^{\mathrm{sgd}}; X_n, Y_n) \tag{14.12}$$
$$\theta_n^{\mathrm{im}} = \theta_{n-1}^{\mathrm{im}} + a_n\, C\, \nabla\ell(\theta_n^{\mathrm{im}}; X_n, Y_n) \tag{14.13}$$
where $C$ is symmetric and positive definite, and commutes with $\mathcal{I}(\theta_\star)$; adaptive second-order methods, where $C$ is updated at every iteration, are discussed in Section 14.3.5.1. The iterate $\theta_n^{\mathrm{sgd}}$ is the explicit SGD estimator of $\theta_\star$ after the $n$th data point has been observed; similarly, $\theta_n^{\mathrm{im}}$ is the implicit SGD estimator of $\theta_\star$. The total number of data points, denoted by $N$, will be assumed to be practically infinite. We will then compare the asymptotic variance of these estimators with the variance of the maximum-likelihood estimator on $n$ data points, which, under typical regularity conditions, is $\frac{1}{n}\mathcal{I}(\theta_\star)^{-1}$. The evaluation is done from a frequentist perspective, that is, across multiple realizations of the dataset of up to $n$ data points, $D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, under the same model $f$ and true parameter value $\theta_\star$.
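For intuition, the sketch below instantiates Equations 14.12 and 14.13 with $C = I$ for a Gaussian mean model, in which the implicit update has a closed form; the model and constants are illustrative assumptions.

```python
import numpy as np

# Explicit (Equation 14.12) vs. implicit (Equation 14.13) SGD with C = I for an
# illustrative Gaussian mean model Y_n ~ N(theta_star, 1), where the score is
# y - theta and the implicit update can be solved in closed form.
rng = np.random.default_rng(2)
theta_star, alpha = 1.5, 1.0
ys = theta_star + rng.standard_normal(10_000)

th_sgd = th_im = 0.0
for n, y in enumerate(ys, start=1):
    a_n = alpha / n
    th_sgd = th_sgd + a_n * (y - th_sgd)        # explicit: score at previous iterate
    th_im = (th_im + a_n * y) / (1.0 + a_n)     # implicit: theta_n = theta_{n-1} + a_n (y - theta_n)

print(th_sgd, th_im)  # both close to theta_star = 1.5
```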
Typically, both SGD methods have two phases, namely the exploration phase and the convergence phase (Amari, 1998, Bottou and Murata, 2002). In the exploration phase, the iterates approach $\theta_\star$, whereas in the convergence phase, they jitter around $\theta_\star$ within a ball of slowly decreasing radius. We will overview a typical analysis of SGD in the final convergence phase, where a Taylor approximation in the neighborhood of $\theta_\star$ is assumed accurate (Murata, 1998, Toulis et al., 2014). In particular, let $\mu(\theta) = \mathbb{E}\left[\nabla\ell(\theta; X_n, Y_n)\right]$, and assume
$$\mu(\theta_n) = \mu(\theta_\star) + J_\mu(\theta_\star)(\theta_n - \theta_\star) + o(a_n) \tag{14.14}$$
where:
$J_\mu$ is the Jacobian of the function $\mu(\cdot)$
$o(a_n)$ denotes a vector sequence with norms of order $o(a_n)$
Under typical regularity conditions, $\mu(\theta_\star) = 0$ and $J_\mu(\theta_\star) = -\mathcal{I}(\theta_\star)$ (Lehmann and Casella, 1998).
14.3.1 Bias and Variance
Denote the biases of the two SGD methods by $\mathbb{E}\left[\theta_n^{\mathrm{sgd}} - \theta_\star\right] \triangleq b_n^{\mathrm{sgd}}$ and $\mathbb{E}\left[\theta_n^{\mathrm{im}} - \theta_\star\right] \triangleq b_n^{\mathrm{im}}$.
Then, by taking expectations in Equations 14.12 and 14.13, we obtain the recursions
$$b_n^{\mathrm{sgd}} = \left(I - a_n C \mathcal{I}(\theta_\star)\right) b_{n-1}^{\mathrm{sgd}} + o(a_n) \tag{14.15}$$
$$b_n^{\mathrm{im}} = \left(I + a_n C \mathcal{I}(\theta_\star)\right)^{-1} b_{n-1}^{\mathrm{im}} + o(a_n) \tag{14.16}$$
We observe that the convergence rate, that is, the rate at which the two methods become unbiased in the limit, differs between the two SGD methods.* The explicit SGD method converges faster than the implicit one because $\|I - a_n C \mathcal{I}(\theta_\star)\| < \|(I + a_n C \mathcal{I}(\theta_\star))^{-1}\|$ for sufficiently large $n$, but the rates become equal in the limit as $a_n \to 0$. However, the implicit method compensates by being more stable with respect to the specification of the learning rate sequence and the condition matrix $C$. Loosely speaking, the bias $b_n^{\mathrm{im}}$ cannot be much worse than $b_{n-1}^{\mathrm{im}}$ because $(I + a_n C \mathcal{I}(\theta_\star))^{-1}$ is a contraction matrix for any choice of $a_n > 0$. Exact nonasymptotic derivations for the bias of explicit SGD are given by Moulines and Bach (2011), and for the bias of implicit SGD by Toulis and Airoldi (2015a).

* This is an important distinction because, traditionally, the focus in optimization has been to obtain fast convergence to a parameter value that minimizes the empirical loss, for example, the maximum likelihood. From a statistical viewpoint, under variability of the data, there is a tradeoff between convergence to an estimator and the estimator's asymptotic variance (Le Cun and Bottou, 2004).
Regarding statistical efficiency, Toulis et al. (2014) showed that, if $(2C\mathcal{I}(\theta_\star) - I/\alpha)$ is positive definite, then
$$n\,\mathrm{Var}\left(\theta_n^{\mathrm{sgd}}\right) \to \alpha^2 \left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} C\, \mathcal{I}(\theta_\star)\, C$$
$$n\,\mathrm{Var}\left(\theta_n^{\mathrm{im}}\right) \to \alpha^2 \left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} C\, \mathcal{I}(\theta_\star)\, C \tag{14.17}$$
where $\alpha = \lim_{n\to\infty} n a_n$ is the learning rate parameter of SGD, as defined in Section 14.1. Therefore, both SGD methods have the same asymptotic efficiency, which depends on the learning rate parameter $\alpha$ and the Fisher information matrix $\mathcal{I}(\theta_\star)$. Intuitively, the term $\left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1}$ in Equation 14.17 is a factor that shows how much information is lost by the SGD methods. For example, setting $C = \mathcal{I}(\theta_\star)^{-1}$ and $\alpha = 1$ implies $\left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} = I$, and the asymptotic variance for both estimators is $(1/n)\,\mathcal{I}(\theta_\star)^{-1}$, that is, the minimum variance attainable by the maximum-likelihood estimator. This is exactly Sakrison's result presented in Section 14.2.1.
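As a quick check of Equation 14.17, consider the scalar case with $C = 1$ and Fisher information $\mathcal{I}(\theta_\star) = \lambda$. Then
$$n\,\mathrm{Var}(\theta_n) \to \frac{\alpha^2 \lambda}{2\alpha\lambda - 1},$$
which is minimized at $\alpha = 1/\lambda$, where it equals $1/\lambda = \mathcal{I}(\theta_\star)^{-1}$; that is, the optimally tuned scalar procedure attains the asymptotic variance of the maximum-likelihood estimator.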
Asymptotic variance results similar to Equation 14.17, but not in the context of model estimation, were first studied in the stochastic approximation literature by Chung (1954) and Sacks (1958), followed by Fabian (1968) and several other authors (see also Ljung et al., 1992, parts I, II), where more general formulas are possible using a Lyapunov equation.
14.3.2 Stability
Stability has been a well-known issue for explicit SGD (Gardner, 1984, Amari et al., 1997). In practice, the main problem is that the learning rate sequence $a_n$ needs to agree with the eigenvalues of the Fisher information matrix $\mathcal{I}(\theta_\star)$. To see this, let us simplify Equations 14.15 and 14.16 by dropping the remainder terms $o(a_n)$. It follows that
$$b_n^{\mathrm{sgd}} = \left(I - a_n C \mathcal{I}(\theta_\star)\right) b_{n-1}^{\mathrm{sgd}} = P_1^n b_0$$
$$b_n^{\mathrm{im}} = \left(I + a_n C \mathcal{I}(\theta_\star)\right)^{-1} b_{n-1}^{\mathrm{im}} = Q_1^n b_0 \tag{14.18}$$
where $P_1^n = \prod_{i=1}^n \left(I - a_i C \mathcal{I}(\theta_\star)\right)$, $Q_1^n = \prod_{i=1}^n \left(I + a_i C \mathcal{I}(\theta_\star)\right)^{-1}$, and $b_0$ denotes the initial bias of the two procedures from a common starting point $\theta_0$. The matrices $P_1^n$ and $Q_1^n$ describe how fast the initial bias $b_0$ decays for both SGD methods. For small-to-moderate $n$, the two matrices critically affect the stability of the SGD methods. For simplicity, we compare these matrices assuming the rate $a_n = \alpha/n$ and a fixed condition matrix $C = I$.
Under such assumptions, the eigenvalues of $P_1^n$ can be calculated as $\prod_{j=1}^n (1 - \alpha\lambda_i/j) = O(n^{-\alpha\lambda_i})$, for $0 < \alpha\lambda_i < 1$, where the $\lambda_i$ are the eigenvalues of the Fisher information matrix $\mathcal{I}(\theta_\star)$. Thus, the magnitude of $P_1^n$ will be dominated by $\lambda_{\max}$, the maximum eigenvalue of $\mathcal{I}(\theta_\star)$, and the rate of convergence to zero will be dominated by $\lambda_{\min}$, the minimum eigenvalue of $\mathcal{I}(\theta_\star)$. For stable eigenvalues, the terms in the aforementioned product need to be at most 1 in absolute value; therefore, it is desirable that $|1 - \alpha\lambda_{\max}| \le 1$, that is, $\alpha \le 2/\lambda_{\max}$.
For statistical efficiency, it is desirable that $(2\alpha\mathcal{I}(\theta_\star) - I)$ be positive definite, as shown in Equation 14.17, and so $\alpha > 1/(2\lambda_{\min})$. In high-dimensional settings, the conditions for stability and efficiency are hard to satisfy simultaneously, because $\lambda_{\max}$ is usually much larger than $\lambda_{\min}$. Thus, in explicit SGD, a small learning rate can guarantee stability, but this comes at a price in convergence, which will be of order $O(n^{-\alpha\lambda_{\min}})$. On the other hand, a large learning rate increases the convergence rate, but at a price in stability.
In stark contrast, the implicit procedure is unconditionally stable. The eigenvalues of $Q_1^n$ are $\prod_{j=1}^n (1 + \alpha\lambda_i/j)^{-1} = O(n^{-\alpha\lambda_i})$, and thus are guaranteed to be less than 1 for any choice of the learning rate parameter $\alpha$, because $(1 + \alpha\lambda_i/j)^{-1} < 1$ for every $i$ and $\alpha > 0$. The critical difference from explicit SGD is that a small $\alpha$ is no longer required for stability, because the eigenvalues of $Q_1^n$ are always less than 1.
Based on this analysis, the magnitude of $P_1^n$ can become arbitrarily large, and thus explicit SGD is likely to diverge numerically. In contrast, $Q_1^n$ is guaranteed to be bounded, and so, under any misspecification of the learning rate parameter, implicit SGD is guaranteed to remain bounded. The instability of explicit SGD is well known and requires careful work to avoid in practice. In the following section, we focus on the related task of selecting the learning rate sequence.
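The contrast is easy to reproduce numerically. The sketch below revisits the illustrative Gaussian mean model from before, scaled to have a large curvature $\lambda$ and paired with a deliberately large $\alpha$ (all constants are ours): the explicit iterates blow up in the early iterations, while the implicit iterates remain bounded.

```python
import numpy as np

# Numerical illustration of the stability contrast: Gaussian mean model scaled so
# the Fisher information is lam, with a deliberately large alpha * lam.
rng = np.random.default_rng(3)
theta_star, lam, alpha = 1.0, 50.0, 1.0     # alpha * lam = 50, far above 2
ys = theta_star + rng.standard_normal(200)

th_sgd = th_im = 0.0
for n, y in enumerate(ys, start=1):
    a = alpha / n
    th_sgd = th_sgd + a * lam * (y - th_sgd)          # factor (1 - a*lam): |1 - 50/n| >> 1 early on
    th_im = (th_im + a * lam * y) / (1.0 + a * lam)   # factor (1 + a*lam)^{-1} < 1 always
    if n in (1, 5, 20, 200):
        print(n, th_sgd, th_im)
# The explicit iterates transiently reach astronomical magnitudes, mirroring the
# growth of P_1^n; the implicit iterates never leave the vicinity of the data.
```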
14.3.3 Choice of Learning Rate Sequence
An interesting observation about the asymptotic variance results (Equation 14.17) is that, for any choice of the learning rate parameter $\alpha$, it holds that
$$\alpha^2 \left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} C\, \mathcal{I}(\theta_\star)\, C \succeq \mathcal{I}(\theta_\star)^{-1} \tag{14.19}$$
where $A \succeq B$ indicates that $A - B$ is nonnegative definite for two matrices $A$ and $B$. Hence, both SGD methods incur an information loss when compared to the maximum-likelihood estimator, and the loss can be quantified exactly through Equation 14.17. Such information loss can be avoided if we set $C = \mathcal{I}(\theta_\star)^{-1}$ and $\alpha = 1$; equivalently, we could use a sequence of matrices $C_n$ that converges to $\mathcal{I}(\theta_\star)^{-1}$, as in Sakrison's procedure (Sakrison, 1965).
However, this requires knowledge of the Fisher information matrix at the true parameter value $\theta_\star$, which is unknown. The Venter process (Venter, 1967) was the first method to follow an adaptive approach, in which the Fisher information matrix is estimated as the procedure runs; it was later analyzed and extended by several other authors (Fabian, 1973, Lai and Robbins, 1979, Amari et al., 2000, Bottou and Le Cun, 2005). Adaptive methods that approximate the matrix $\mathcal{I}(\theta_\star)$ (e.g., through a quasi-Newton scheme) have recently been applied with considerable success (Schraudolph et al., 2007, Bordes et al., 2009); we review such methods in Section 14.3.5.1.
An efficiency loss is generally unavoidable in first-order SGD, that is, when $C = I$. In this case, there is no loss only when the eigenvalues $\lambda_i$ of the Fisher information matrix are all identical. When the eigenvalues are distinct, one reasonable way to set the learning rate parameter $\alpha$ is to minimize the trace of the asymptotic variance matrix in Equation 14.17, that is, to solve
$$\hat{\alpha} = \arg\min_\alpha \sum_i \frac{\alpha^2 \lambda_i}{2\alpha\lambda_i - 1} \tag{14.20}$$
under the constraint that $\alpha > 1/(2\lambda_{\min})$, thus making an undesirable but necessary compromise to ensure convergence in all parameter components. In practice, however, the eigenvalues $\lambda_i$ are unknown and need to be estimated from the data. This problem has received significant attention recently, and several methods exist (see Karoui, 2008, and references therein).
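Given eigenvalue estimates, Equation 14.20 is a one-dimensional optimization that standard numerical routines handle directly; the sketch below uses SciPy, with hypothetical eigenvalue estimates standing in for the unknown spectrum of $\mathcal{I}(\theta_\star)$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Solve Equation 14.20 for the learning rate parameter alpha, given (estimated)
# eigenvalues of the Fisher information matrix; the eigenvalues here are made up.
lam = np.array([0.5, 1.0, 4.0, 10.0])
lam_min = lam.min()

def asymptotic_trace(alpha):
    # trace of the asymptotic variance matrix in Equation 14.17 when C = I
    return np.sum(alpha ** 2 * lam / (2 * alpha * lam - 1))

eps = 1e-8
res = minimize_scalar(asymptotic_trace,
                      bounds=(1 / (2 * lam_min) + eps, 100.0),
                      method="bounded")
print(res.x)  # alpha_hat; compare with the simpler choice alpha = 1/lam_min = 2.0
```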
Several more options for setting the learning rate are available, owing to the voluminous research literature on learning rate sequences for stochastic approximation and SGD. In general, the learning rate for explicit SGD should be of the form $a_n = \alpha(\alpha\beta + n)^{-1}$. The parameter $\alpha$ controls the asymptotic variance (see Equation 14.17), and a reasonable choice is the solution of Equation 14.20, which requires estimates of the eigenvalues of the Fisher information matrix $\mathcal{I}(\theta_\star)$. A simpler choice is $\alpha = 1/\lambda_{\min}$, where $\lambda_{\min}$ is the minimum eigenvalue of $\mathcal{I}(\theta_\star)$; the value $1/\lambda_{\min}$ is an approximate solution of Equation 14.20 with good empirical performance (Xu, 2011, Toulis et al., 2014). The parameter $\beta$ is used to stabilize explicit SGD. In particular, it normalizes the learning rate to account for the variance of the stochastic gradient, $\mathrm{Var}\left(\nabla\ell(\theta_n; X_n, Y_n)\right) = \mathcal{I}(\theta_\star) + O(a_n)$, for points near $\theta_\star$. One reasonable value is $\beta = \mathrm{trace}(\mathcal{I}(\theta_\star))$, which can be estimated easily by summing norms of the score function, that is, $\hat{\beta} = \sum_{i=1}^n \|\nabla\ell(\theta_{i-1}; X_i, Y_i)\|^2$. This idea is extended to multiple dimensions by Amari et al. (2000), Duchi et al. (2011), and Schaul et al. (2012); we discuss it further in Section 14.3.5.1.
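A minimal sketch of this normalization, under our own illustrative model and variable names, could look as follows: a running sum of squared score norms serves as $\hat\beta$, which then enters the learning rate $a_n = \alpha(\alpha\hat\beta + n)^{-1}$.

```python
import numpy as np

# Explicit SGD with the stabilized rate a_n = alpha / (alpha * beta_hat + n),
# where beta_hat accumulates squared norms of observed scores (Gaussian mean
# model; constants are illustrative, and using the running beta_hat at each
# step is one plausible reading of the estimator above).
rng = np.random.default_rng(4)
theta_star, alpha = 2.0, 1.0
theta, beta_hat = 0.0, 0.0

for n in range(1, 10_001):
    y = theta_star + rng.standard_normal()
    score = y - theta                      # gradient of the log-likelihood
    beta_hat += score ** 2                 # running normalization estimate
    a_n = alpha / (alpha * beta_hat + n)   # stabilized Robbins-Monro rate
    theta += a_n * score

print(theta)  # close to theta_star = 2.0
```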
For implicit SGD, a learning rate sequence $a_n = \alpha(\alpha + n)^{-1}$ works well in practice (Toulis et al., 2014). As before, $\alpha$ controls statistical efficiency, and we can set $\alpha = 1/\lambda_{\min}$, as in explicit SGD. The additional stability term $\beta$ of explicit SGD is unnecessary in implicit SGD because the implicit method performs an indirect normalization of the learning rate; this is similar to the shrinkage described in Equation 14.4.
Ultimately, tuning the learning rate sequence depends on problem-specific considerations, and a considerable variety of sequences have been employed in practice (George and Powell, 2006, Schaul et al., 2012). Principled design of learning rates in first-order SGD remains an important research topic; for example, recent work has investigated variance reduction techniques (Johnson and Zhang, 2013, Wang et al., 2013) and even constant learning rates for least-squares models (Bach and Moulines, 2013). Second-order methods, which essentially maintain multiple learning rates, one for each parameter component, are discussed in Section 14.3.5.1.
14.3.4 Efficient Computation of Implicit Methods
The update in implicit SGD (Equation 14.3) is a p-dimensional fixed-point equation, which
is generally hard to solve. However, in many statistical models, Equation 14.3 can be
reduced to a one-dimensional fixed-point equation, which can be computed very fast using
a numerical root-finding method.
Consider a linear statistical model in which $\nabla\ell(\theta; X_n, Y_n)$ depends on $\theta$ only through the linear term $X_n^\top \theta$. A large family of models satisfies this condition: generalized linear models, generalized additive models, proportional hazards models, and so on. We write $\ell(\theta; X_n, Y_n) = g_n(X_n^\top \theta)$, where the dependence of $g$ on $(X_n, Y_n)$ is suppressed into the subscript $n$. Then, $\nabla\ell(\theta; X_n, Y_n) = g_n'(X_n^\top \theta)\, X_n$, and therefore the direction of the gradient of the log-likelihood is parameter free. It follows that the implicit procedure can be written as
$$\theta_n^{\mathrm{im}} = \theta_{n-1}^{\mathrm{im}} + a_n \lambda_n \nabla\ell(\theta_{n-1}^{\mathrm{im}}; X_n, Y_n) \tag{14.21}$$
where the gradient is now calculated at the previous estimate $\theta_{n-1}^{\mathrm{im}}$ and $\lambda_n$ is an appropriate scalar. We now derive $\lambda_n$ by combining the definition of implicit SGD (Equations 14.3 and 14.21):
$$\theta_{n-1}^{\mathrm{im}} + a_n \lambda_n \nabla\ell(\theta_{n-1}^{\mathrm{im}}; X_n, Y_n) = \theta_{n-1}^{\mathrm{im}} + a_n \nabla\ell(\theta_n^{\mathrm{im}}; X_n, Y_n)$$
$$\Rightarrow\quad \lambda_n\, g_n'(X_n^\top \theta_{n-1}^{\mathrm{im}}) = g_n'(X_n^\top \theta_n^{\mathrm{im}}) \tag{14.22}$$
Substituting Equation 14.21 into Equation 14.22, we get
$$\lambda_n = \frac{g_n'\left(X_n^\top \theta_{n-1}^{\mathrm{im}} + a_n \lambda_n \|X_n\|^2\, g_n'(X_n^\top \theta_{n-1}^{\mathrm{im}})\right)}{g_n'(X_n^\top \theta_{n-1}^{\mathrm{im}})} \tag{14.23}$$
Equation 14.23 is a one-dimensional fixed-point equation with respect to $\lambda_n$. Thus, the implicit iterate $\theta_n^{\mathrm{im}}$ of Equation 14.3 can be efficiently computed by first obtaining $\lambda_n$ from the fixed-point Equation 14.23 and then applying the update in Equation 14.21.