Extensions to multiple dimensions were soon given by Blum (1954). The necessary conditions for convergence in such cases are the negative definiteness of the Jacobian of $h$, or that $H$ is the stochastic gradient of a function with a unique zero (Wei, 1987, Ruppert, 1988b, section 4).
The original proof of Robbins and Monro (1951) is technical, but the main idea is straightforward. Let $b_n \triangleq \mathbb{E}\left[(\theta_n - \theta_\star)^2\right]$ denote the squared error of the iterates in Equation 14.9; then from iteration (Equation 14.9), it follows that
$$b_n = b_{n-1} + 2a_n\, \mathbb{E}\left[(\theta_{n-1} - \theta_\star)\, h(\theta_{n-1})\right] + a_n^2\, \mathbb{E}\left[H(\theta_{n-1})^2\right]$$
In the neighborhood of $\theta_\star$, we assume that $h(\theta_{n-1}) \approx h'(\theta_\star)(\theta_{n-1} - \theta_\star)$, and thus
$$b_n = \left(1 + 2a_n h'(\theta_\star)\right) b_{n-1} + a_n^2\, \mathbb{E}\left[H(\theta_{n-1})^2\right] \tag{14.10}$$
For a learning rate $a_n = \alpha/n$, using typical techniques in stochastic approximation (Chung, 1954), we can derive from Equation 14.10 that $b_n \to 0$. Furthermore, $n b_n \to \alpha^2 \sigma^2 \left(2\alpha |h'(\theta_\star)| - 1\right)^{-1}$, where $\sigma^2 \triangleq \mathbb{E}\left[H(\theta_\star)^2\right]$, as shown by several authors (Chung, 1954, Sacks, 1958, Fabian, 1968). Clearly, the learning rate parameter $\alpha$ is critical for the performance of the Robbins–Monro procedure. Its optimal value is $\alpha_\star = 1/|h'(\theta_\star)|$, which requires knowledge of the true parameter value $\theta_\star$ and the slope of $h$ at that point. This optimality result inspired an important line of research on adaptive stochastic approximation methods, such as the Venter process (Venter, 1967), in which quantities that are important for the convergence and efficiency of the iterates $\theta_n$ (e.g., the quantity $h'(\theta_\star)$) are estimated as the stochastic approximation proceeds.
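To make this concrete, below is a minimal simulation of the Robbins–Monro procedure, assuming a linear regression function $h(\theta) = -2(\theta - \theta_\star)$ observed with Gaussian noise; the model and all constants are illustrative assumptions, not part of the original development.

```python
import numpy as np

# Minimal Robbins-Monro sketch: find the root theta_star of h(theta) = E[H(theta)]
# from noisy evaluations H(theta) = h(theta) + noise. The linear h and all
# constants below are illustrative assumptions.
rng = np.random.default_rng(0)
theta_star, sigma = 2.0, 1.0        # true root and noise level (hypothetical)
slope = -2.0                        # h(theta) = slope * (theta - theta_star), so h'(theta_star) = -2
alpha = 1.0 / abs(slope)            # optimal rate parameter alpha_star = 1/|h'(theta_star)|

theta = 0.0
for n in range(1, 100_001):
    H = slope * (theta - theta_star) + sigma * rng.standard_normal()
    theta += (alpha / n) * H        # iteration (Equation 14.9) with a_n = alpha/n

print(theta)  # close to theta_star = 2.0
# Theory predicts n * b_n -> alpha^2 sigma^2 / (2 alpha |h'| - 1) = 0.25 here.
```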
14.2.1 Sakrison’s Recursive Estimation Method
Although initially motivated by sequential experiment design, the Robbins–Monro procedure was soon modified for statistical estimation. Similar to the estimation setup in Section 14.1, Sakrison (1965) was interested in estimating the parameters $\theta_\star$ of a model that generated i.i.d. observations $(X_n, Y_n)$, in a way that is computationally and statistically efficient. Sakrison first recognized that one could set $H(\theta) \triangleq \nabla \log f(\theta; X_n, Y_n) = \nabla\ell(\theta; X_n, Y_n)$ in the Robbins–Monro procedure (Equation 14.9), and use the identity $\mathbb{E}\left[\nabla\ell(\theta_\star; X_n, Y_n)\right] = 0$ to show why the procedure converges to the true parameter value $\theta_\star$. Sakrison's recursive estimation method was essentially the first explicit SGD method proposed in the literature:
$$\theta_n = \theta_{n-1} + a_n\, \mathcal{I}(\theta_{n-1})^{-1}\, \nabla\ell(\theta_{n-1}; X_n, Y_n) \tag{14.11}$$
where $a_n$ is a learning rate sequence that satisfies the Robbins–Monro conditions in Section 14.2. The SGD procedure (Equation 14.11) is second order, because it uses the Fisher information matrix in addition to the log-likelihood gradient. By the theory of stochastic approximation, $\theta_n \to \theta_\star$, and thus $\mathcal{I}(\theta_n) \to \mathcal{I}(\theta_\star)$. Sakrison (1965) proved that $n\, \mathbb{E}\left[\|\theta_n - \theta_\star\|^2\right] \to \mathrm{trace}\left(\mathcal{I}(\theta_\star)^{-1}\right)$, which indicates that estimation of $\theta_\star$ is asymptotically optimal, that is, it achieves the minimum variance of the maximum-likelihood estimator. However, Sakrison's method is not computationally efficient, as it requires an expensive matrix inversion at every iteration. Still, it reveals that estimation of the Fisher information matrix is essential for optimal SGD. Adaptive second-order methods leverage this insight to approximate the Fisher information matrix and improve upon first-order SGD methods.
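As an illustration, the sketch below applies Sakrison's update (Equation 14.11) to a scalar exponential model, where both the score and the Fisher information are available in closed form; the model choice and constants are ours, not Sakrison's.

```python
import numpy as np

# Sketch of Sakrison's second-order update (Equation 14.11) for an exponential
# model chosen for illustration: Y_n ~ Exponential(rate theta), with
# score(theta; y) = 1/theta - y and Fisher information I(theta) = 1/theta^2.
rng = np.random.default_rng(1)
theta_star = 3.0                    # hypothetical true rate
theta = 1.0

for n in range(1, 50_001):
    y = rng.exponential(scale=1.0 / theta_star)
    score = 1.0 / theta - y
    theta += (1.0 / n) * theta**2 * score   # a_n I(theta)^{-1} score, with a_n = 1/n
    theta = max(theta, 1e-3)        # practical guard against early overshoot;
                                    # not part of Equation 14.11

print(theta)  # close to theta_star = 3.0; here trace(I(theta_star)^{-1}) = theta_star^2
```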
14.3 Estimation with Stochastic Gradient Methods
We slightly generalize the SGD methods in Section 14.1 through the definitions
$$\theta_n^{\mathrm{sgd}} = \theta_{n-1}^{\mathrm{sgd}} + a_n\, C\, \nabla\ell(\theta_{n-1}^{\mathrm{sgd}}; X_n, Y_n) \tag{14.12}$$
$$\theta_n^{\mathrm{im}} = \theta_{n-1}^{\mathrm{im}} + a_n\, C\, \nabla\ell(\theta_n^{\mathrm{im}}; X_n, Y_n) \tag{14.13}$$
where $C$ is symmetric and positive definite, and commutes with $\mathcal{I}(\theta_\star)$; adaptive second-order methods, where $C$ is updated at every iteration, are discussed in Section 14.3.5.1. The iterate $\theta_n^{\mathrm{sgd}}$ is the explicit SGD estimator of $\theta_\star$ after the $n$th data point has been observed; similarly, $\theta_n^{\mathrm{im}}$ is the implicit SGD estimator of $\theta_\star$. The total number of data points, denoted by $N$, will be assumed to be practically infinite. We will then compare the asymptotic variance of these estimators with the variance of the maximum-likelihood estimator on $n$ data points, which, under typical regularity conditions, is $\frac{1}{n}\mathcal{I}(\theta_\star)^{-1}$. The evaluation is done from a frequentist perspective, that is, across multiple realizations of the dataset of up to $n$ data points, $D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, under the same model $f$ and true parameter value $\theta_\star$.
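For intuition, the sketch below instantiates Equations 14.12 and 14.13 with $C = I$ for a Gaussian mean model, in which the implicit update has a closed form; the model and constants are illustrative assumptions.

```python
import numpy as np

# Explicit (Equation 14.12) vs. implicit (Equation 14.13) SGD with C = I for an
# illustrative Gaussian mean model Y_n ~ N(theta_star, 1), where the score is
# y - theta and the implicit update can be solved in closed form.
rng = np.random.default_rng(2)
theta_star, alpha = 1.5, 1.0
ys = theta_star + rng.standard_normal(10_000)

th_sgd = th_im = 0.0
for n, y in enumerate(ys, start=1):
    a_n = alpha / n
    th_sgd = th_sgd + a_n * (y - th_sgd)        # explicit: score at previous iterate
    th_im = (th_im + a_n * y) / (1.0 + a_n)     # implicit: theta_n = theta_{n-1} + a_n (y - theta_n)

print(th_sgd, th_im)  # both close to theta_star = 1.5
```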
Typically, both SGD methods have two phases, namely the exploration phase and the convergence phase (Amari, 1998, Bottou and Murata, 2002). In the exploration phase, the iterates approach $\theta_\star$, whereas in the convergence phase, they jitter around $\theta_\star$ within a ball of slowly decreasing radius. We will overview a typical analysis of SGD in the final convergence phase, where a Taylor approximation in the neighborhood of $\theta_\star$ is assumed accurate (Murata, 1998, Toulis et al., 2014). In particular, let $\mu(\theta) = \mathbb{E}\left[\nabla\ell(\theta; X_n, Y_n)\right]$, and assume
$$\mu(\theta_n) = \mu(\theta_\star) + J_\mu(\theta_\star)(\theta_n - \theta_\star) + o(a_n) \tag{14.14}$$
where:
$J_\mu$ is the Jacobian of the function $\mu(\cdot)$
$o(a_n)$ denotes a vector sequence with norms of order $o(a_n)$
Under typical regularity conditions, $\mu(\theta_\star) = 0$ and $J_\mu(\theta_\star) = -\mathcal{I}(\theta_\star)$ (Lehmann and Casella, 1998).
14.3.1 Bias and Variance
Denote the biases of the two SGD methods by $\mathbb{E}\left[\theta_n^{\mathrm{sgd}} - \theta_\star\right] \triangleq b_n^{\mathrm{sgd}}$ and $\mathbb{E}\left[\theta_n^{\mathrm{im}} - \theta_\star\right] \triangleq b_n^{\mathrm{im}}$.
Then, by taking expectations in Equations 14.12 and 14.13, we obtain the recursions
$$b_n^{\mathrm{sgd}} = \left(I - a_n C \mathcal{I}(\theta_\star)\right) b_{n-1}^{\mathrm{sgd}} + o(a_n) \tag{14.15}$$
$$b_n^{\mathrm{im}} = \left(I + a_n C \mathcal{I}(\theta_\star)\right)^{-1} b_{n-1}^{\mathrm{im}} + o(a_n) \tag{14.16}$$
We observe that the convergence rate, that is, the rate at which the two methods become unbiased in the limit, differs between the two SGD methods.* The explicit SGD method converges faster than the implicit one because $\|I - a_n C \mathcal{I}(\theta_\star)\| < \|(I + a_n C \mathcal{I}(\theta_\star))^{-1}\|$ for sufficiently large $n$, but the rates become equal in the limit as $a_n \to 0$. However, the implicit method compensates by being more stable with respect to the specification of the learning rate sequence and the condition matrix $C$. Loosely speaking, the bias $b_n^{\mathrm{im}}$ cannot be much worse than $b_{n-1}^{\mathrm{im}}$ because $(I + a_n C \mathcal{I}(\theta_\star))^{-1}$ is a contraction matrix for any choice of $a_n > 0$. Exact nonasymptotic derivations for the bias of explicit SGD are given by Moulines and Bach (2011), and for the bias of implicit SGD by Toulis and Airoldi (2015a).

* This is an important distinction because, traditionally, the focus in optimization has been to obtain fast convergence to a parameter value that minimizes the empirical loss, for example, the maximum likelihood. From a statistical viewpoint, under variability of the data, there is a tradeoff between convergence to an estimator and the estimator's asymptotic variance (Le Cun and Bottou, 2004).
Regarding statistical efficiency, Toulis et al. (2014) showed that, if $(2C\mathcal{I}(\theta_\star) - I/\alpha)$ is positive definite, then
$$n\,\mathrm{Var}\left(\theta_n^{\mathrm{sgd}}\right) \to \alpha^2 \left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} C\, \mathcal{I}(\theta_\star)\, C$$
$$n\,\mathrm{Var}\left(\theta_n^{\mathrm{im}}\right) \to \alpha^2 \left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} C\, \mathcal{I}(\theta_\star)\, C \tag{14.17}$$
where $\alpha = \lim_{n\to\infty} n a_n$ is the learning rate parameter of SGD, as defined in Section 14.1. Therefore, both SGD methods have the same asymptotic efficiency, which depends on the learning rate parameter $\alpha$ and the Fisher information matrix $\mathcal{I}(\theta_\star)$. Intuitively, the term $\left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1}$ in Equation 14.17 is a factor that shows how much information is lost by the SGD methods. For example, setting $C = \mathcal{I}(\theta_\star)^{-1}$ and $\alpha = 1$ implies $\left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} = I$, and the asymptotic variance for both estimators is $(1/n)\,\mathcal{I}(\theta_\star)^{-1}$, that is, the minimum variance attainable by the maximum-likelihood estimator. This is exactly Sakrison's result presented in Section 14.2.1.
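As a quick check of Equation 14.17, consider the scalar case with $C = 1$ and Fisher information $\mathcal{I}(\theta_\star) = \lambda$. Then
$$n\,\mathrm{Var}(\theta_n) \to \frac{\alpha^2 \lambda}{2\alpha\lambda - 1},$$
which is minimized at $\alpha = 1/\lambda$, where it equals $1/\lambda = \mathcal{I}(\theta_\star)^{-1}$; that is, the optimally tuned scalar procedure attains the asymptotic variance of the maximum-likelihood estimator.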
Asymptotic variance results similar to Equation 14.17, but not in the context of model estimation, were first studied in the stochastic approximation literature by Chung (1954) and Sacks (1958), followed by Fabian (1968) and several other authors (see also Ljung et al., 1992, parts I, II), where more general formulas are possible using a Lyapunov equation.
14.3.2 Stability
Stability has been a well-known issue for explicit SGD (Gardner, 1984, Amari et al., 1997). In practice, the main problem is that the learning rate sequence $a_n$ needs to agree with the eigenvalues of the Fisher information matrix $\mathcal{I}(\theta_\star)$. To see this, let us simplify Equations 14.15 and 14.16 by dropping the remainder terms $o(a_n)$. It follows that
$$b_n^{\mathrm{sgd}} = \left(I - a_n C \mathcal{I}(\theta_\star)\right) b_{n-1}^{\mathrm{sgd}} = P_1^n b_0$$
$$b_n^{\mathrm{im}} = \left(I + a_n C \mathcal{I}(\theta_\star)\right)^{-1} b_{n-1}^{\mathrm{im}} = Q_1^n b_0 \tag{14.18}$$
where $P_1^n = \prod_{i=1}^n \left(I - a_i C \mathcal{I}(\theta_\star)\right)$, $Q_1^n = \prod_{i=1}^n \left(I + a_i C \mathcal{I}(\theta_\star)\right)^{-1}$, and $b_0$ denotes the initial bias of the two procedures from a common starting point $\theta_0$. The matrices $P_1^n$ and $Q_1^n$ describe how fast the initial bias $b_0$ decays for both SGD methods. For small-to-moderate $n$, the two matrices critically affect the stability of the SGD methods. For simplicity, we compare these matrices assuming the rate $a_n = \alpha/n$ and a fixed condition matrix $C = I$.
Under such assumptions, the eigenvalues of $P_1^n$ can be calculated as $\prod_{j=1}^n (1 - \alpha\lambda_i/j) = O(n^{-\alpha\lambda_i})$, for $0 < \alpha\lambda_i < 1$, where the $\lambda_i$ are the eigenvalues of the Fisher information matrix $\mathcal{I}(\theta_\star)$. Thus, the magnitude of $P_1^n$ will be dominated by $\lambda_{\max}$, the maximum eigenvalue of $\mathcal{I}(\theta_\star)$, and the rate of convergence to zero will be dominated by $\lambda_{\min}$, the minimum eigenvalue of $\mathcal{I}(\theta_\star)$. For stable eigenvalues, the terms in the aforementioned product need to be at most 1 in absolute value; therefore, it is desirable that $|1 - \alpha\lambda_{\max}| \le 1$, that is, $\alpha \le 2/\lambda_{\max}$.
For statistical efficiency, it is desirable that $(2\alpha\mathcal{I}(\theta_\star) - I)$ be positive definite, as shown in Equation 14.17, and so $\alpha > 1/(2\lambda_{\min})$. In high-dimensional settings, the conditions for stability and efficiency are hard to satisfy simultaneously, because $\lambda_{\max}$ is usually much larger than $\lambda_{\min}$. Thus, in explicit SGD, a small learning rate can guarantee stability, but this comes at a price in convergence, which will be of order $O(n^{-\alpha\lambda_{\min}})$. On the other hand, a large learning rate increases the convergence rate, but at a price in stability.
In stark contrast, the implicit procedure is unconditionally stable. The eigenvalues of $Q_1^n$ are $\prod_{j=1}^n (1 + \alpha\lambda_i/j)^{-1} = O(n^{-\alpha\lambda_i})$, and thus are guaranteed to be less than 1 for any choice of the learning rate parameter $\alpha$, because $(1 + \alpha\lambda_i/j)^{-1} < 1$ for every $i$ and $\alpha > 0$. The critical difference from explicit SGD is that a small $\alpha$ is no longer required for stability, because the eigenvalues of $Q_1^n$ are always less than 1.
Based on this analysis, the magnitude of $P_1^n$ can become arbitrarily large, and thus explicit SGD is likely to diverge numerically. In contrast, $Q_1^n$ is guaranteed to be bounded, and so, under any misspecification of the learning rate parameter, implicit SGD is guaranteed to remain bounded. The instability of explicit SGD is well known and requires careful work to avoid in practice. In the following section, we focus on the related task of selecting the learning rate sequence.
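The contrast is easy to reproduce numerically. The sketch below revisits the illustrative Gaussian mean model from before, scaled to have a large curvature $\lambda$ and paired with a deliberately large $\alpha$ (all constants are ours): the explicit iterates blow up in the early iterations, while the implicit iterates remain bounded.

```python
import numpy as np

# Numerical illustration of the stability contrast: Gaussian mean model scaled so
# the Fisher information is lam, with a deliberately large alpha * lam.
rng = np.random.default_rng(3)
theta_star, lam, alpha = 1.0, 50.0, 1.0     # alpha * lam = 50, far above 2
ys = theta_star + rng.standard_normal(200)

th_sgd = th_im = 0.0
for n, y in enumerate(ys, start=1):
    a = alpha / n
    th_sgd = th_sgd + a * lam * (y - th_sgd)          # factor (1 - a*lam): |1 - 50/n| >> 1 early on
    th_im = (th_im + a * lam * y) / (1.0 + a * lam)   # factor (1 + a*lam)^{-1} < 1 always
    if n in (1, 5, 20, 200):
        print(n, th_sgd, th_im)
# The explicit iterates transiently reach astronomical magnitudes, mirroring the
# growth of P_1^n; the implicit iterates never leave the vicinity of the data.
```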
14.3.3 Choice of Learning Rate Sequence
An interesting observation about the asymptotic variance results (Equation 14.17) is that, for any choice of the learning rate parameter $\alpha$, it holds that
$$\alpha^2 \left(2\alpha C \mathcal{I}(\theta_\star) - I\right)^{-1} C\, \mathcal{I}(\theta_\star)\, C \succeq \mathcal{I}(\theta_\star)^{-1} \tag{14.19}$$
where $A \succeq B$ indicates that $A - B$ is nonnegative definite for two matrices $A$ and $B$. Hence, both SGD methods incur an information loss when compared to the maximum-likelihood estimator, and the loss can be quantified exactly through Equation 14.17. Such information loss can be avoided if we set $C = \mathcal{I}(\theta_\star)^{-1}$ and $\alpha = 1$; equivalently, we could use a sequence of matrices $C_n$ that converges to $\mathcal{I}(\theta_\star)^{-1}$, as in Sakrison's procedure (Sakrison, 1965).
However, this requires knowledge of the Fisher information matrix at the true parameter value $\theta_\star$, which is unknown. The Venter process (Venter, 1967) was the first method to follow an adaptive approach, in which the Fisher information matrix is estimated as the procedure runs; it was later analyzed and extended by several other authors (Fabian, 1973, Lai and Robbins, 1979, Amari et al., 2000, Bottou and Le Cun, 2005). Adaptive methods that approximate the matrix $\mathcal{I}(\theta_\star)$ (e.g., through a quasi-Newton scheme) have recently been applied with considerable success (Schraudolph et al., 2007, Bordes et al., 2009); we review such methods in Section 14.3.5.1.
An efficiency loss is generally unavoidable in first-order SGD, that is, when $C = I$. In this case, there is no loss only when the eigenvalues $\lambda_i$ of the Fisher information matrix are all identical. When the eigenvalues are distinct, one reasonable way to set the learning rate parameter $\alpha$ is to minimize the trace of the asymptotic variance matrix in Equation 14.17, that is, to solve
$$\hat{\alpha} = \arg\min_\alpha \sum_i \frac{\alpha^2 \lambda_i}{2\alpha\lambda_i - 1} \tag{14.20}$$
under the constraint that $\alpha > 1/(2\lambda_{\min})$, thus making an undesirable but necessary compromise to ensure convergence in all parameter components. In practice, however, the eigenvalues $\lambda_i$ are unknown and need to be estimated from the data. This problem has received significant attention recently, and several methods exist (see Karoui, 2008, and references therein).
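Given eigenvalue estimates, Equation 14.20 is a one-dimensional optimization that standard numerical routines handle directly; the sketch below uses SciPy, with hypothetical eigenvalue estimates standing in for the unknown spectrum of $\mathcal{I}(\theta_\star)$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Solve Equation 14.20 for the learning rate parameter alpha, given (estimated)
# eigenvalues of the Fisher information matrix; the eigenvalues here are made up.
lam = np.array([0.5, 1.0, 4.0, 10.0])
lam_min = lam.min()

def asymptotic_trace(alpha):
    # trace of the asymptotic variance matrix in Equation 14.17 when C = I
    return np.sum(alpha ** 2 * lam / (2 * alpha * lam - 1))

eps = 1e-8
res = minimize_scalar(asymptotic_trace,
                      bounds=(1 / (2 * lam_min) + eps, 100.0),
                      method="bounded")
print(res.x)  # alpha_hat; compare with the simpler choice alpha = 1/lam_min = 2.0
```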
Several more options for setting the learning rate are available, owing to the voluminous research literature on learning rate sequences for stochastic approximation and SGD. In general, the learning rate for explicit SGD should be of the form $a_n = \alpha(\alpha\beta + n)^{-1}$. The parameter $\alpha$ controls the asymptotic variance (see Equation 14.17), and a reasonable choice is the solution of Equation 14.20, which requires estimates of the eigenvalues of the Fisher information matrix $\mathcal{I}(\theta_\star)$. A simpler choice is $\alpha = 1/\lambda_{\min}$, where $\lambda_{\min}$ is the minimum eigenvalue of $\mathcal{I}(\theta_\star)$; the value $1/\lambda_{\min}$ is an approximate solution of Equation 14.20 with good empirical performance (Xu, 2011, Toulis et al., 2014). The parameter $\beta$ is used to stabilize explicit SGD. In particular, it normalizes the learning rate to account for the variance of the stochastic gradient, $\mathrm{Var}\left(\nabla\ell(\theta_n; X_n, Y_n)\right) = \mathcal{I}(\theta_\star) + O(a_n)$, for points near $\theta_\star$. One reasonable value is $\beta = \mathrm{trace}(\mathcal{I}(\theta_\star))$, which can be estimated easily by summing norms of the score function, that is, $\hat{\beta} = \sum_{i=1}^n \|\nabla\ell(\theta_{i-1}; X_i, Y_i)\|^2$. This idea is extended to multiple dimensions by Amari et al. (2000), Duchi et al. (2011), and Schaul et al. (2012); we discuss it further in Section 14.3.5.1.
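A minimal sketch of this normalization, under our own illustrative model and variable names, could look as follows: a running sum of squared score norms serves as $\hat\beta$, which then enters the learning rate $a_n = \alpha(\alpha\hat\beta + n)^{-1}$.

```python
import numpy as np

# Explicit SGD with the stabilized rate a_n = alpha / (alpha * beta_hat + n),
# where beta_hat accumulates squared norms of observed scores (Gaussian mean
# model; constants are illustrative, and using the running beta_hat at each
# step is one plausible reading of the estimator above).
rng = np.random.default_rng(4)
theta_star, alpha = 2.0, 1.0
theta, beta_hat = 0.0, 0.0

for n in range(1, 10_001):
    y = theta_star + rng.standard_normal()
    score = y - theta                      # gradient of the log-likelihood
    beta_hat += score ** 2                 # running normalization estimate
    a_n = alpha / (alpha * beta_hat + n)   # stabilized Robbins-Monro rate
    theta += a_n * score

print(theta)  # close to theta_star = 2.0
```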
For implicit SGD, a learning rate sequence $a_n = \alpha(\alpha + n)^{-1}$ works well in practice (Toulis et al., 2014). As before, $\alpha$ controls statistical efficiency, and we can set $\alpha = 1/\lambda_{\min}$, as in explicit SGD. The additional stability term $\beta$ of explicit SGD is unnecessary in implicit SGD because the implicit method performs an indirect normalization of the learning rate; this is similar to the shrinkage described in Equation 14.4.
Ultimately, tuning the learning rate sequence depends on problem-specific considerations, and a considerable variety of sequences have been employed in practice (George and Powell, 2006, Schaul et al., 2012). Principled design of learning rates in first-order SGD remains an important research topic; for example, recent work has investigated variance reduction techniques (Johnson and Zhang, 2013, Wang et al., 2013) and even constant learning rates for least-squares models (Bach and Moulines, 2013). Second-order methods, which essentially maintain multiple learning rates, one for each parameter component, are discussed in Section 14.3.5.1.
14.3.4 Efficient Computation of Implicit Methods
The update in implicit SGD (Equation 14.3) is a p-dimensional fixed-point equation, which
is generally hard to solve. However, in many statistical models, Equation 14.3 can be
reduced to a one-dimensional fixed-point equation, which can be computed very fast using
a numerical root-finding method.
Consider a linear statistical model in which $\nabla\ell(\theta; X_n, Y_n)$ depends on $\theta$ only through the linear term $X_n^\top \theta$. A large family of models satisfies this condition: generalized linear models, generalized additive models, proportional hazards models, and so on. We write $\ell(\theta; X_n, Y_n) = g_n(X_n^\top \theta)$, where the dependence of $g$ on $(X_n, Y_n)$ is suppressed into the subscript $n$. Then, $\nabla\ell(\theta; X_n, Y_n) = g_n'(X_n^\top \theta)\, X_n$, and therefore the direction of the gradient of the log-likelihood is parameter free. It follows that the implicit procedure can be written as
$$\theta_n^{\mathrm{im}} = \theta_{n-1}^{\mathrm{im}} + a_n \lambda_n \nabla\ell(\theta_{n-1}^{\mathrm{im}}; X_n, Y_n) \tag{14.21}$$
where the gradient is now calculated at the previous estimate $\theta_{n-1}^{\mathrm{im}}$ and $\lambda_n$ is an appropriate scalar. We now derive $\lambda_n$ by combining the definition of implicit SGD (Equations 14.3 and 14.21):
$$\theta_{n-1}^{\mathrm{im}} + a_n \lambda_n \nabla\ell(\theta_{n-1}^{\mathrm{im}}; X_n, Y_n) = \theta_{n-1}^{\mathrm{im}} + a_n \nabla\ell(\theta_n^{\mathrm{im}}; X_n, Y_n)$$
$$\Rightarrow\quad \lambda_n\, g_n'(X_n^\top \theta_{n-1}^{\mathrm{im}}) = g_n'(X_n^\top \theta_n^{\mathrm{im}}) \tag{14.22}$$
Substituting Equation 14.21 into Equation 14.22, we get
$$\lambda_n = \frac{g_n'\left(X_n^\top \theta_{n-1}^{\mathrm{im}} + a_n \lambda_n \|X_n\|^2\, g_n'(X_n^\top \theta_{n-1}^{\mathrm{im}})\right)}{g_n'(X_n^\top \theta_{n-1}^{\mathrm{im}})} \tag{14.23}$$
Equation 14.23 is a one-dimensional fixed-point equation with respect to $\lambda_n$. Thus, the implicit iterate $\theta_n^{\mathrm{im}}$ of Equation 14.3 can be efficiently computed by first obtaining $\lambda_n$ from the fixed-point Equation 14.23 and then applying the update in Equation 14.21.