By the theory of stochastic approximation, procedure 14.40 converges to the observed-data maximum-likelihood estimate θ̂. In contrast, procedure 14.39 will not converge with a constant α; rather, it will reach a point in the vicinity of θ̂ more rapidly than Equation 14.40 and then oscillate around θ̂. Further online EM algorithms have been developed by several authors (Neal and Hinton, 1998, Cappé and Moulines, 2009). Examples of a growing body of applications of such methods can be found in works by Neal and Hinton (1998), Sato and Ishii (2000), Liu et al. (2006), and Cappé (2011).
14.4.2 MCMC Sampling
As before, we need to slightly extend our notation to a Bayesian setting. Let θ denote model parameters with an assumed prior distribution π(θ). A common task in Bayesian inference is to sample from the posterior distribution f(θ|Y) ∝ π(θ)f(Y|θ), given N observed data points Y = {Y_1, ..., Y_N}.
The Hamiltonian Monte Carlo (HMC) (Neal, 2011) is an MCMC method in which auxiliary parameters p are introduced to improve sampling from f(θ|Y). In the augmented parameter space, we consider a function H(θ, p) = U(θ) + K(p) ∈ ℝ₊, where U(θ) = −log f(θ|Y) and K(p) = (1/2)p′Mp, M being positive definite. Next, we consider the density

h(θ, p|Y) = exp{−H(θ, p)} = exp{−U(θ) − K(p)} = f(θ|Y) × N(p; 0, M⁻¹)
In this parameterization, the variables p are independent of θ. Assuming an initial state (θ_0, p_0), sampling with HMC proceeds in iterations indexed by n = 1, 2, ..., as follows:

1. Sample p* ∼ N(0, M⁻¹).
2. Using Hamiltonian dynamics, compute (θ_n, p_n) = ODE(θ_{n−1}, p*).
3. Perform a Metropolis–Hastings step for the proposed transition (θ_{n−1}, p*) → (θ_n, p_n) with acceptance probability min[1, exp(−H(θ_n, p_n) + H(θ_{n−1}, p*))].
Step 2 is the key idea in HMC. The parameters (θ, p) are mapped to a physical system, where θ is the position of the system and p is its momentum. The potential energy of the physical system is U(θ) and its kinetic energy is K(p). The function H is known as the Hamiltonian. The Hamiltonian dynamics are a set of ordinary differential equations (ODE) that govern the movement of the system, and thus determine the future values of (θ, p) given a pair of current values. Because the system is closed, its Hamiltonian H(θ, p) = U(θ) + K(p) is conserved along the dynamics. Thus, in Step 3 of HMC it holds that −H(θ_n, p_n) + H(θ_{n−1}, p*) = 0, and the acceptance probability is 1, assuming that the solution of the ODE is exact. This is a significant improvement over generic Metropolis–Hastings, where it is usually hard to achieve high acceptance probabilities.
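To make Steps 1 through 3 concrete, the following is a minimal sketch of one HMC iteration in Python, assuming a unit mass matrix M = I and a leapfrog discretization in place of an exact ODE solution; the function names log_post and grad_log_post, the step size eps, and the number of leapfrog steps n_steps are illustrative choices, not part of the algorithm as stated above.

import numpy as np

def hmc_step(theta, log_post, grad_log_post, eps=0.1, n_steps=20, rng=None):
    """One HMC iteration for U(theta) = -log f(theta | Y), with M = I."""
    rng = np.random.default_rng() if rng is None else rng
    p_star = rng.standard_normal(theta.shape)       # Step 1: p* ~ N(0, M^{-1})
    theta_new, p_new = theta.copy(), p_star.copy()

    # Step 2: approximate the Hamiltonian ODE with n_steps leapfrog updates.
    p_new += 0.5 * eps * grad_log_post(theta_new)   # half-step in momentum
    for _ in range(n_steps - 1):
        theta_new += eps * p_new                    # full step in position
        p_new += eps * grad_log_post(theta_new)     # full step in momentum
    theta_new += eps * p_new
    p_new += 0.5 * eps * grad_log_post(theta_new)   # final half-step in momentum

    # Step 3: Metropolis-Hastings correction with H = U + K.
    H_old = -log_post(theta) + 0.5 * p_star.dot(p_star)
    H_new = -log_post(theta_new) + 0.5 * p_new.dot(p_new)
    if np.log(rng.uniform()) < H_old - H_new:       # accept w.p. min[1, exp(-H_new + H_old)]
        return theta_new
    return theta

With an exact ODE solution the two Hamiltonian values would coincide and every proposal would be accepted; the correction in the last lines only compensates for the discretization error of the leapfrog integrator.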
A special case of HMC, known as Langevin dynamics (Girolami and Calderhead, 2011), defines the sampling iterations as follows:

η_n ∼ N(0, εI)
θ_n = θ_{n−1} + (ε/2) (∇ log π(θ_{n−1}) + ∇ log f(Y|θ_{n−1})) + η_n        (14.41)

The sampling procedure 14.41 follows from HMC by a numerical solution of the ODE in Step 2 of the algorithm using the leapfrog method (Neal, 2011). The parameter ε > 0 in Equation 14.41 determines the size of the leapfrog step in the numerical solution of the Hamiltonian differential equations.
Welling and Teh (2011) studied a simple modification of Langevin dynamics (Equation 14.41) using a stochastic gradient as follows:

η_n ∼ N(0, ε_n I)
θ_n = θ_{n−1} + (ε_n/2) (∇ log π(θ_{n−1}) + (N/b) Σ_{i ∈ batch} ∇ log f(Y_i|θ_{n−1})) + η_n        (14.42)
The step sizes ε_n > 0 satisfy the typical Robbins–Monro requirements, that is, Σ ε_i = ∞ and Σ ε_i² < ∞. Procedure 14.42 uses stochastic gradients averaged over a batch of b data points, a technique usually employed in SGD to reduce noise in stochastic gradients.
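The sketch below implements procedure 14.42 (stochastic gradient Langevin dynamics), assuming the data Y are stored as rows of a NumPy array, user-supplied functions grad_log_prior and grad_log_lik (the latter summed over a minibatch), and the polynomial step-size schedule ε_n = a(b0 + n)^(−γ); these names and defaults are illustrative choices rather than prescriptions from the text.

import numpy as np

def sgld(Y, theta0, grad_log_prior, grad_log_lik, n_iter=10000,
         batch_size=32, a=1e-3, b0=10.0, gamma=0.55, rng=None):
    """Stochastic gradient Langevin dynamics, following procedure 14.42."""
    rng = np.random.default_rng() if rng is None else rng
    N = Y.shape[0]
    theta = np.array(theta0, dtype=float)
    samples = []
    for n in range(1, n_iter + 1):
        eps_n = a * (b0 + n) ** (-gamma)            # Robbins-Monro step sizes
        idx = rng.choice(N, size=batch_size, replace=False)
        grad = grad_log_prior(theta) + (N / batch_size) * grad_log_lik(theta, Y[idx])
        eta_n = rng.normal(scale=np.sqrt(eps_n), size=theta.shape)  # eta_n ~ N(0, eps_n I)
        theta = theta + 0.5 * eps_n * grad + eta_n
        samples.append(theta.copy())
    return np.array(samples)

With γ in (0.5, 1], the schedule satisfies Σ ε_n = ∞ and Σ ε_n² < ∞; in practice the early iterates are usually discarded as burn-in before the remaining draws are treated as approximate posterior samples.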
Sato and Nakagawa (2014) proved that procedure 14.42 converges to the true posterior
f(θ|Y) using an elegant theory of stochastic calculus. Sampling through stochastic gradient
Langevin dynamics has since generated a lot of related work in posterior sampling for large
datasets, and it is still a rapidly expanding research area with contributions from various
disciplines (Hoffman et al., 2013, Korattikara et al., 2014, Pillai and Smith, 2014).
14.4.3 Reinforcement Learning
Reinforcement learning is the multidisciplinary study of how autonomous agents perceive, learn, and interact with their environment (Bertsekas and Tsitsiklis, 1995). Typically, it is assumed that time t proceeds in discrete steps, and at every step an agent is at state x_t ∈ X, where X is the state space. On entering a state x_t, two things happen. First, the agent receives a probabilistic reward R(x_t) ∈ ℝ, and, second, the agent takes an action a ∈ A, where A denotes the action space. This action is determined by the agent's policy, which is a function π : X → A, mapping a state to an action. Nature then decides the transition to state x_{t+1} according to a probability distribution that is unknown to the agent.
One important task in reinforcement learning is to estimate the value function V^π(x), which quantifies the expected value of a specific state x ∈ X with respect to policy π, defined as

V^π(x) = E(R(x)) + γ E(R(x_1)) + γ² E(R(x_2)) + ···        (14.43)

where:
x_t denotes the state that will be reached starting at x after t transitions
γ ∈ (0, 1) is a parameter that discounts future rewards

Uncertainty in R(x_t) includes the uncertainty of the state x_t, because of the stochasticity in state transitions, and the uncertainty from the reward distribution. Thus, V^π(x) admits a recursive definition as follows:

V^π(x) = E(R(x)) + γ E(V^π(x_1))        (14.44)
When the state is a high-dimensional vector, one popular approach is to use a linear approximation for V(x), such that V(x) = θ*′φ(x), where φ(x) maps a state to a feature space with fewer dimensions and θ* is a vector of fixed parameters. If the agent is at state x_t, then the recursive equation (Equation 14.44) can be rewritten as

E( R(x_t) − (θ*′φ_t − γ θ*′φ_{t+1}) | φ_t ) = 0        (14.45)

where we set φ_t = φ(x_t) for notational convenience. Similar to SGD, this suggests a stochastic approximation method to estimate θ* through the following iteration:

θ_{t+1} = θ_t + a_t [R(x_t) − (θ_t′φ_t − γ θ_t′φ_{t+1})] φ_t        (14.46)
where a_t is a learning rate sequence that satisfies the Robbins–Monro conditions of Section 14.2. Procedure 14.46 is known as the temporal differences (TD) learning algorithm (Sutton, 1988). Implicit versions of this algorithm have recently emerged to solve the known stability issues of the classical TD algorithm (Wang and Bertsekas, 2013, Tamar et al., 2014). For example, Tamar et al. (2014) consider computing the term θ_t′φ_t at the future iterate, the resulting implicit TD algorithm being defined as

θ_{t+1} = (I + a_t φ_t φ_t′)⁻¹ [θ_t + a_t (R(x_t) + γ θ_t′φ_{t+1}) φ_t]        (14.47)
Similar to implicit SGD, iteration 14.47 stabilizes the TD iteration 14.46. With the advent
of online multiagent markets, methods and applications in reinforcement learning have been
receiving a renewed stream of research effort (Gosavi, 2009).
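The following sketch contrasts the standard TD update of Equation 14.46 with the implicit update of Equation 14.47 for a linear value approximation; the feature vectors, reward, and learning rate are supplied by the caller, and the discount factor default is an illustrative choice.

import numpy as np

def td_update(theta, phi_t, phi_next, reward, a_t, gamma=0.9):
    """Standard TD(0) update (Equation 14.46) with V(x) = theta' phi(x)."""
    td_error = reward - (theta @ phi_t - gamma * theta @ phi_next)
    return theta + a_t * td_error * phi_t

def implicit_td_update(theta, phi_t, phi_next, reward, a_t, gamma=0.9):
    """Implicit TD(0) update (Equation 14.47); solving the small linear
    system plays the role of the matrix inverse and stabilizes large steps."""
    d = len(phi_t)
    A = np.eye(d) + a_t * np.outer(phi_t, phi_t)
    rhs = theta + a_t * (reward + gamma * theta @ phi_next) * phi_t
    return np.linalg.solve(A, rhs)

Because I + a_t φ_t φ_t′ is a rank-one update of the identity, the solve could also be carried out in closed form with the Sherman–Morrison formula, keeping the per-step cost linear in the number of features.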
14.4.4 Deep Learning
Deep learning is the task of estimating parameters of statistical models that can be
represented by multiple layers of nonlinear operations, such as neural networks (Bengio,
2009). Such models, also referred to as deep architectures, consist of units that can perform
a basic prediction task, and are grouped in layers such that the output of one layer forms
the input of another layer that sits directly on top. Furthermore, the models are usually
augmented with latent units that are defined to represent structured quantities of interest,
such as edges or shapes in an image.
One basic building block of deep architectures is the Restricted Boltzmann Machine (RBM). The complete-data density for one data point (X, Y) of the states of hidden and observed input units, respectively, is given by

f(X, Y; θ) = exp{−b′Y − c′X − X′WY} / Z(θ)        (14.48)

where θ = (b, c, W) are the model parameters, and the function Z(θ) = Σ_{X,Y} exp{−b′Y − c′X − X′WY}, also known as the partition function, acts as the normalizing constant. Furthermore, the sample spaces for X and Y are discrete (e.g., binary) and finite. The observed-data density is thus f(Y; θ) = Σ_X f(X, Y; θ). Let H(X, Y; θ) = b′Y + c′X + X′WY, such that f(X, Y; θ) = e^{−H(X,Y;θ)} / Z(θ). Also consider the observed data Y = {Y_1, Y_2, ..., Y_N} and missing data X = {X_1, X_2, ..., X_N}.
Through simple algebra, one can obtain the gradient of the log-likelihood of the observed data in the following convenient form:

∇ℓ(θ; Y) = [ E(∇H(X, Y; θ)) − E(∇H(X, Y; θ) | Y) ]        (14.49)

where H(X, Y; θ) = Σ_{n=1}^{N} H(X_n, Y_n; θ). In practical situations, the data points (X_n, Y_n) are binary. Therefore, the conditional distribution of the missing data X_n | Y_n is readily available through a logistic regression model, and thus the second term of Equation 14.49 is easy to sample from. Similarly, Y_n | X_n is easy to sample from. However, the first term in Equation 14.49 requires sampling from the joint distribution of the complete data (X, Y), which conceptually is easy to do using the aforementioned conditionals and a Gibbs sampling scheme (Geman and Geman, 1984). However, the domain of both X and Y is typically very large, for example, comprising thousands or millions of units, and thus a full Gibbs sampler on the joint distribution is computationally impractical.
The method of contrastive divergence (Hinton, 2002, Carreira-Perpinan and Hinton, 2005) has been applied to training such models with considerable success. The algorithm proceeds as follows for steps i = 1, 2, ...:
1. Sample one state Y^(i) from the empirical distribution of the observed data Y.
2. Sample X^(i) | Y^(i), that is, the hidden state.
3. Sample Y^(i,new) | X^(i).
4. Sample X^(i,new) | Y^(i,new).
5. Evaluate the gradient (Equation 14.49) using (X^(i), Y^(i)) for the second term and the sample (X^(i,new), Y^(i,new)) for the first term.
6. Update the parameters in θ using constant-step-size SGD and the estimated gradient from Step 5.
In other words, contrastive divergence attempts to estimate ∇ℓ(θ; Y) in Equation 14.49. The estimate is biased because (X^(i,new), Y^(i,new)) is treated as a draw from the exact joint distribution of (X, Y), whereas it is produced by a single Gibbs iteration starting from the observed and imputed data (X^(i), Y^(i)). In theory, Steps 3 and 4 could be repeated k times; as k → ∞, the sampling distribution of (X^(i,new), Y^(i,new)) would be the exact joint distribution of (X, Y), leading to unbiased estimation of ∇ℓ(θ; Y) in Equation 14.49. Surprisingly, it has been empirically observed that k = 1 is enough for good performance in many learning tasks (Hinton, 2002, Taylor et al., 2006, Salakhutdinov et al., 2007, Bengio, 2009, Bengio and Delalleau, 2009), which is a testament to the power and flexibility of stochastic gradient methods.
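As a concrete illustration, here is a minimal sketch of CD-1 for a binary RBM under the parameterization of Equation 14.48, where H(X, Y; θ) = b′Y + c′X + X′WY and f ∝ exp{−H}; with this sign convention the conditional activations are sigmoids of the negated linear terms. The parameter shapes, the constant learning rate lr, and a single pass over the data are assumptions made for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_epoch(Y, b, c, W, lr=0.01, rng=None):
    """One pass of contrastive divergence (CD-1) over binary observed data Y,
    with visible bias b, hidden bias c, and weights W of shape (hidden, visible)."""
    rng = np.random.default_rng() if rng is None else rng
    for y in Y:                                     # Step 1: draw from the empirical distribution
        # Step 2: sample the hidden state X | Y.
        x = (rng.random(c.shape) < sigmoid(-(c + W @ y))).astype(float)
        # Step 3: sample a reconstruction Y_new | X.
        y_new = (rng.random(b.shape) < sigmoid(-(b + W.T @ x))).astype(float)
        # Step 4: sample X_new | Y_new.
        x_new = (rng.random(c.shape) < sigmoid(-(c + W @ y_new))).astype(float)
        # Steps 5-6: plug both samples into Equation 14.49 (grad of H w.r.t. W is
        # the outer product X Y') and take a constant-step-size gradient ascent step.
        b += lr * (y_new - y)
        c += lr * (x_new - x)
        W += lr * (np.outer(x_new, y_new) - np.outer(x, y))
    return b, c, W

Repeating Steps 3 and 4 k > 1 times before the update gives the CD-k variant discussed above; this sketch fixes k = 1.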
14.5 Glossary
SGD: Stochastic gradient descent.
References
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation,
10(2):251–276.
Amari, S.-I., Chen, T.-P., and Cichocki, A. (1997). Stability analysis of learning algorithms
for blind source separation. Neural Networks, 10(8):1345–1351.
Amari, S.-I., Park, H., and Fukumizu, K. (2000). Adaptive method of realizing natural
gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409.
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pp. 773–781.
Bather, J. (1989). Stochastic Approximation: A Generalisation of the Robbins-Monro
Procedure, volume 89. Mathematical Sciences Institute, Cornell University, Ithaca, NY.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2(1):1–127.
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6):1601–1621.
Bertsekas, D. P. (2011). Incremental proximal methods for large scale convex optimization.
Mathematical Programming, 129(2):163–195.
Bertsekas, D. P. and Tsitsiklis, J. N. (1995). Neuro-dynamic programming: An overview. In
Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560–564.
IEEE, New Orleans, LA.
Blum, J. R. (1954). Multidimensional stochastic approximation methods. The Annals of
Mathematical Statistics, 25:737–744.
Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic
gradient descent. The Journal of Machine Learning Research, 10:1737–1754.
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In
Lechevallier, Y. and Saporta, G. (eds.), Proceedings of COMPSTAT, pp. 177–186.
Springer, Berlin, Germany.
Bottou, L. and Le Cun, Y. (2005). Online learning for very large datasets. Applied Stochastic
Models in Business and Industry, 21(2):137–151.
Bottou, L. and Murata, N. (2002). Stochastic approximations and efficient learning.
The Handbook of Brain Theory and Neural Networks, 2nd edition. The MIT Press,
Cambridge, MA.
Broyden, C. G. (1965). A class of methods for solving nonlinear simultaneous equations.
Mathematics of Computation, 19:577–593.
Cappé, O. (2011). Online EM algorithm for Hidden Markov models. Journal of Computational and Graphical Statistics, 20(3):728–749.
Cappé, O. and Moulines, E. (2009). Online Expectation–Maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613.
Carreira-Perpinan, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In
Cowell, R. and Ghahramani, Z. (eds.), Proceedings of the 10th International Workshop
on Artificial Intelligence and Statistics, pp. 33–40. The Society for Artificial Intelligence
and Statistics.
Chung, K. L. (1954). On a stochastic approximation method. The Annals of Mathematical
Statistics, 25:463–483.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online
learning and stochastic optimization. The Journal of Machine Learning Research,
12:2121–2159.
Dupuis, P. and Simha, R. (1991). On sampling controlled stochastic approximation. IEEE
Transactions on Automatic Control, 36(8):915–924.
Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of
Mathematical Statistics, 39:1327–1332.