By the theory of stochastic approximation, procedure 14.40 converges to the observed-data maximum-likelihood estimate θ̂. In contrast, procedure 14.39 will not converge with a constant α; rather, it will reach a point in the vicinity of θ̂ more rapidly than Equation 14.40 and then oscillate around θ̂. Further online EM algorithms have been developed by several authors (Neal and Hinton, 1998, Cappé and Moulines, 2009). Examples of a growing body of applications of such methods can be found in works by Neal and Hinton (1998), Sato and Ishii (2000), Liu et al. (2006), and Cappé (2011).
14.4.2 MCMC Sampling
As before, we need to slightly extend our notation to a Bayesian setting. Let θ denote model parameters with an assumed prior distribution π(θ). A common task in Bayesian inference is to sample from the posterior distribution f(θ|Y) ∝ π(θ)f(Y|θ), given N observed data points Y = {Y_1, ..., Y_N}.
The Hamiltonian Monte Carlo (HMC) (Neal, 2011) is an MCMC method in which auxiliary parameters p are introduced to improve sampling from f(θ|Y). In the augmented parameter space, we consider a function H(θ, p) = U(θ) + K(p) ∈ ℝ₊, where U(θ) = −log f(θ|Y) and K(p) = (1/2)p′Mp, M being positive definite. Next, we consider the density

h(θ, p|Y) = exp{−H(θ, p)} = exp{−U(θ) − K(p)} = f(θ|Y) × N(p; 0, M⁻¹)
In this parameterization, the variables p are independent of θ. Assuming an initial state (θ_0, p_0), sampling with HMC proceeds in iterations indexed by n = 1, 2, ..., as follows:

1. Sample p* ∼ N(0, M⁻¹).
2. Using Hamiltonian dynamics, compute (θ_n, p_n) = ODE(θ_{n−1}, p*).
3. Perform a Metropolis–Hastings step for the proposed transition (θ_{n−1}, p*) → (θ_n, p_n) with acceptance probability min[1, exp(−H(θ_n, p_n) + H(θ_{n−1}, p*))].
Step 2 is the key idea in HMC. The parameters (θ, p) are mapped to a physical system, where θ is the position of the system and p is its momentum. The potential energy of the physical system is U(θ) and its kinetic energy is K(p). The function H is known as the Hamiltonian. The Hamiltonian dynamics are a set of ordinary differential equations (ODE) that govern the movement of the system, and thus determine the future values of (θ, p) given a pair of current values. Because the system is closed, its Hamiltonian H(θ, p) = U(θ) + K(p) is conserved along the dynamics. Thus, in Step 3 of HMC it holds that −H(θ_n, p_n) + H(θ_{n−1}, p*) = 0, and the acceptance probability is 1, assuming that the solution of the ODE is exact. This is a significant improvement over generic Metropolis–Hastings, where it is usually hard to achieve high acceptance probabilities.
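To make Steps 1 through 3 concrete, the following is a minimal sketch of one HMC iteration in Python, assuming a unit mass matrix M = I and a leapfrog discretization in place of an exact ODE solution; the function names log_post and grad_log_post, the step size eps, and the number of leapfrog steps n_steps are illustrative choices, not part of the algorithm as stated above.

import numpy as np

def hmc_step(theta, log_post, grad_log_post, eps=0.1, n_steps=20, rng=None):
    """One HMC iteration for U(theta) = -log f(theta | Y), with M = I."""
    rng = np.random.default_rng() if rng is None else rng
    p_star = rng.standard_normal(theta.shape)       # Step 1: p* ~ N(0, M^{-1})
    theta_new, p_new = theta.copy(), p_star.copy()

    # Step 2: approximate the Hamiltonian ODE with n_steps leapfrog updates.
    p_new += 0.5 * eps * grad_log_post(theta_new)   # half-step in momentum
    for _ in range(n_steps - 1):
        theta_new += eps * p_new                    # full step in position
        p_new += eps * grad_log_post(theta_new)     # full step in momentum
    theta_new += eps * p_new
    p_new += 0.5 * eps * grad_log_post(theta_new)   # final half-step in momentum

    # Step 3: Metropolis-Hastings correction with H = U + K.
    H_old = -log_post(theta) + 0.5 * p_star.dot(p_star)
    H_new = -log_post(theta_new) + 0.5 * p_new.dot(p_new)
    if np.log(rng.uniform()) < H_old - H_new:       # accept w.p. min[1, exp(-H_new + H_old)]
        return theta_new
    return theta

With an exact ODE solution the two Hamiltonian values would coincide and every proposal would be accepted; the correction in the last lines only compensates for the discretization error of the leapfrog integrator.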
A special case of HMC, known as Langevin dynamics (Girolami and Calderhead, 2011), defines the sampling iterations as follows:

η_n ∼ N(0, εI)
θ_n = θ_{n−1} + (ε/2) (∇ log π(θ_{n−1}) + ∇ log f(Y|θ_{n−1})) + η_n        (14.41)

The sampling procedure 14.41 follows from HMC by a numerical solution of the ODE in Step 2 of the algorithm using the leapfrog method (Neal, 2011). The parameter ε > 0 in Equation 14.41 determines the size of the leapfrog step in the numerical solution of the Hamiltonian differential equations.
Welling and Teh (2011) studied a simple modification of Langevin dynamics (Equation 14.41) using a stochastic gradient as follows:

η_n ∼ N(0, ε_n I)
θ_n = θ_{n−1} + (ε_n/2) (∇ log π(θ_{n−1}) + (N/b) Σ_{i ∈ batch} ∇ log f(Y_i|θ_{n−1})) + η_n        (14.42)
The step sizes ε_n > 0 satisfy the typical Robbins–Monro requirements, that is, Σ ε_i = ∞ and Σ ε_i² < ∞. Procedure 14.42 uses stochastic gradients averaged over a batch of b data points, a technique usually employed in SGD to reduce noise in stochastic gradients.
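The sketch below implements procedure 14.42 (stochastic gradient Langevin dynamics), assuming the data Y are stored as rows of a NumPy array, user-supplied functions grad_log_prior and grad_log_lik (the latter summed over a minibatch), and the polynomial step-size schedule ε_n = a(b0 + n)^(−γ); these names and defaults are illustrative choices rather than prescriptions from the text.

import numpy as np

def sgld(Y, theta0, grad_log_prior, grad_log_lik, n_iter=10000,
         batch_size=32, a=1e-3, b0=10.0, gamma=0.55, rng=None):
    """Stochastic gradient Langevin dynamics, following procedure 14.42."""
    rng = np.random.default_rng() if rng is None else rng
    N = Y.shape[0]
    theta = np.array(theta0, dtype=float)
    samples = []
    for n in range(1, n_iter + 1):
        eps_n = a * (b0 + n) ** (-gamma)            # Robbins-Monro step sizes
        idx = rng.choice(N, size=batch_size, replace=False)
        grad = grad_log_prior(theta) + (N / batch_size) * grad_log_lik(theta, Y[idx])
        eta_n = rng.normal(scale=np.sqrt(eps_n), size=theta.shape)  # eta_n ~ N(0, eps_n I)
        theta = theta + 0.5 * eps_n * grad + eta_n
        samples.append(theta.copy())
    return np.array(samples)

With γ in (0.5, 1], the schedule satisfies Σ ε_n = ∞ and Σ ε_n² < ∞; in practice the early iterates are usually discarded as burn-in before the remaining draws are treated as approximate posterior samples.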
Sato and Nakagawa (2014) proved that procedure 14.42 converges to the true posterior
f(θ|Y) using an elegant theory of stochastic calculus. Sampling through stochastic gradient
Langevin dynamics has since generated a lot of related work in posterior sampling for large
datasets, and it is still a rapidly expanding research area with contributions from various
disciplines (Hoffman et al., 2013, Korattikara et al., 2014, Pillai and Smith, 2014).
14.4.3 Reinforcement Learning
Reinforcement learning is the multidisciplinary study of how autonomous agents perceive, learn, and interact with their environment (Bertsekas and Tsitsiklis, 1995). Typically, it is assumed that time t proceeds in discrete steps, and at every step an agent is at state x_t ∈ X, where X is the state space. On entering a state x_t, two things happen. First, the agent receives a probabilistic reward R(x_t) ∈ ℝ, and, second, the agent takes an action a ∈ A, where A denotes the action space. This action is determined by the agent's policy, which is a function π : X → A, mapping a state to an action. Nature then decides the transition to state x_{t+1} according to a probability distribution that is unknown to the agent.
One important task in reinforcement learning is to estimate the value function V^π(x), which quantifies the expected value of a specific state x ∈ X with respect to policy π, defined as

V^π(x) = E(R(x)) + γ E(R(x_1)) + γ² E(R(x_2)) + ···        (14.43)

where:
x_t denotes the state that will be reached starting at x after t transitions
γ ∈ (0, 1) is a parameter that discounts future rewards

Uncertainty in R(x_t) includes the uncertainty of the state x_t, because of the stochasticity in state transitions, and the uncertainty from the reward distribution. Thus, V^π(x) admits a recursive definition as follows:

V^π(x) = E(R(x)) + γ E(V^π(x_1))        (14.44)
When the state is a high-dimensional vector, one popular approach is to use a linear approximation for V(x), such that V(x) = θ*′φ(x), where φ(x) maps a state to a feature space with fewer dimensions and θ* is a vector of fixed parameters. If the agent is at state x_t, then the recursive equation (Equation 14.44) can be rewritten as

E( R(x_t) − (θ*′φ_t − γ θ*′φ_{t+1}) | φ_t ) = 0        (14.45)

where we set φ_t = φ(x_t) for notational convenience. Similar to SGD, this suggests a stochastic approximation method to estimate θ* through the following iteration:

θ_{t+1} = θ_t + a_t [R(x_t) − (θ_t′φ_t − γ θ_t′φ_{t+1})] φ_t        (14.46)
where a_t is a learning rate sequence that satisfies the Robbins–Monro conditions of Section 14.2. Procedure 14.46 is known as the temporal differences (TD) learning algorithm (Sutton, 1988). Implicit versions of this algorithm have recently emerged to solve the known stability issues of the classical TD algorithm (Wang and Bertsekas, 2013, Tamar et al., 2014). For example, Tamar et al. (2014) consider computing the term θ_t′φ_t at the future iterate, the resulting implicit TD algorithm being defined as

θ_{t+1} = (I + a_t φ_t φ_t′)⁻¹ [θ_t + a_t (R(x_t) + γ θ_t′φ_{t+1}) φ_t]        (14.47)
Similar to implicit SGD, iteration 14.47 stabilizes the TD iteration 14.46. With the advent
of online multiagent markets, methods and applications in reinforcement learning have been
receiving a renewed stream of research effort (Gosavi, 2009).
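The following sketch contrasts the standard TD update of Equation 14.46 with the implicit update of Equation 14.47 for a linear value approximation; the feature vectors, reward, and learning rate are supplied by the caller, and the discount factor default is an illustrative choice.

import numpy as np

def td_update(theta, phi_t, phi_next, reward, a_t, gamma=0.9):
    """Standard TD(0) update (Equation 14.46) with V(x) = theta' phi(x)."""
    td_error = reward - (theta @ phi_t - gamma * theta @ phi_next)
    return theta + a_t * td_error * phi_t

def implicit_td_update(theta, phi_t, phi_next, reward, a_t, gamma=0.9):
    """Implicit TD(0) update (Equation 14.47); solving the small linear
    system plays the role of the matrix inverse and stabilizes large steps."""
    d = len(phi_t)
    A = np.eye(d) + a_t * np.outer(phi_t, phi_t)
    rhs = theta + a_t * (reward + gamma * theta @ phi_next) * phi_t
    return np.linalg.solve(A, rhs)

Because I + a_t φ_t φ_t′ is a rank-one update of the identity, the solve could also be carried out in closed form with the Sherman–Morrison formula, keeping the per-step cost linear in the number of features.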
14.4.4 Deep Learning
Deep learning is the task of estimating parameters of statistical models that can be
represented by multiple layers of nonlinear operations, such as neural networks (Bengio,
2009). Such models, also referred to as deep architectures, consist of units that can perform
a basic prediction task, and are grouped in layers such that the output of one layer forms
the input of another layer that sits directly on top. Furthermore, the models are usually
augmented with latent units that are defined to represent structured quantities of interest,
such as edges or shapes in an image.
One basic building block of deep architectures is the Restricted Boltzmann Machine (RBM). The complete-data density for one data point (X, Y) of the states of hidden and observed input units, respectively, is given by

f(X, Y; θ) = exp{−b′Y − c′X − X′WY} / Z(θ)        (14.48)

where θ = (b, c, W) are the model parameters, and the function Z(θ) = Σ_{X,Y} exp{−b′Y − c′X − X′WY}, also known as the partition function, acts as the normalizing constant. Furthermore, the sample spaces for X and Y are discrete (e.g., binary) and finite. The observed-data density is thus f(Y; θ) = Σ_X f(X, Y; θ). Let H(X, Y; θ) = b′Y + c′X + X′WY, such that f(X, Y; θ) = e^{−H(X,Y;θ)} / Z(θ). Also consider the observed data Y = {Y_1, Y_2, ..., Y_N} and missing data X = {X_1, X_2, ..., X_N}.
Through simple algebra, one can obtain the gradient of the log-likelihood of the observed data in the following convenient form:

∇ℓ(θ; Y) = [ E(∇H(X, Y; θ)) − E(∇H(X, Y; θ) | Y) ]        (14.49)

where H(X, Y; θ) = Σ_{n=1}^{N} H(X_n, Y_n; θ). In practical situations, the data points (X_n, Y_n) are binary. Therefore, the conditional distribution of the missing data X_n | Y_n is readily available through a logistic regression model, and thus the second term of Equation 14.49 is easy to sample from. Similarly, Y_n | X_n is easy to sample from. However, the first term in Equation 14.49 requires sampling from the joint distribution of the complete data (X, Y), which conceptually is easy to do using the aforementioned conditionals and a Gibbs sampling scheme (Geman and Geman, 1984). However, the domain of both X and Y is typically very large, for example, comprising thousands or millions of units, and thus a full Gibbs sampler on the joint distribution is computationally impractical.
The method of contrastive divergence (Hinton, 2002, Carreira-Perpinan and Hinton, 2005) has been applied to training such models with considerable success. The algorithm proceeds as follows for steps i = 1, 2, ...:
1. Sample one state Y^(i) from the empirical distribution of the observed data Y.
2. Sample X^(i) | Y^(i), that is, the hidden state.
3. Sample Y^(i,new) | X^(i).
4. Sample X^(i,new) | Y^(i,new).
5. Evaluate the gradient (Equation 14.49) using (X^(i), Y^(i)) for the second term and the sample (X^(i,new), Y^(i,new)) for the first term.
6. Update the parameters in θ using constant-step-size SGD and the estimated gradient from Step 5.
In other words, contrastive divergence attempts to estimate ∇ℓ(θ; Y) in Equation 14.49. The estimate is biased because (X^(i,new), Y^(i,new)) is treated as a draw from the exact joint distribution of (X, Y), whereas it is produced by a single Gibbs iteration starting from the observed and imputed data (X^(i), Y^(i)). In theory, Steps 3 and 4 could be repeated k times; as k → ∞, the sampling distribution of (X^(i,new), Y^(i,new)) would be the exact joint distribution of (X, Y), leading to unbiased estimation of ∇ℓ(θ; Y) in Equation 14.49. Surprisingly, it has been empirically observed that k = 1 is enough for good performance in many learning tasks (Hinton, 2002, Taylor et al., 2006, Salakhutdinov et al., 2007, Bengio, 2009, Bengio and Delalleau, 2009), which is a testament to the power and flexibility of stochastic gradient methods.
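As a concrete illustration, here is a minimal sketch of CD-1 for a binary RBM under the parameterization of Equation 14.48, where H(X, Y; θ) = b′Y + c′X + X′WY and f ∝ exp{−H}; with this sign convention the conditional activations are sigmoids of the negated linear terms. The parameter shapes, the constant learning rate lr, and a single pass over the data are assumptions made for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_epoch(Y, b, c, W, lr=0.01, rng=None):
    """One pass of contrastive divergence (CD-1) over binary observed data Y,
    with visible bias b, hidden bias c, and weights W of shape (hidden, visible)."""
    rng = np.random.default_rng() if rng is None else rng
    for y in Y:                                     # Step 1: draw from the empirical distribution
        # Step 2: sample the hidden state X | Y.
        x = (rng.random(c.shape) < sigmoid(-(c + W @ y))).astype(float)
        # Step 3: sample a reconstruction Y_new | X.
        y_new = (rng.random(b.shape) < sigmoid(-(b + W.T @ x))).astype(float)
        # Step 4: sample X_new | Y_new.
        x_new = (rng.random(c.shape) < sigmoid(-(c + W @ y_new))).astype(float)
        # Steps 5-6: plug both samples into Equation 14.49 (grad of H w.r.t. W is
        # the outer product X Y') and take a constant-step-size gradient ascent step.
        b += lr * (y_new - y)
        c += lr * (x_new - x)
        W += lr * (np.outer(x_new, y_new) - np.outer(x, y))
    return b, c, W

Repeating Steps 3 and 4 k > 1 times before the update gives the CD-k variant discussed above; this sketch fixes k = 1.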
14.5 Glossary
SGD: Stochastic gradient descent.
References
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation,
10(2):251–276.
Amari, S.-I., Chen, T.-P., and Cichocki, A. (1997). Stability analysis of learning algorithms
for blind source separation. Neural Networks, 10(8):1345–1351.
Amari, S.-I., Park, H., and Fukumizu, K. (2000). Adaptive method of realizing natural
gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409.
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pp. 773–781.
Bather, J. (1989). Stochastic Approximation: A Generalisation of the Robbins-Monro
Procedure, volume 89. Mathematical Sciences Institute, Cornell University, Ithaca, NY.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2(1):1–127.
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6):1601–1621.
Bertsekas, D. P. (2011). Incremental proximal methods for large scale convex optimization.
Mathematical Programming, 129(2):163–195.
Bertsekas, D. P. and Tsitsiklis, J. N. (1995). Neuro-dynamic programming: An overview. In
Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560–564.
IEEE, New Orleans, LA.
Blum, J. R. (1954). Multidimensional stochastic approximation methods. The Annals of
Mathematical Statistics, 25:737–744.
Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic
gradient descent. The Journal of Machine Learning Research, 10:1737–1754.
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In
Lechevallier, Y. and Saporta, G. (eds.), Proceedings of COMPSTAT, pp. 177–186.
Springer, Berlin, Germany.
Bottou, L. and Le Cun, Y. (2005). Online learning for very large datasets. Applied Stochastic
Models in Business and Industry, 21(2):137–151.
Bottou, L. and Murata, N. (2002). Stochastic approximations and efficient learning.
The Handbook of Brain Theory and Neural Networks, 2nd edition. The MIT Press,
Cambridge, MA.
Broyden, C. G. (1965). A class of methods for solving nonlinear simultaneous equations.
Mathematics of Computation, 19:577–593.
Cappé, O. (2011). Online EM algorithm for Hidden Markov models. Journal of Computational and Graphical Statistics, 20(3):728–749.
Cappé, O. and Moulines, E. (2009). Online Expectation–Maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613.
Carreira-Perpinan, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In
Cowell, R. and Ghahramani, Z. (eds.), Proceedings of the 10th International Workshop
on Artificial Intelligence and Statistics, pp. 33–40. The Society for Artificial Intelligence
and Statistics.
Chung, K. L. (1954). On a stochastic approximation method. The Annals of Mathematical
Statistics, 25:463–483.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online
learning and stochastic optimization. The Journal of Machine Learning Research,
12:2121–2159.
Dupuis, P. and Simha, R. (1991). On sampling controlled stochastic approximation. IEEE
Transactions on Automatic Control, 36(8):915–924.
Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of
Mathematical Statistics, 39:1327–1332.