258 Handbook of Big Data
where a_t is a learning rate sequence that satisfies the Robbins–Monro conditions of Section
14.2. Procedure 14.46 is known as the temporal differences (TD) learning algorithm (Sutton,
1988). Implicit versions of this algorithm have recently emerged to solve the known stability
issues of the classical TD algorithm (Wang and Bertsekas, 2013; Tamar et al., 2014). For
example, Tamar et al. (2014) consider computing the term θ_t^⊤ φ_t at the future iterate, the
resulting implicit TD algorithm being defined as
θ_{t+1} = (I + a_t φ_t φ_t^⊤)^{−1} [θ_t + a_t (R(x_t) + γ θ_t^⊤ φ_{t+1}) φ_t]  (14.47)
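Because the matrix inverted in Equation 14.47 is only a rank-one perturbation of the identity, one implicit TD step reduces to a small linear solve. The following NumPy sketch illustrates the update; the function name and interface are illustrative, not from the chapter:

```python
import numpy as np

def implicit_td_step(theta, phi_t, phi_next, reward, a_t, gamma):
    """One implicit TD(0) update as in Equation 14.47:
    theta_{t+1} = (I + a_t phi_t phi_t^T)^{-1}
                  [theta + a_t (R(x_t) + gamma theta^T phi_{t+1}) phi_t]."""
    d = theta.shape[0]
    td_target = reward + gamma * (theta @ phi_next)   # R(x_t) + γ θ_t^⊤ φ_{t+1}
    rhs = theta + a_t * td_target * phi_t             # bracketed term
    A = np.eye(d) + a_t * np.outer(phi_t, phi_t)      # I + a_t φ_t φ_t^⊤
    return np.linalg.solve(A, rhs)
```

Since I + a_t φ_t φ_t^⊤ is a rank-one update of the identity, the solve could also be replaced by a closed-form Sherman–Morrison step, avoiding the d × d system entirely for large feature dimensions.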
Similar to implicit SGD, iteration 14.47 stabilizes the classical TD iteration 14.46. With the
advent of online multiagent markets, methods and applications in reinforcement learning have
been receiving renewed research attention (Gosavi, 2009).
14.4.4 Deep Learning
Deep learning is the task of estimating parameters of statistical models that can be
represented by multiple layers of nonlinear operations, such as neural networks (Bengio,
2009). Such models, also referred to as deep architectures, consist of units that can perform
a basic prediction task, and are grouped in layers such that the output of one layer forms
the input of another layer that sits directly on top. Furthermore, the models are usually
augmented with latent units that are defined to represent structured quantities of interest,
such as edges or shapes in an image.
One basic building block of deep architectures is the Restricted Boltzmann Machine
(RBM). The complete-data density for one data point (X, Y ) of the states of hidden and
observed input units, respectively, is given by
f(X, Y; θ) = exp{−b^⊤ Y − c^⊤ X − X^⊤ W Y} / Z(θ)  (14.48)
where θ = (b, c, W) are the model parameters, and the function Z(θ) = Σ_{X,Y} exp{−b^⊤ Y −
c^⊤ X − X^⊤ W Y}, also known as the partition function, acts as the normalizing constant.
Furthermore, the sample spaces for X and Y are discrete (e.g., binary) and finite. The
observed-data density is thus f(Y; θ) = Σ_X f(X, Y; θ). Let H(X, Y; θ) = b^⊤ Y + c^⊤ X +
X^⊤ W Y, such that f(X, Y; θ) = e^{−H(X,Y;θ)} / Z(θ). Also consider the observed data Y =
{Y_1, Y_2, ..., Y_N} and missing data X = {X_1, X_2, ..., X_N}.
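For intuition, the density in Equation 14.48 can be checked numerically on a toy binary RBM, for which Z(θ) is small enough to enumerate exactly. This is a sketch with illustrative names, not the chapter's code:

```python
import itertools
import numpy as np

def energy(x, y, b, c, W):
    """H(X, Y; θ) = b^⊤ Y + c^⊤ X + X^⊤ W Y, as defined in the text."""
    return b @ y + c @ x + x @ W @ y

def partition_function(b, c, W):
    """Z(θ): brute-force sum of exp{-H} over all binary (X, Y) pairs.
    Feasible only for toy dimensions; the sum has 2^(dx+dy) terms."""
    dx, dy = len(c), len(b)
    return sum(
        np.exp(-energy(np.array(x), np.array(y), b, c, W))
        for x in itertools.product([0, 1], repeat=dx)
        for y in itertools.product([0, 1], repeat=dy)
    )

rng = np.random.default_rng(0)
b, c = rng.normal(size=3), rng.normal(size=2)   # 3 observed, 2 hidden units
W = rng.normal(size=(2, 3))
Z = partition_function(b, c, W)

# f(X, Y; θ) = exp{-H}/Z sums to one over the joint sample space.
total = sum(
    np.exp(-energy(np.array(x), np.array(y), b, c, W)) / Z
    for x in itertools.product([0, 1], repeat=len(c))
    for y in itertools.product([0, 1], repeat=len(b))
)
```

The exponential cost of this enumeration is exactly why the gradient identity and sampling schemes discussed next are needed for realistic model sizes.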
Through simple algebra, one can obtain the gradient of the log-likelihood of observed
data in the following convenient form:
∇ℓ(θ; Y) = −[E(∇H(X, Y; θ)) − E(∇H(X, Y; θ) | Y)]  (14.49)

where H(X, Y; θ) = Σ_{n=1}^{N} H(X_n, Y_n; θ). In practical situations, the data points (X_n, Y_n)
are binary. Therefore, the conditional distribution of the missing data X_n | Y_n is readily
available through a logistic regression model, and thus the second term of Equation 14.49
is easy to sample from. Similarly, Y_n | X_n is easy to sample from. However, the first term in
Equation 14.49 requires sampling from the joint distribution of the complete data (X, Y),
which conceptually is easy to do using the aforementioned conditionals and a Gibbs sampling
scheme (Geman and Geman, 1984). However, the domain of both X and Y is typically
very large (it may comprise thousands or millions of units), and thus running a full Gibbs
sampler on the joint distribution is computationally infeasible.
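Because the conditionals factorize over units, one block-Gibbs sweep reduces to two vectorized logistic draws. A minimal sketch, assuming the sign convention f ∝ exp{−H} of Equation 14.48 (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_x_given_y(y, c, W, rng):
    """Under f ∝ exp{-H}, the hidden units are conditionally independent:
    P(X_j = 1 | Y) = sigmoid(-(c_j + (W y)_j))."""
    p = sigmoid(-(c + W @ y))
    return (rng.random(p.shape) < p).astype(float)

def sample_y_given_x(x, b, W, rng):
    """Symmetrically, P(Y_k = 1 | X) = sigmoid(-(b_k + (W^T x)_k))."""
    p = sigmoid(-(b + W.T @ x))
    return (rng.random(p.shape) < p).astype(float)
```

Alternating these two draws yields the Gibbs scheme mentioned above; the difficulty is only that reaching the stationary joint distribution may require prohibitively many sweeps when the layers are large.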
The method of contrastive divergence (Hinton, 2002; Carreira-Perpinan and Hinton,
2005) has been applied to training such models with considerable success. The algorithm
proceeds as follows for steps i = 1, 2, ...:
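The general shape of one such step with a single Gibbs sweep (CD-1) can be sketched as follows: the conditional term of Equation 14.49 is computed at the observed data, while the intractable model term E(∇H) is approximated by a one-sweep Gibbs reconstruction. This is a hedged sketch of the technique under the sign convention of Equation 14.48, not the chapter's exact listing, and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_gradient(y, b, c, W, rng):
    """CD-1 estimate of ∇ℓ(θ; Y) in Equation 14.49 for one observation y.
    ∇ℓ = -[E(∇H) - E(∇H | Y)] = E(∇H | Y) - E(∇H), with
    ∇_b H = Y, ∇_c H = X, ∇_W H = X Y^⊤."""
    # Conditional (data) term: exact mean E[X | y] under f ∝ exp{-H}.
    px = sigmoid(-(c + W @ y))
    # One Gibbs sweep: sample X | y, reconstruct Y | X, then recompute E[X | Y'].
    x = (rng.random(px.shape) < px).astype(float)
    py = sigmoid(-(b + W.T @ x))
    y_recon = (rng.random(py.shape) < py).astype(float)
    px_recon = sigmoid(-(c + W @ y_recon))
    # Data term minus reconstruction term approximates Equation 14.49.
    grad_b = y - y_recon
    grad_c = px - px_recon
    grad_W = np.outer(px, y) - np.outer(px_recon, y_recon)
    return grad_b, grad_c, grad_W
```

A training loop would then ascend the log-likelihood, θ ← θ + a_i ∇ℓ̂, typically averaging the estimate over a minibatch; running k > 1 Gibbs sweeps before the reconstruction (CD-k) reduces the bias of the model-term approximation.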