Appendix A

Theoretical foundations

A.1 Matrix Algebra

Basic Manipulations and Properties

A column vector x with d dimensions can be written

$$\mathbf{x} \equiv \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_d \end{bmatrix}^T,$$

where the transpose operator, superscript T, allows it to be written as a transposed row vector, which is useful when defining vectors in running text. In this book, vectors are assumed to be column vectors.

The transpose AT of a matrix A involves copying all the rows of the original matrix A into the columns of AT. Thus a matrix with m rows and n columns becomes a matrix with n rows and m columns:

$$\mathbf{A} \equiv \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \qquad \mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}.$$

The dot product or inner product between vector x and vector y of the same dimensionality yields a scalar quantity,

$$\mathbf{x} \cdot \mathbf{y} = \langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^{d} x_i y_i.$$

For example, the Euclidean norm can be written as the square root of the dot product of a vector $\mathbf{x}$ with itself, $\|\mathbf{x}\|_2 = \sqrt{\mathbf{x}^T\mathbf{x}}$.

The tensor product or outer product ⊗ between an m-dimensional vector x and an n-dimensional vector y yields a matrix,

$$\mathbf{x} \otimes \mathbf{y} \equiv \mathbf{x}\mathbf{y}^T = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}.$$

Given an N row and K column matrix A, and a K row and M column matrix B, if we write each row of A as $\mathbf{a}_n^T$ and each column of B as $\mathbf{b}_m$, the matrix product AB (i.e., matrix multiplication) can be written

$$\mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}_1^T \\ \mathbf{a}_2^T \\ \vdots \\ \mathbf{a}_N^T \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_M \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^T\mathbf{b}_1 & \mathbf{a}_1^T\mathbf{b}_2 & \cdots & \mathbf{a}_1^T\mathbf{b}_M \\ \mathbf{a}_2^T\mathbf{b}_1 & \mathbf{a}_2^T\mathbf{b}_2 & \cdots & \mathbf{a}_2^T\mathbf{b}_M \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}_N^T\mathbf{b}_1 & \mathbf{a}_N^T\mathbf{b}_2 & \cdots & \mathbf{a}_N^T\mathbf{b}_M \end{bmatrix}.$$

If we write each column of A as $\mathbf{a}_k$ and each row of B as $\mathbf{b}_k^T$, the matrix product AB can also be written in terms of tensor products

$$\mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_K \end{bmatrix} \begin{bmatrix} \mathbf{b}_1^T \\ \mathbf{b}_2^T \\ \vdots \\ \mathbf{b}_K^T \end{bmatrix} = \sum_{k=1}^{K} \mathbf{a}_k \mathbf{b}_k^T.$$

The elementwise product or Hadamard product of two matrices that are of the same size is

$$\mathbf{A}\circ\mathbf{B} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \circ \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} & a_{12}b_{12} & \cdots & a_{1n}b_{1n} \\ a_{21}b_{21} & a_{22}b_{22} & \cdots & a_{2n}b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}b_{m1} & a_{m2}b_{m2} & \cdots & a_{mn}b_{mn} \end{bmatrix}.$$

A square matrix A with n rows and n columns is invertible if there exists a square matrix B=A−1 such that AB=BA=I, where I is the identity matrix consisting of all zeros except that all diagonal elements are one. Square matrices that are not invertible are called singular. A square matrix A is singular if and only if its determinant, det(A), is zero. The following equation for the inverse shows why a zero determinant implies that a matrix is not invertible:

$$\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}\,\mathbf{C}^T,$$

where det(A) is the determinant of A and C is another matrix known as the cofactor matrix. Finally, if a matrix A is orthogonal, then $\mathbf{A}^{-1} = \mathbf{A}^T$.
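
As an illustration (not part of the original text), the manipulations above can be checked numerically with NumPy; the vectors and matrices here are arbitrary examples:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 4.0], [2.0, 5.0]])

print(x @ y)                      # dot (inner) product  x^T y
print(np.outer(x, y))             # tensor (outer) product  x y^T
print(A.T)                        # transpose
print(A @ B)                      # matrix product
# matrix product as a sum of outer products of columns of A and rows of B
print(sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1])))
print(A * B)                      # elementwise (Hadamard) product
print(np.linalg.det(A))           # determinant; nonzero, so A is invertible
print(np.linalg.inv(A) @ A)       # A^{-1} A = I (up to rounding error)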

Derivatives of Vector and Scalar Functions

Given a scalar function y of an m-dimensional column vector x,

$$\frac{\partial y}{\partial \mathbf{x}} \equiv \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_m} \end{bmatrix} = \mathbf{g}.$$

This quantity is known as the gradient, g. We have defined it here as a column vector, but it is sometimes defined as a row vector. Defining the gradient as a column vector implies certain orientations for the other quantities defined below, so keep in mind that the derivatives that follow are sometimes defined as the transposes of those given here. Using the definition and orientation above we can write the types of parameter updates frequently used in algorithms like gradient descent in vector form with expressions such as $\boldsymbol{\theta}_{\text{new}} = \boldsymbol{\theta}_{\text{old}} - \eta\,\mathbf{g}$, where $\boldsymbol{\theta}$ is a parameter (column) vector and $\eta$ is a learning rate.

Given a scalar x and n-dimensional vector function y,

$$\frac{\partial \mathbf{y}}{\partial x} \equiv \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_n}{\partial x} \end{bmatrix}.$$

For an m-dimensional vector x and an n-dimensional vector y, the Jacobian matrix is given by

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \equiv \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_m} & \frac{\partial y_2}{\partial x_m} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix}.$$

The Jacobian is sometimes defined as the transpose of this quantity, even given the other definitions above. Watch out for the implications. The derivative of a scalar function y=f(X) with respect to an m×n dimensional matrix X is known as a gradient matrix and is given by

$$\frac{\partial f}{\partial \mathbf{X}} \equiv \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \cdots & \frac{\partial y}{\partial x_{1n}} \\ \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{m1}} & \frac{\partial y}{\partial x_{m2}} & \cdots & \frac{\partial y}{\partial x_{mn}} \end{bmatrix} = \mathbf{G}.$$

Our choice for the orientations of these quantities means that the gradient matrix has the same layout as the original matrix, so updates to a parameter matrix X take the form $\mathbf{X}_{\text{new}} = \mathbf{X}_{\text{old}} - \eta\,\mathbf{G}$.
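
As a hedged illustration (not from the text), the following sketch approximates the gradient g of an example scalar function by finite differences and applies the update written above; the function, starting point, and learning rate are arbitrary choices:

import numpy as np

def f(theta):
    # an example scalar function of a parameter vector
    return np.sum((theta - np.array([1.0, -2.0, 0.5])) ** 2)

def numerical_gradient(f, theta, eps=1e-6):
    # finite-difference approximation of g = df/dtheta (same shape as theta)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = np.zeros(3)
eta = 0.1                                   # learning rate
for _ in range(100):
    theta = theta - eta * numerical_gradient(f, theta)   # theta_new = theta_old - eta * g
print(theta)   # approaches the minimizer [1, -2, 0.5]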

While many quantities can be expressed as scalars, vectors, or matrices, there are many that cannot. Inspired by the tabular visualization of Minka (2000), the scalar, vector, matrix, and tensor quantities resulting from taking the derivatives of different combinations of quantities are shown in Table A.1.

Table A.1

Quantities That Result From Various Derivatives (After Minka, 2000)

                 Scalar f                          Vector f                             Matrix F
Scalar x    Scalar: ∂f/∂x = g                Vector: ∂f/∂x ≡ [∂f_i/∂x] = g^T       Matrix: ∂F/∂x ≡ [∂f_ij/∂x] = G^T
Vector x    Vector: ∂f/∂x ≡ [∂f/∂x_i] = g    Matrix: ∂f/∂x ≡ [∂f_i/∂x_j] = G       Tensor: ∂F/∂x ≡ [∂f_ij/∂x_k]
Matrix X    Matrix: ∂f/∂X ≡ [∂f/∂x_ij] = G   Tensor: ∂f/∂X ≡ [∂f_i/∂x_jk]          Tensor: ∂F/∂X ≡ [∂f_ij/∂x_kl]

The Chain Rule

The chain rule for a function z, which is a function of y, which is a function of x, all of which are scalars, is

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x},$$

where the two terms could be reversed, because multiplication is commutative. Now, given an m-dimensional vector x, an n-dimensional vector y, and an o-dimensional vector z, if z=z(y(x)), then

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} \equiv \begin{bmatrix} \frac{\partial z_1}{\partial x_1} & \frac{\partial z_2}{\partial x_1} & \cdots & \frac{\partial z_o}{\partial x_1} \\ \frac{\partial z_1}{\partial x_2} & \frac{\partial z_2}{\partial x_2} & \cdots & \frac{\partial z_o}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_1}{\partial x_m} & \frac{\partial z_2}{\partial x_m} & \cdots & \frac{\partial z_o}{\partial x_m} \end{bmatrix},$$

where each entry in this m×o matrix can be computed using

$$\frac{\partial z_i}{\partial x_j} = \sum_{k=1}^{n} \frac{\partial y_k}{\partial x_j}\frac{\partial z_i}{\partial y_k} = \left[\frac{\partial \mathbf{y}}{\partial x_j}\right]\left[\frac{\partial z_i}{\partial \mathbf{y}}\right].$$

The vector form could be viewed as

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_m} & \frac{\partial y_2}{\partial x_m} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix} \begin{bmatrix} \frac{\partial z_1}{\partial y_1} & \frac{\partial z_2}{\partial y_1} & \cdots & \frac{\partial z_o}{\partial y_1} \\ \frac{\partial z_1}{\partial y_2} & \frac{\partial z_2}{\partial y_2} & \cdots & \frac{\partial z_o}{\partial y_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_1}{\partial y_n} & \frac{\partial z_2}{\partial y_n} & \cdots & \frac{\partial z_o}{\partial y_n} \end{bmatrix}, \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial \mathbf{z}}{\partial \mathbf{y}},$$

which yields the chain rule for vectors, where chains extend toward the left as opposed to the right as is often done with the scalar version. For the special case when the final function evaluates to a scalar, which is frequently encountered when optimizing a loss function, we have

$$\frac{\partial z}{\partial x_j} = \sum_{k=1}^{n} \frac{\partial y_k}{\partial x_j}\frac{\partial z}{\partial y_k}, \qquad \frac{\partial z}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial z}{\partial \mathbf{y}}.$$

The rule generalizes, so that if there were yet another vector function w that is a function of x, through z, then

$$\frac{\partial \mathbf{w}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial \mathbf{z}}{\partial \mathbf{y}}\frac{\partial \mathbf{w}}{\partial \mathbf{z}}.$$

The chain rule also generalizes to the case of a function of a matrix that is itself a function of another matrix. For example, for a matrix X, if matrix Y=f(X), the derivative of a function g(Y) is

$$\frac{\partial g(\mathbf{Y})}{\partial \mathbf{X}} = \frac{\partial g(f(\mathbf{X}))}{\partial \mathbf{X}}, \qquad \frac{\partial g(\mathbf{Y})}{\partial x_{ij}} = \sum_{k=1}^{K}\sum_{l=1}^{L} \frac{\partial g(\mathbf{Y})}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}}.$$

Computation Graphs and Backpropagation

Computation networks help to show how the gradients required for deep learning with backpropagation can be computed. They also provide the basis for many software packages for deep learning, which partially or fully automate the computations involved.

We begin with an example that computes intermediate quantities that are scalars, and then extend it to networks involving vectors that represent entire layers of variables at each node. Fig. A.1 gives a computation graph that implements the function $z_1(y_1, z_2(y_2(y_1), z_3(y_3(y_2(y_1)))))$, and shows how to compute the gradients. The chain rule for a scalar function a involving intermediate results $b_1,\ldots,b_K$ that are dependent on c is

$$\frac{\partial a(b_1,\ldots,b_K)}{\partial c} = \sum_{k=1}^{K} \frac{\partial a}{\partial b_k}\frac{\partial b_k}{\partial c}.$$

Figure A.1 Decomposing partial derivatives using a computation graph.

In the example, the partial derivative of z1 with respect to y1 therefore consists of three terms

$$\begin{aligned}\frac{\partial z_1}{\partial y_1} &= \underbrace{\frac{\partial z_1}{\partial y_1}}_{(1)} + \underbrace{\frac{\partial z_1}{\partial z_2}\frac{\partial z_2}{\partial y_2}\frac{\partial y_2}{\partial y_1}}_{(2)} + \underbrace{\frac{\partial z_1}{\partial z_2}\frac{\partial z_2}{\partial z_3}\frac{\partial z_3}{\partial y_3}\frac{\partial y_3}{\partial y_2}\frac{\partial y_2}{\partial y_1}}_{(3)} \\ &= \frac{\partial z_1}{\partial y_1} + \frac{\partial z_1}{\partial z_2}\left[\frac{\partial z_2}{\partial y_2} + \frac{\partial z_2}{\partial z_3}\frac{\partial z_3}{\partial y_3}\frac{\partial y_3}{\partial y_2}\right]\frac{\partial y_2}{\partial y_1}.\end{aligned}$$

The sums needed to compute this involve following back along the flows of the subcomputations that were performed to evaluate the original function. These can be implemented efficiently by passing them between nodes in the graph, as Fig. A.1 shows.

This high-level notion of following flows in a graph generalizes to deep networks involving entire layers of variables. If Fig. A.1 were drawn using a scalar for z1 and vectors for each of the other nodes, the partial derivatives could be replaced by their vector versions. It is necessary to reverse the order of the multiplications, because in the case of partial derivatives of vectors with respect to vectors, computations grow to the left, yielding

$$\frac{\partial z_1}{\partial \mathbf{y}_1} = \frac{\partial z_1}{\partial \mathbf{y}_1} + \frac{\partial \mathbf{y}_2}{\partial \mathbf{y}_1}\left[\frac{\partial \mathbf{z}_2}{\partial \mathbf{y}_2} + \frac{\partial \mathbf{y}_3}{\partial \mathbf{y}_2}\frac{\partial \mathbf{z}_3}{\partial \mathbf{y}_3}\frac{\partial \mathbf{z}_2}{\partial \mathbf{z}_3}\right]\frac{\partial z_1}{\partial \mathbf{z}_2}.$$

To see this, consider (1) the chain rule for the partial derivative of a scalar function z with a vector x as argument, but involving the computation of an intermediate vector y:

$$\frac{\partial z(\mathbf{y})}{\partial x_j} = \sum_{k=1}^{n} \frac{\partial y_k}{\partial x_j}\frac{\partial z}{\partial y_k}, \qquad \frac{\partial z}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial z}{\partial \mathbf{y}} = \mathbf{D}\mathbf{d},$$

and (2) the chain rule for the same scalar function z(x), but involving the computation of an intermediate matrix Y:

$$\frac{\partial z(\mathbf{Y})}{\partial \mathbf{x}} = \sum_{l=1}^{L} \frac{\partial \mathbf{y}_l}{\partial \mathbf{x}}\frac{\partial z}{\partial \mathbf{y}_l} = \sum_{l=1}^{L} \mathbf{D}_l \mathbf{d}_l.$$

Derivatives of Functions of Vectors and Matrices

Here are some useful derivatives for functions of vectors and matrices. Petersen and Pedersen (2012) gives an even larger list.

$$\begin{aligned}
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{A}\mathbf{x} &= \mathbf{A}^T \\
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T\mathbf{x} &= 2\mathbf{x} \\
\frac{\partial}{\partial \mathbf{a}}\,\mathbf{a}^T\mathbf{x} &= \frac{\partial}{\partial \mathbf{a}}\,\mathbf{x}^T\mathbf{a} = \mathbf{x} \\
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T\mathbf{A}\mathbf{x} &= \mathbf{A}\mathbf{x} + \mathbf{A}^T\mathbf{x} \\
\frac{\partial}{\partial \mathbf{A}}\,\mathbf{y}^T\mathbf{A}\mathbf{x} &= \mathbf{y}\mathbf{x}^T \\
\frac{\partial}{\partial \mathbf{x}}\,(\mathbf{a}-\mathbf{x})^T(\mathbf{a}-\mathbf{x}) &= -2(\mathbf{a}-\mathbf{x})
\end{aligned}$$

Notice that the first identity above would be equal to simply A, had we defined the Jacobian to be the transpose of our definition above.

For a symmetric matrix C (e.g., an inverse covariance matrix),

$$\begin{aligned}
\frac{\partial}{\partial \mathbf{a}}\,(\mathbf{a}-\mathbf{b})^T\mathbf{C}(\mathbf{a}-\mathbf{b}) &= 2\mathbf{C}(\mathbf{a}-\mathbf{b}) \\
\frac{\partial}{\partial \mathbf{b}}\,(\mathbf{a}-\mathbf{b})^T\mathbf{C}(\mathbf{a}-\mathbf{b}) &= -2\mathbf{C}(\mathbf{a}-\mathbf{b}) \\
\frac{\partial}{\partial \mathbf{w}}\,(\mathbf{y}-\mathbf{A}\mathbf{w})^T\mathbf{C}(\mathbf{y}-\mathbf{A}\mathbf{w}) &= -2\mathbf{A}^T\mathbf{C}(\mathbf{y}-\mathbf{A}\mathbf{w})
\end{aligned}$$

Vector Taylor Series Expansion, Second-Order Methods, and Learning Rates

The method of gradient descent, the interpretation of learning rates, and more sophisticated second-order methods can be viewed through the lens of the Taylor series expansion of a function. The approach presented below is also known as Newton’s method.

The Taylor expansion of a function near point xo can be written

$$f(x) = f(x_o) + \frac{f'(x_o)}{1!}(x-x_o) + \frac{f''(x_o)}{2!}(x-x_o)^2 + \frac{f^{(3)}(x_o)}{3!}(x-x_o)^3 + \cdots$$

Using the approximation up to the second-order (squared) terms in x, taking the derivative, setting the result to zero, and solving for x, gives

$$\begin{aligned}
0 &= \frac{d}{dx}\left[f(x_o) + f'(x_o)(x-x_o) + \frac{f''(x_o)}{2}(x-x_o)^2\right] = f'(x_o) + f''(x_o)(x-x_o), \\
&\text{thus, solving for } \delta x \equiv (x-x_o), \\
\delta x &= -\frac{f'(x_o)}{f''(x_o)}, \quad\text{or}\quad x = x_o - \frac{f'(x_o)}{f''(x_o)}.
\end{aligned}$$

This generalizes to the vector version of a Taylor series for a scalar function with vector arguments, where

$$f(\boldsymbol{\theta}) = f(\boldsymbol{\theta}_o) + \mathbf{g}_o^T(\boldsymbol{\theta}-\boldsymbol{\theta}_o) + \tfrac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_o)^T\mathbf{H}_o(\boldsymbol{\theta}-\boldsymbol{\theta}_o) + \cdots, \qquad \mathbf{g}_o = \frac{df}{d\boldsymbol{\theta}_o}, \quad \mathbf{H}_o = \frac{d}{d\boldsymbol{\theta}_o}\frac{df}{d\boldsymbol{\theta}_o} = \frac{d^2 f}{d\boldsymbol{\theta}_o^2}.$$

Using the identities given in the previous section, taking the derivative with respect to the parameter vector θ, setting it to zero, and then solving for the point at which the derivative of the quadratic approximation to the function is zero, yields

$$\frac{df(\boldsymbol{\theta})}{d\boldsymbol{\theta}} = \mathbf{0} = \mathbf{g}_o + \mathbf{H}_o(\boldsymbol{\theta}-\boldsymbol{\theta}_o), \qquad \Delta\boldsymbol{\theta} \equiv (\boldsymbol{\theta}-\boldsymbol{\theta}_o), \qquad \Delta\boldsymbol{\theta} = -\mathbf{H}_o^{-1}\mathbf{g}_o.$$

This means that for the update $\boldsymbol{\theta}_{\text{new}} = \boldsymbol{\theta}_o - \mathbf{H}_o^{-1}\mathbf{g}_o$, the learning rate used for gradient descent can be thought of in terms of a simple diagonal matrix approximation to the inverse Hessian matrix in a second-order method. In other words, using a simple learning rate is analogous to making the approximation $\mathbf{H}_o^{-1} = \eta\mathbf{I}$.
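
A brief sketch (not from the text) of the difference between the two kinds of update, for an assumed quadratic objective with known Hessian: the Newton step reaches the minimizer in one update, while gradient descent with a scalar learning rate takes many small steps:

import numpy as np

H_true = np.array([[3.0, 0.5], [0.5, 1.0]])      # positive definite Hessian
theta_star = np.array([1.0, -2.0])               # minimizer of the quadratic

def gradient(theta):
    return H_true @ (theta - theta_star)

theta = np.zeros(2)
# Newton step: theta_new = theta - H^{-1} g (solve avoids an explicit inverse)
theta_newton = theta - np.linalg.solve(H_true, gradient(theta))
print(theta_newton)                              # [1, -2] in a single step

# gradient descent with a scalar learning rate, i.e. H^{-1} approximated by eta*I
eta = 0.1
for _ in range(200):
    theta = theta - eta * gradient(theta)
print(theta)                                     # also approaches [1, -2], but slowly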

Full-blown second-order methods take more effective steps at each iteration. However, they can be expensive because of the need to compute this quantity. For convex problems like logistic regression, a popular second-order method known as L-BFGS builds an approximation to the Hessian. Here, L stands for limited memory and BFGS stands for the inventors of the approach, Broyden–Fletcher–Goldfarb–Shanno. Another family of approaches is known as the conjugate gradient algorithms; these involve working with the linear system of equations associated with $\mathbf{H}_o\mathbf{x} = -\mathbf{g}_o$ when solving for $\mathbf{x} = \Delta\boldsymbol{\theta}$, as opposed to computing the inverse of $\mathbf{H}_o$.

One needs to keep in mind that when solving nonconvex problems (e.g., learning for multilayer neural networks) the Hessian is not guaranteed to be positive definite, which means that it may not even be invertible. Consequently, the use of heuristic adaptive learning rates and momentum terms remains popular and effective for neural network methods.

Eigenvectors, Eigenvalues, and Covariance Matrices

There is a strong connection between eigenvectors, eigenvalues and diagonalization of a covariance matrix, and the method of principal component analysis. If λ is a scalar eigenvalue of a matrix A, there exists a vector x called an eigenvector of A such that $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$. Define a matrix Φ to consist of eigenvectors in each column, and define Λ as a matrix with the corresponding eigenvalues on the diagonal; then the matrix equation $\mathbf{A}\boldsymbol{\Phi} = \boldsymbol{\Phi}\boldsymbol{\Lambda}$ defines the eigenvalues and eigenvectors of A.

Many numerical linear algebra software packages (e.g., Matlab) can determine solutions to this equation. If the eigenvectors in Φ are orthogonal, which they are for symmetric matrices, the inverse of Φ equals its transpose, which implies that we could equally well write $\boldsymbol{\Phi}^T\mathbf{A}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$. To find the eigenvectors of a covariance matrix, set A=Σ, the covariance matrix. This yields the definition of the eigenvectors of a covariance matrix Σ as the set of orthogonal vectors, stored in a matrix Φ and normalized to have unit length, such that $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$. Since Λ is a diagonal matrix of eigenvalues, the operation $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi}$ has been used to diagonalize the covariance matrix.
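
As a concrete illustration (not from the text), the following NumPy sketch diagonalizes an example covariance matrix; numpy.linalg.eigh is used because the matrix is symmetric:

import numpy as np

Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])                 # an example covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)           # eigh handles symmetric matrices
Lambda = np.diag(eigvals)

print(np.allclose(Phi.T @ Sigma @ Phi, Lambda))   # True: Phi^T Sigma Phi = Lambda
print(np.allclose(Phi @ Lambda @ Phi.T, Sigma))   # True: Sigma = Phi Lambda Phi^T
print(np.allclose(Phi.T @ Phi, np.eye(2)))        # True: Phi^{-1} = Phi^T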

While these results may seem esoteric at first glance, their use is widespread. For example, in computer vision an eigenanalysis-based principal component analysis for face recognition yields what are known as “eigenfaces.” The general technique is widely used in numerous other contexts, and classic, well-cited eigenanalysis-based papers appear in many diverse fields.

The Singular Value Decomposition

The singular value decomposition is a type of matrix factorization that is widely used in data mining and machine learning settings and is implemented as a core routine in many numerical linear algebra packages. It decomposes a matrix X into the product of three matrices such that $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, where U has orthogonal columns, S is a diagonal matrix containing the singular values (normally) sorted along the diagonal, and V also has orthogonal columns. By keeping only the k largest singular values, this factorization allows the data matrix to be reconstructed in a way that is optimal in a least squares sense for each value of k. For any given k, we could therefore write $\mathbf{X} \approx \mathbf{U}_k\mathbf{S}_k\mathbf{V}_k^T$. Fig. 9.10 illustrates how this works visually.

In our discussion earlier on eigendecompositions we developed an expression for diagonalizing a covariance matrix Σ using $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$, where Φ holds the eigenvectors and Λ is a diagonal matrix of eigenvalues. This amounts to seeking a decomposition of the covariance matrix that factorizes into $\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^T$. This reveals the relationship between principal component analysis and the singular value decomposition applied to data stored in the columns of matrix X. We will use the fact that the covariance matrix Σ for mean-centered data stored as vectors in the columns of X is simply $\boldsymbol{\Sigma} = \mathbf{X}\mathbf{X}^T$. Since orthogonal matrices have the property that $\mathbf{U}\mathbf{U}^T = \mathbf{I}$, through the following substitution we can see that the matrix U of left singular vectors of X corresponds to the eigenvectors of the covariance matrix. In other words, to factorize a covariance matrix into $\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^T$, mean center the data and perform a singular value decomposition on X. Then the covariance matrix is $\boldsymbol{\Sigma} = \mathbf{X}\mathbf{X}^T = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V}\mathbf{S}^T\mathbf{U}^T = \mathbf{U}\mathbf{S}^2\mathbf{U}^T$, so $\mathbf{U} = \boldsymbol{\Phi}$ and $\mathbf{S} = \boldsymbol{\Lambda}^{1/2}$; in other words, the so-called singular values are the square roots of the eigenvalues.
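
The relationship can be checked numerically. The following sketch (not from the text) mean centers a random data matrix with one example per column, computes its singular value decomposition, and compares the result with the eigendecomposition of XX^T:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
X = X - X.mean(axis=1, keepdims=True)          # mean center each row (dimension)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = X @ X.T                                # (unnormalized) covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)
# eigh returns eigenvalues in ascending order; flip to match the sorted singular values
eigvals, Phi = eigvals[::-1], Phi[:, ::-1]

print(np.allclose(s ** 2, eigvals))            # singular values squared = eigenvalues
print(np.allclose(np.abs(U), np.abs(Phi)))     # same vectors, up to sign and rounding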

A.2 Fundamental Elements of Probabilistic Methods

Expectations

The expectation of a discrete random variable X is

$$E[X] = \sum_{x} x\,P(X=x),$$

where the sum is over all possible values for X. The conditional expectation for discrete random variable X given random variable Y=y has a similar form

$$E[X\,|\,Y=y] = \sum_{x} x\,P(X=x\,|\,Y=y).$$

Given a probability density function p(x) for a continuous random variable X,

$$E[X] = \int x\,p(x)\,dx.$$

The empirical expectation of a continuous-valued variable X is obtained by placing a Dirac delta function on each empirical observation or example, and normalizing by the number of examples, to define p(x). The expected value of a matrix is defined as a matrix of expected values.

The expectation of a function of a continuous random variable X and a discrete random variable Y is

$$E[f(X,Y)] = \sum_{y} \int f(x,y)\,p(x, Y=y)\,dx.$$

Expectations of sums of random variables are equal to sums of expectations, or

$$E[X+Y] = E[X] + E[Y].$$

If there is a scaling factor s and a bias or constant c, then

$$E[sX+c] = sE[X] + c.$$

The variance is defined as

$$\begin{aligned}\mathrm{Var}[X] &= \sum_{x}(x-E[X])^2 P(X=x) = E[(X-E[X])(X-E[X])] \\ &= E[X^2 - 2XE[X] + (E[X])^2] = E[X^2] - 2E[X]E[X] + (E[X])^2 = E[X^2] - E[X]^2.\end{aligned}$$

The expectation of the product of continuous random variables X and Y with joint probability p(x,y) is given by

$$E[XY] = \int\!\!\int xy\,p(x,y)\,dx\,dy.$$

The covariance between X and Y is given by

$$\begin{aligned}\mathrm{Cov}[X,Y] &= E[(X-E[X])(Y-E[Y])] = \sum_{x}\sum_{y}(x-E[X])(y-E[Y])\,P(X=x,Y=y) \\ &= E[XY] - E[X]E[Y].\end{aligned}$$

Therefore, when $\mathrm{Cov}[X,Y]=0$ we have $E[XY]=E[X]E[Y]$, and X and Y are said to be uncorrelated. Clearly $\mathrm{Cov}[X,X]=\mathrm{Var}[X]$. The covariance matrix for a d-dimensional continuous random variable x is obtained from

$$\mathrm{Cov}[\mathbf{x}] = \begin{bmatrix} \mathrm{Cov}(x_1,x_1) & \cdots & \mathrm{Cov}(x_1,x_d) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_d,x_1) & \cdots & \mathrm{Cov}(x_d,x_d) \end{bmatrix}.$$

Conjugate Priors

In more fully Bayesian methods one treats both variables and parameters as random quantities. The use of prior distributions over parameters can provide simple and well justified ways to regularize model parameters and avoid overfitting. Applying the Bayesian modeling philosophy and techniques can lead to simple adjustments to traditional maximum likelihood estimates. In particular, the use of a conjugate prior distribution for a parameter in an appropriately defined probability model means that the posterior distribution over that parameter will remain in the same form as the prior. This makes it easy to adapt traditional maximum likelihood estimates for parameters using simple weighted averages of the maximum likelihood estimate and the relevant parameters of the conjugate prior. We will see how this works for the Bernoulli, categorical, and Gaussian distributions below. Other more sophisticated Bayesian manipulations are also simplified through the use of conjugacy.

Bernoulli, Binomial, and Beta Distributions

The Bernoulli probability distribution is defined for binary random variables. Suppose $x \in \{0,1\}$; the probability of x=1 is given by π and the probability of x=0 is given by 1−π. The probability distribution can be written in the following way

$$P(x;\pi) = \pi^x(1-\pi)^{1-x}.$$

The binomial distribution generalizes the Bernoulli distribution. It defines the probability for a certain number of successes in a sequence of binary experiments, where the outcome of each experiment is governed by a Bernoulli distribution. The probability of exactly k successes in n experiments under the binomial distribution is

$$P(k;n,\pi) = \binom{n}{k}\pi^k(1-\pi)^{n-k},$$

and is defined for k=0, 1, 2, …, n, where

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

is the binomial coefficient. Intuitively, the binomial coefficient is needed to account for the fact that the definition of this distribution ignores the order of the results of the experiments: the k results where x=1 could have occurred anywhere in the sequence of the n experiments. The binomial coefficient gives the number of different ways in which one could have obtained the k results where x=1. Intuitively, the term $\pi^k$ is the probability of exactly k results where x=1, and $(1-\pi)^{n-k}$ is the probability of having exactly n−k results where x=0. These two terms are valid for each of the possible ways in which the sequence of outcomes could have occurred, and we therefore simply multiply by the number of possibilities.

The Beta distribution is defined for a random variable π where $0 \le \pi \le 1$. It uses two shape parameters $\alpha, \beta > 0$ such that

$$P(\pi;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}, \tag{A.1}$$

where $B(\alpha,\beta)$ is the beta function and serves as a normalization constant that ensures that the function integrates to one. The Beta distribution is useful because it can be used as a conjugate prior distribution for the Bernoulli and binomial distributions. Its mean is

$$\pi_B = \frac{\alpha}{\alpha+\beta},$$

and it can be shown that if the maximum likelihood estimate for the Bernoulli distribution is given by $\pi_{ML}$, then the posterior mean $\pi^*$ of the Beta distribution is

$$\pi^* = w\pi_B + (1-w)\pi_{ML},$$

where

$$w = \frac{\alpha+\beta}{\alpha+\beta+n},$$

and n is the number of examples used to estimate $\pi_{ML}$. The use of the posterior mean value $\pi^*$ as the regularized or smoothed estimate, replacing $\pi_{ML}$ in a Bernoulli model, is therefore justified under Bayesian principles by the fact that the mean value of the posterior predictive distribution of a Beta-Bernoulli model is equivalent to plugging the posterior mean parameters of the Beta into the Bernoulli, i.e.,

$$p(x\,|\,\mathcal{D}) = \int_0^1 \mathrm{Bern}(x\,|\,\pi)\,\mathrm{Beta}(\pi\,|\,\mathcal{D})\,d\pi = \mathrm{Bern}(x;\pi^*).$$

This supports the intuitive notion of thinking of α and β as imaginary observations for x=1 and x=0, respectively, and justifies it in a Bayesian sense.
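To make this concrete, here is a small numeric sketch (not from the text) in which α and β act as imaginary counts; the prior parameters and data are arbitrary example values:

alpha, beta = 2.0, 5.0          # prior pseudo-counts for x=1 and x=0
n1, n = 7, 20                   # observed successes and total observations

pi_ml = n1 / n                              # maximum likelihood estimate
pi_b = alpha / (alpha + beta)               # prior (Beta) mean
w = (alpha + beta) / (alpha + beta + n)     # weight on the prior mean
pi_star = w * pi_b + (1 - w) * pi_ml

# the same value follows directly from the Beta posterior Beta(alpha+n1, beta+n-n1)
print(pi_star, (alpha + n1) / (alpha + beta + n))   # both print 0.333...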

Categorical, Multinomial, and Dirichlet Distributions

The categorical distribution is defined for discrete random variables with more than two states; it generalizes the Bernoulli distribution. For K categories one might define $A \in \{a_1, a_2, \ldots, a_K\}$ or $x \in \{1, 2, \ldots, K\}$; however, the order of the integers used to encode the categories is arbitrary. If the probability of x being in state or category k is given by $\pi_k$, and if we use a one hot encoding for a vector representation x in which all the elements of x are zero except for exactly one dimension that is equal to 1, representing the state or category of x, then the categorical distribution is

$$P(\mathbf{x};\boldsymbol{\pi}) = \prod_{k=1}^{K}\pi_k^{x_k}.$$

The multinomial distribution generalizes the categorical distribution. Given multiple independent observations of a discrete random variable with a fixed categorical probability πk for each class k, the multinomial distribution defines the probability of observing a particular number of instances of each category. If the vector x is defined as the number of times each category has been observed, then the multinomial distribution can be expressed as

$$P(\mathbf{x};n,\boldsymbol{\pi}) = \left(\frac{n!}{x_1!\cdots x_K!}\right)\prod_{k=1}^{K}\pi_k^{x_k}.$$

The Dirichlet distribution is defined for a random variable or parameter vector π such that $\pi_1,\ldots,\pi_K > 0$, $\pi_1,\ldots,\pi_K < 1$, and $\pi_1 + \pi_2 + \cdots + \pi_K = 1$, which is precisely the form of π used to define the categorical and multinomial distributions above. The Dirichlet distribution with parameters $\alpha_1,\ldots,\alpha_K > 0$, $K \ge 2$, is

$$P(\boldsymbol{\pi};\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{k=1}^{K}\pi_k^{\alpha_k-1},$$

where B(α), the multinomial beta function, serves as the normalization constant that ensures that the function integrates to one:

$$B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)},$$

where $\Gamma(\cdot)$ is the gamma function.

The Dirichlet distribution is useful because it can be used as a conjugate prior distribution for the categorical and multinomial distributions. Its mean (vector) is

$$\boldsymbol{\pi}_D = \frac{\boldsymbol{\alpha}}{\sum_{k=1}^{K}\alpha_k}.$$

And it generalizes the case of the Bernoulli distribution with a Beta prior. That is, it can be shown that if the traditional maximum likelihood estimate for the categorical distribution is given by $\boldsymbol{\pi}_{ML}$, then the posterior of a model consisting of a categorical likelihood and a Dirichlet prior has the form of a Dirichlet distribution with mean

$$\boldsymbol{\pi}^* = w\boldsymbol{\pi}_D + (1-w)\boldsymbol{\pi}_{ML},$$

where

$$w = \frac{\alpha_K}{\alpha_K + n}, \qquad \alpha_K = \sum_{k=1}^{K}\alpha_k,$$

and where n is the number of examples used to estimate $\boldsymbol{\pi}_{ML}$. The use of the posterior mean value $\boldsymbol{\pi}^*$ as the regularized or smoothed estimate to replace $\boldsymbol{\pi}_{ML}$ in a categorical probability model is therefore justified under Bayesian principles by the fact that the mean value of the posterior predictive distribution of a categorical model with a Dirichlet prior is equivalent to plugging the posterior mean parameters of the Dirichlet posterior into the categorical probability model, i.e.,

$$p(x\,|\,\mathcal{D}) = \int_{\boldsymbol{\pi}} \mathrm{Cat}(x\,|\,\boldsymbol{\pi})\,\mathrm{Dirichlet}(\boldsymbol{\pi}\,|\,\mathcal{D})\,d\boldsymbol{\pi} = \mathrm{Cat}(x;\boldsymbol{\pi}^*).$$

Again the intuitive notion of thinking of each of the elements αk of the parameter vector α for the Dirichlet as imaginary observations is justified under a Bayesian analysis.

Estimating the Parameters of a Discrete Distribution

Suppose we wish to estimate the parameters of a discrete probability distribution, of which the binary distribution is a special case. Let the probability of a variable being in category k be $\pi_k$, and write the parameters of the distribution as the length-K vector π. Encode each example using a one hot vector $\mathbf{x}_i$, i=1,…, N, which is all zero except for one dimension that corresponds to the observed category, where $x_{i,k}=1$. The probability of a dataset can be expressed as

$$P(\mathbf{x}_1,\ldots,\mathbf{x}_N;\boldsymbol{\pi}) = \prod_{i=1}^{N}\prod_{k=1}^{K}\pi_k^{x_{i,k}}.$$

If nk is the number of times that each class k in the data has been observed, the log-likelihood of the data is

$$\log P(n_1,\ldots,n_K;\boldsymbol{\pi}) = \sum_{k=1}^{K} n_k \log \pi_k.$$

To ensure that the parameter vector defines a valid probability, the log-likelihood is augmented with a term involving a Lagrange multiplier λ that enforces the constraint that the probabilities sum to one:

$$L = \sum_{k=1}^{K} n_k\log\pi_k + \lambda\left[1 - \sum_{k=1}^{K}\pi_k\right].$$

Taking the derivative of this function with respect to λ and setting the result to zero tells us that the sum over the probabilities in our model should be 1 (as desired). We then take the derivative of the function with respect to each parameter and set it to zero, which gives

$$\frac{\partial L}{\partial \pi_k} = 0 \;\;\Rightarrow\;\; n_k = \lambda\pi_k.$$

We can solve for λ by summing both sides over k:

$$\sum_{k=1}^{K} n_k = \lambda\sum_{k=1}^{K}\pi_k \;\;\Rightarrow\;\; \lambda = \sum_{k=1}^{K} n_k = N.$$

Therefore we can determine that the gradient of the augmented objective function is zero when

$$\pi_k = \frac{n_k}{N}.$$

This simple result should be in line with your intuition about how to estimate probabilities.

We discussed above how specifying a Dirichlet prior for the parameters can regularize the estimation problem and compute a smoothed probability $\pi_k^*$. The regularization can equivalently be viewed as imaginary data or counts $\alpha_k$ for each class k, to give an estimate

$$\pi_k^* = \frac{n_k + \alpha_k}{N + \alpha_K}, \qquad \alpha_K = \sum_{k=1}^{K}\alpha_k,$$

which can also be written

$$\pi_k^* = \left[\frac{\alpha_K}{N+\alpha_K}\right]\left(\frac{\alpha_k}{\alpha_K}\right) + \left[\frac{N}{N+\alpha_K}\right]\left(\frac{n_k}{N}\right).$$

This also follows from the analysis above that expressed the smoothed probability vector $\boldsymbol{\pi}^*$ as a weighted combination of the prior probability vector $\boldsymbol{\pi}_D$ and the maximum likelihood estimate $\boldsymbol{\pi}_{ML}$, $\boldsymbol{\pi}^* = w\boldsymbol{\pi}_D + (1-w)\boldsymbol{\pi}_{ML}$.
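
A short sketch (not from the text) of both estimates for example counts; the counts and the Dirichlet pseudo-counts are arbitrary:

import numpy as np

counts = np.array([5, 0, 3, 12])        # n_k: observed counts for K=4 categories
alpha = np.array([1.0, 1.0, 1.0, 1.0])  # Dirichlet pseudo-counts ("imaginary data")

N = counts.sum()
alpha_K = alpha.sum()

pi_ml = counts / N                               # pi_k = n_k / N
pi_star = (counts + alpha) / (N + alpha_K)       # smoothed estimate

# equivalently, a weighted combination of the prior mean and the MLE
w = alpha_K / (N + alpha_K)
print(np.allclose(pi_star, w * (alpha / alpha_K) + (1 - w) * pi_ml))  # True
print(pi_ml, pi_star)    # smoothing gives the zero-count category nonzero probability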

The Gaussian Distribution

The one-dimensional Gaussian probability distribution has the following form:

$$P(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],$$

where the parameters of the model are its mean μ and variance σ² (the standard deviation σ is simply the square root of the variance). Given N examples $x_i$, i=1,…, N, the maximum likelihood estimates of these parameters are

$$\mu = \frac{1}{N}\sum_{i=1}^{N}x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2.$$

When estimating the variance, the equation above is sometimes modified to use N–1 in place of N in the denominator to obtain an unbiased estimate, giving the standard deviation as

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\mu)^2},$$

especially with sample sizes less than 10. This is known as the (corrected) sample standard deviation.

The Gaussian distribution can be generalized from one to two dimensions, or indeed to any number of dimensions. Consider a two-dimensional model consisting of independent Gaussian distributions for each dimension, which is equivalent to a model with a diagonal covariance matrix when written using matrix notation. We can transform from scalar to matrix notation for a two-dimensional Gaussian distribution:

$$\begin{aligned}
P(x_1,x_2) &= \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\!\left[-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}\right]\frac{1}{\sqrt{2\pi}\,\sigma_2}\exp\!\left[-\frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right] \\
&= (2\pi)^{-1}(\sigma_1^2\sigma_2^2)^{-1/2}\exp\!\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\begin{bmatrix}\sigma_1^2 & 0 \\ 0 & \sigma_2^2\end{bmatrix}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} \\
&= (2\pi)^{-1}|\boldsymbol{\Sigma}|^{-1/2}\exp\!\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\},
\end{aligned}$$

where the covariance matrix of the model is given by Σ, the vector x=[x1 x2]T, and the mean vector μ=[μ1 μ2]T. This progression of equations is true because the inverse of a diagonal matrix is simply a diagonal matrix consisting of one over each of the original diagonal elements, which explains how the scalar notation converts to the matrix notation for an inverse covariance matrix. The covariance matrix is the matrix with this entry on row i and column j:

$$\Sigma_{ij} = \mathrm{cov}(x_i,x_j) = E[(x_i-\mu_i)(x_j-\mu_j)],$$

where $E[\cdot]$ refers to the expected value and $\mu_i = E[x_i]$. The mean can be computed in vector form:

$$\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i.$$

The equation for estimating a covariance matrix is

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T.$$

In general the multivariate Gaussian distribution can be written

$$P(x_1,x_2,\ldots,x_d) = (2\pi)^{-d/2}|\boldsymbol{\Sigma}|^{-1/2}\exp\!\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}.$$

When a variable is to be modeled with a Gaussian distribution with mean μ and covariance matrix Σ, it is common to write P(x)=N(x; μ, Σ). Notice the semicolon: this implies that the mean and covariance will be treated as parameters. In contrast, the “|” (or “given”) symbol is used when the parameters are treated as variables and their uncertainty is to be modeled. Treating parameters as random variables is popular in Bayesian techniques such as latent Dirichlet allocation.
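
A brief sketch (not from the text) of the maximum likelihood estimates and of evaluating the multivariate Gaussian density, using randomly generated example data:

import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[2.0, 0.6], [0.6, 1.0]], size=500)

mu = X.mean(axis=0)                              # (1/N) sum_i x_i
Sigma = (X - mu).T @ (X - mu) / len(X)           # (1/N) sum_i (x_i - mu)(x_i - mu)^T

def gaussian_pdf(x, mu, Sigma):
    # evaluates the d-dimensional Gaussian density N(x; mu, Sigma)
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(mu, Sigma)                                 # close to the generating parameters
print(gaussian_pdf(np.array([1.0, -1.0]), mu, Sigma))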

Useful Properties of Linear Gaussian Models

Consider a Gaussian random variable x with mean μ and covariance matrix A, $p(\mathbf{x}) = N(\mathbf{x};\boldsymbol{\mu},\mathbf{A})$, and a random variable y whose conditional distribution given x is Gaussian with mean Wx+b and covariance matrix B, $p(\mathbf{y}\,|\,\mathbf{x}) = N(\mathbf{y};\mathbf{W}\mathbf{x}+\mathbf{b},\mathbf{B})$. The marginal distribution of y and conditional distribution of x given y can be written

$$p(\mathbf{y}) = N(\mathbf{y};\,\mathbf{W}\boldsymbol{\mu}+\mathbf{b},\,\mathbf{B}+\mathbf{W}\mathbf{A}\mathbf{W}^T), \qquad p(\mathbf{x}\,|\,\mathbf{y}) = N\!\big(\mathbf{x};\,\mathbf{C}[\mathbf{W}^T\mathbf{B}^{-1}(\mathbf{y}-\mathbf{b})+\mathbf{A}^{-1}\boldsymbol{\mu}],\,\mathbf{C}\big),$$

respectively, where $\mathbf{C} = (\mathbf{A}^{-1}+\mathbf{W}^T\mathbf{B}^{-1}\mathbf{W})^{-1}$.
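
These expressions can be checked by simulation. The following Monte Carlo sketch (not from the text) samples from the model with arbitrary example values of μ, A, W, b, and B and compares the empirical mean and covariance of y with Wμ+b and B+WAW^T:

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 0.0])
A = np.array([[1.0, 0.3], [0.3, 0.5]])      # covariance of x
W = np.array([[2.0, 0.0], [1.0, -1.0]])
b = np.array([0.5, 0.5])
B = np.array([[0.2, 0.0], [0.0, 0.1]])      # conditional covariance of y given x

x = rng.multivariate_normal(mu, A, size=200000)
y = x @ W.T + b + rng.multivariate_normal(np.zeros(2), B, size=200000)

print(y.mean(axis=0), W @ mu + b)                 # empirical mean vs W mu + b
print(np.cov(y.T, bias=True), B + W @ A @ W.T)    # empirical covariance vs B + W A W^T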

Probabilistic PCA and the Eigenvectors of a Covariance Matrix

When explaining principal component analysis in Section 9.6 we discussed the idea of diagonalizing a covariance matrix Σ and formulated this in terms of finding a matrix of eigenvectors Φ such that $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$, a diagonal matrix. The same objective could be formulated as finding a factorization of the covariance matrix such that $\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^T$. Recall that in our presentation of probabilistic PCA in Chapter 9 we saw that the marginal probability for P(x) under principal component analysis involves a covariance matrix given by $\boldsymbol{\Sigma} = \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$. Therefore, when $\sigma^2 \to 0$ we can see that if $\mathbf{W} = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{1/2}$ we would have precisely the same W that one could obtain from matrix factorization methods based on eigendecomposition. Importantly, for $\sigma^2 > 0$ it can be shown that maximum likelihood learning will produce Ws that are not in general orthogonal (Tipping and Bishop, 1999a, 1999b); however, some more recent work has shown how to impose orthogonality constraints during a maximum likelihood–based optimization procedure.

The Exponential Family of Distributions

The exponential family of distributions includes Gaussian, Bernoulli, Binomial, Beta, Gamma, Categorical, Multinomial, Dirichlet, Chi-squared, Exponential and Poisson, among many others. In addition to their commonly used forms, these distributions can all be written in the standardized exponential family form that makes them easy to work with algebraically:

$$p(x) = h(x)\exp\!\left[\boldsymbol{\theta}^T\mathbf{T}(x) - A(\boldsymbol{\theta})\right],$$

where θ is a vector of natural parameters, T(x) is a vector of sufficient statistics, A(θ) is known as the cumulant generating function, and h(x) is an additional function of x. As an example, for the 1D Gaussian distribution these quantities are $\boldsymbol{\theta} = [\mu/\sigma^2 \;\; -1/(2\sigma^2)]^T$, $\mathbf{T}(x) = [x \;\; x^2]^T$, $h(x) = 1/\sqrt{2\pi}$, and $A(\boldsymbol{\theta}) = \mu^2/(2\sigma^2) + \ln|\sigma|$.
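
A small sketch (not from the text) confirming that this exponential-family form reproduces the ordinary 1D Gaussian density for example parameter values:

import numpy as np

mu, sigma = 1.5, 0.8
x = np.linspace(-3, 5, 7)

theta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])   # natural parameters
T = np.vstack([x, x**2])                                    # sufficient statistics
h = 1.0 / np.sqrt(2 * np.pi)
A = mu**2 / (2 * sigma**2) + np.log(sigma)                  # cumulant generating function

p_expfam = h * np.exp(theta @ T - A)
p_gauss = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.allclose(p_expfam, p_gauss))                        # True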

Variational Methods and the EM Algorithm

With a complex probability model for which the posterior distribution cannot be computed exactly, a method called variational EM can be used. This involves manipulating approximations to the model’s true posterior distribution during an EM optimization procedure. The following variational analysis also helps to show why and how the EM algorithm involving exact posterior distributions works.

Before we begin, when using variational methods with approximate distributions it is helpful to make a distinction between the parameters used to build an approximation to the true posterior distribution and the parameters of the original model. Consider a probability model with a set of hidden variables H and a set of observed variables X. The observed values are given by $\tilde{X}$. Let $p = p(H\,|\,\tilde{X};\theta)$ be the model's exact posterior distribution, and $q = q(H\,|\,\tilde{X};\Phi)$ be a variational approximation, with a set Φ of variational parameters.

To understand how variational methods are used in practice, we first examine the well-known “variational bound.” This is created using two tricks. The first is to divide and multiply by the same quantity; the second is to apply an inequality known as Jensen’s inequality. These allow the construction of a variational lower bound L(q) on the log marginal likelihood:

$$\begin{aligned}
\log p(\tilde{X};\theta) &= \log\sum_{H} p(\tilde{X},H;\theta) = \log\sum_{H} q(H\,|\,\tilde{X};\Phi)\,\frac{p(\tilde{X},H;\theta)}{q(H\,|\,\tilde{X};\Phi)} \\
&\ge \sum_{H} q(H\,|\,\tilde{X};\Phi)\log\frac{p(\tilde{X},H;\theta)}{q(H\,|\,\tilde{X};\Phi)} = E\big[\log P(\tilde{X},H;\theta)\big]_q + H(q) = \mathcal{L}(q).
\end{aligned}$$

Here, H(q) is the entropy of q, which is

$$H(q) = -\sum_{H} q(H\,|\,\tilde{X};\Phi)\log q(H\,|\,\tilde{X};\Phi).$$

The bound L(q) becomes an equality when q=p. In the case of “exact” EM, this confirms that each M-step will increase the likelihood of the data. However, to make the lower bound tight again in preparation for the next M-step, the new exact posterior must be recomputed with the updated parameters as a part of the subsequent E-step.

When q is merely an approximation to p, the relationship between the marginal log-likelihood and the expected log-likelihood under distribution q can be written with an equality as opposed to an inequality:

$$\log P(\tilde{X};\theta) = E\big[\log P(\tilde{X},H;\theta)\big]_q + H(q) + D_{KL}(q\,\|\,p) = \mathcal{L}(q) + D_{KL}(q\,\|\,p).$$

$D_{KL}(q\,\|\,p)$ is known as the Kullback–Leibler (KL) divergence, a measure of the distance between distributions q and p. It is not a true distance in the mathematical sense, but rather a quantity that is always greater than or equal to zero and becomes zero only when q=p. Here it is given by

$$D_{KL}(q\,\|\,p) = \sum_{H} q(H\,|\,\tilde{X};\Phi)\log\frac{q(H\,|\,\tilde{X};\Phi)}{p(H\,|\,\tilde{X};\theta)}.$$

The difference between the log marginal likelihood and the variational bound is given by the KL divergence between the approximate q and the true p. This means that if q is approximate, the bound can be tightened by improving the quality of the approximation q to the true posterior p. So, as we also saw above, when q is not an approximation but equals p exactly, $D_{KL}(q\,\|\,p) = 0$ and

$$\log P(\tilde{X};\theta) = E\big[\log P(\tilde{X},H;\theta)\big]_q + H(q).$$

Variational inference techniques are often used to improve the quality of an approximate posterior distribution within an EM algorithm, and the term “variational EM” refers to this general method. However, the result of a variational inference procedure is sometimes useful in itself. A key feature of variational methods arises from the existence of the variational bound and the fact that algorithms can be formulated that iteratively bring q closer to p in the sense of the KL divergence.

The mean-field approach is one of the simplest variational methods. It minimizes the KL divergence between an approximation, which consists of giving each variable its own separate variational distribution (and parameters), and the true posterior distribution. This is known as a "fully factored variational approximation" and could be written

$$q(H\,|\,\tilde{X};\Phi) = \prod_{j} q_j(h_j\,|\,\tilde{X};\phi_j).$$

Given some initial parameters for the separate distributions for each variable qj=qj(hj), one proceeds to update each variable iteratively, given expectations of the model under the current variational approximation for the other variables. These updates take this general form:

$$q_j(h_j\,|\,\tilde{X};\phi_j) = \frac{1}{Z}\exp\!\Big(E\big[\log P(X,H;\theta)\big]_{\prod_{i\ne j} q_i(h_i)}\Big),$$

where the expectation is performed using the approximate qs for all variables hi other than hj, and Z is a normalization constant obtained by summing over the numerator for all values of hj.

Early work on variational methods for graphical models is well represented in Jordan, Ghahramani, Jaakkola, and Saul (1999). If distributions are placed over parameters as well as hidden variables, variational Bayesian methods and variational Bayesian EM can be used to perform more fully Bayesian learning (Ghahramani and Beal, 2001). Winn and Bishop (2005) gives a good comparison of belief propagation and variational inference methods when viewed as message passing algorithms. The textbooks of Bishop (2006) and Koller and Friedman (2009) provide further detail and more advanced machine learning techniques based on the variational perspective.
