Appendix A

Theoretical foundations

A.1 Matrix Algebra

Basic Manipulations and Properties

A column vector x with d dimensions can be written

$$\mathbf{x} \equiv \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_d \end{bmatrix}^T,$$

where the transpose operator, superscript T, allows it to be written as a transposed row vector, which is useful when defining vectors in running text. In this book, vectors are assumed to be column vectors.

The transpose AT of a matrix A involves copying all the rows of the original matrix A into the columns of AT. Thus a matrix with m rows and n columns becomes a matrix with n rows and m columns:

$$\mathbf{A} \equiv \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \qquad \mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}.$$

The dot product or inner product between vector x and vector y of the same dimensionality yields a scalar quantity,

$$\mathbf{x} \cdot \mathbf{y} = \langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^{d} x_i y_i.$$

For example, the Euclidean norm can be written as the square root of the dot product of a vector $\mathbf{x}$ with itself, $\|\mathbf{x}\|_2 = \sqrt{\mathbf{x}^T\mathbf{x}}$.

The tensor product or outer product ⊗ between an m-dimensional vector x and an n-dimensional vector y yields a matrix,

$$\mathbf{x} \otimes \mathbf{y} \equiv \mathbf{x}\mathbf{y}^T = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}.$$

Given an N row and K column matrix A, and a K row and M column matrix B, if we write each row of A as $\mathbf{a}_n^T$ and each column of B as $\mathbf{b}_m$, the matrix product AB (i.e., matrix multiplication) can be written

$$\mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}_1^T \\ \mathbf{a}_2^T \\ \vdots \\ \mathbf{a}_N^T \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_M \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^T\mathbf{b}_1 & \mathbf{a}_1^T\mathbf{b}_2 & \cdots & \mathbf{a}_1^T\mathbf{b}_M \\ \mathbf{a}_2^T\mathbf{b}_1 & \mathbf{a}_2^T\mathbf{b}_2 & \cdots & \mathbf{a}_2^T\mathbf{b}_M \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}_N^T\mathbf{b}_1 & \mathbf{a}_N^T\mathbf{b}_2 & \cdots & \mathbf{a}_N^T\mathbf{b}_M \end{bmatrix}.$$

If we write each column of A as $\mathbf{a}_k$ and each row of B as $\mathbf{b}_k^T$, the matrix product AB can also be written in terms of tensor products

$$\mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_K \end{bmatrix} \begin{bmatrix} \mathbf{b}_1^T \\ \mathbf{b}_2^T \\ \vdots \\ \mathbf{b}_K^T \end{bmatrix} = \sum_{k=1}^{K} \mathbf{a}_k \mathbf{b}_k^T.$$

The elementwise product or Hadamard product of two matrices that are of the same size is

$$\mathbf{A}\circ\mathbf{B} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \circ \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} & a_{12}b_{12} & \cdots & a_{1n}b_{1n} \\ a_{21}b_{21} & a_{22}b_{22} & \cdots & a_{2n}b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}b_{m1} & a_{m2}b_{m2} & \cdots & a_{mn}b_{mn} \end{bmatrix}.$$

A square matrix A with n rows and n columns is invertible if there exists a square matrix B=A−1 such that AB=BA=I, where I is the identity matrix consisting of all zeros except that all diagonal elements are one. Square matrices that are not invertible are called singular. A square matrix A is singular if and only if its determinant, det(A), is zero. The following equation for the inverse shows why a zero determinant implies that a matrix is not invertible:

$$\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}\,\mathbf{C}^T,$$

where det(A) is the determinant of A and C is another matrix known as the cofactor matrix. Finally, if a matrix A is orthogonal, then $\mathbf{A}^{-1} = \mathbf{A}^T$.
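
As an illustration (not part of the original text), the manipulations above can be checked numerically with NumPy; the vectors and matrices here are arbitrary examples:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 4.0], [2.0, 5.0]])

print(x @ y)                      # dot (inner) product  x^T y
print(np.outer(x, y))             # tensor (outer) product  x y^T
print(A.T)                        # transpose
print(A @ B)                      # matrix product
# matrix product as a sum of outer products of columns of A and rows of B
print(sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1])))
print(A * B)                      # elementwise (Hadamard) product
print(np.linalg.det(A))           # determinant; nonzero, so A is invertible
print(np.linalg.inv(A) @ A)       # A^{-1} A = I (up to rounding error)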

Derivatives of Vector and Scalar Functions

Given a scalar function y of an m-dimensional column vector x,

$$\frac{\partial y}{\partial \mathbf{x}} \equiv \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_m} \end{bmatrix} = \mathbf{g}.$$

This quantity is known as the gradient, g. We have defined it here as a column vector, but it is sometimes defined as a row vector. Defining the gradient as a column vector implies certain orientations for the other quantities defined below, so keep in mind that the derivatives that follow are sometimes defined as the transposes of those given here. Using the definition and orientation above we can write the types of parameter updates frequently used in algorithms like gradient descent in vector form with expressions such as $\boldsymbol{\theta}_{\text{new}} = \boldsymbol{\theta}_{\text{old}} - \eta\,\mathbf{g}$, where $\boldsymbol{\theta}$ is a parameter (column) vector and $\eta$ is a learning rate.

Given a scalar x and n-dimensional vector function y,

$$\frac{\partial \mathbf{y}}{\partial x} \equiv \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_n}{\partial x} \end{bmatrix}.$$

For an m-dimensional vector x and an n-dimensional vector y, the Jacobian matrix is given by

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \equiv \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_m} & \frac{\partial y_2}{\partial x_m} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix}.$$

The Jacobian is sometimes defined as the transpose of this quantity, even given the other definitions above. Watch out for the implications. The derivative of a scalar function y=f(X) with respect to an m×n dimensional matrix X is known as a gradient matrix and is given by

$$\frac{\partial f}{\partial \mathbf{X}} \equiv \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \cdots & \frac{\partial y}{\partial x_{1n}} \\ \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{m1}} & \frac{\partial y}{\partial x_{m2}} & \cdots & \frac{\partial y}{\partial x_{mn}} \end{bmatrix} = \mathbf{G}.$$

Our choice for the orientations of these quantities means that the gradient matrix has the same layout as the original matrix, so updates to a parameter matrix X take the form $\mathbf{X}_{\text{new}} = \mathbf{X}_{\text{old}} - \eta\,\mathbf{G}$.
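
As a hedged illustration (not from the text), the following sketch approximates the gradient g of an example scalar function by finite differences and applies the update written above; the function, starting point, and learning rate are arbitrary choices:

import numpy as np

def f(theta):
    # an example scalar function of a parameter vector
    return np.sum((theta - np.array([1.0, -2.0, 0.5])) ** 2)

def numerical_gradient(f, theta, eps=1e-6):
    # finite-difference approximation of g = df/dtheta (same shape as theta)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = np.zeros(3)
eta = 0.1                                   # learning rate
for _ in range(100):
    theta = theta - eta * numerical_gradient(f, theta)   # theta_new = theta_old - eta * g
print(theta)   # approaches the minimizer [1, -2, 0.5]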

While many quantities can be expressed as scalars, vectors, or matrices, there are many that cannot. Inspired by the tabular visualization of Minka (2000), the scalar, vector, matrix, and tensor quantities resulting from taking the derivatives of different combinations of quantities are shown in Table A.1.

Table A.1

Quantities That Result From Various Derivatives (After Minka, 2000)

                 Scalar f                          Vector f                             Matrix F
Scalar x    Scalar: ∂f/∂x = g                Vector: ∂f/∂x ≡ [∂f_i/∂x] = g^T       Matrix: ∂F/∂x ≡ [∂f_ij/∂x] = G^T
Vector x    Vector: ∂f/∂x ≡ [∂f/∂x_i] = g    Matrix: ∂f/∂x ≡ [∂f_i/∂x_j] = G       Tensor: ∂F/∂x ≡ [∂f_ij/∂x_k]
Matrix X    Matrix: ∂f/∂X ≡ [∂f/∂x_ij] = G   Tensor: ∂f/∂X ≡ [∂f_i/∂x_jk]          Tensor: ∂F/∂X ≡ [∂f_ij/∂x_kl]

The Chain Rule

The chain rule for a function z, which is a function of y, which is a function of x, all of which are scalars, is

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x},$$

where the two terms could be reversed, because multiplication is commutative. Now, given an m-dimensional vector x, an n-dimensional vector y, and an o-dimensional vector z, if z=z(y(x)), then

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} \equiv \begin{bmatrix} \frac{\partial z_1}{\partial x_1} & \frac{\partial z_2}{\partial x_1} & \cdots & \frac{\partial z_o}{\partial x_1} \\ \frac{\partial z_1}{\partial x_2} & \frac{\partial z_2}{\partial x_2} & \cdots & \frac{\partial z_o}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_1}{\partial x_m} & \frac{\partial z_2}{\partial x_m} & \cdots & \frac{\partial z_o}{\partial x_m} \end{bmatrix},$$

where each entry in this m×o matrix can be computed using

$$\frac{\partial z_i}{\partial x_j} = \sum_{k=1}^{n} \frac{\partial y_k}{\partial x_j}\frac{\partial z_i}{\partial y_k} = \left[\frac{\partial \mathbf{y}}{\partial x_j}\right]\left[\frac{\partial z_i}{\partial \mathbf{y}}\right].$$

The vector form could be viewed as

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_m} & \frac{\partial y_2}{\partial x_m} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix} \begin{bmatrix} \frac{\partial z_1}{\partial y_1} & \frac{\partial z_2}{\partial y_1} & \cdots & \frac{\partial z_o}{\partial y_1} \\ \frac{\partial z_1}{\partial y_2} & \frac{\partial z_2}{\partial y_2} & \cdots & \frac{\partial z_o}{\partial y_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_1}{\partial y_n} & \frac{\partial z_2}{\partial y_n} & \cdots & \frac{\partial z_o}{\partial y_n} \end{bmatrix}, \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial \mathbf{z}}{\partial \mathbf{y}},$$

which yields the chain rule for vectors, where chains extend toward the left as opposed to the right as is often done with the scalar version. For the special case when the final function evaluates to a scalar, which is frequently encountered when optimizing a loss function, we have

$$\frac{\partial z}{\partial x_j} = \sum_{k=1}^{n} \frac{\partial y_k}{\partial x_j}\frac{\partial z}{\partial y_k}, \qquad \frac{\partial z}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial z}{\partial \mathbf{y}}.$$

The rule generalizes, so that if there were yet another vector function w that is a function of x, through z, then

$$\frac{\partial \mathbf{w}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial \mathbf{z}}{\partial \mathbf{y}}\frac{\partial \mathbf{w}}{\partial \mathbf{z}}.$$

The chain rule also generalizes to the case of a function of a matrix that is itself a function of another matrix. For example, for a matrix X, if matrix Y=f(X), the derivative of a function g(Y) is

$$\frac{\partial g(\mathbf{Y})}{\partial \mathbf{X}} = \frac{\partial g(f(\mathbf{X}))}{\partial \mathbf{X}}, \qquad \frac{\partial g(\mathbf{Y})}{\partial x_{ij}} = \sum_{k=1}^{K}\sum_{l=1}^{L} \frac{\partial g(\mathbf{Y})}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}}.$$

Computation Graphs and Backpropagation

Computation networks help to show how the gradients required for deep learning with backpropagation can be computed. They also provide the basis for many software packages for deep learning, which partially or fully automate the computations involved.

We begin with an example that computes intermediate quantities that are scalars, and then extend it to networks involving vectors that represent entire layers of variables at each node. Fig. A.1 gives a computation graph that implements the function $z_1(y_1, z_2(y_2(y_1), z_3(y_3(y_2(y_1)))))$, and shows how to compute the gradients. The chain rule for a scalar function a involving intermediate results $b_1,\ldots,b_K$ that are dependent on c is

$$\frac{\partial a(b_1,\ldots,b_K)}{\partial c} = \sum_{k=1}^{K} \frac{\partial a}{\partial b_k}\frac{\partial b_k}{\partial c}.$$

Figure A.1 Decomposing partial derivatives using a computation graph.

In the example, the partial derivative of z1 with respect to y1 therefore consists of three terms

$$\begin{aligned}\frac{\partial z_1}{\partial y_1} &= \underbrace{\frac{\partial z_1}{\partial y_1}}_{(1)} + \underbrace{\frac{\partial z_1}{\partial z_2}\frac{\partial z_2}{\partial y_2}\frac{\partial y_2}{\partial y_1}}_{(2)} + \underbrace{\frac{\partial z_1}{\partial z_2}\frac{\partial z_2}{\partial z_3}\frac{\partial z_3}{\partial y_3}\frac{\partial y_3}{\partial y_2}\frac{\partial y_2}{\partial y_1}}_{(3)} \\ &= \frac{\partial z_1}{\partial y_1} + \frac{\partial z_1}{\partial z_2}\left[\frac{\partial z_2}{\partial y_2} + \frac{\partial z_2}{\partial z_3}\frac{\partial z_3}{\partial y_3}\frac{\partial y_3}{\partial y_2}\right]\frac{\partial y_2}{\partial y_1}.\end{aligned}$$

The sums needed to compute this involve following back along the flows of the subcomputations that were performed to evaluate the original function. These can be implemented efficiently by passing them between nodes in the graph, as Fig. A.1 shows.

This high-level notion of following flows in a graph generalizes to deep networks involving entire layers of variables. If Fig. A.1 were drawn using a scalar for z1 and vectors for each of the other nodes, the partial derivatives could be replaced by their vector versions. It is necessary to reverse the order of the multiplications, because in the case of partial derivatives of vectors with respect to vectors, computations grow to the left, yielding

$$\frac{\partial z_1}{\partial \mathbf{y}_1} = \frac{\partial z_1}{\partial \mathbf{y}_1} + \frac{\partial \mathbf{y}_2}{\partial \mathbf{y}_1}\left[\frac{\partial \mathbf{z}_2}{\partial \mathbf{y}_2} + \frac{\partial \mathbf{y}_3}{\partial \mathbf{y}_2}\frac{\partial \mathbf{z}_3}{\partial \mathbf{y}_3}\frac{\partial \mathbf{z}_2}{\partial \mathbf{z}_3}\right]\frac{\partial z_1}{\partial \mathbf{z}_2}.$$

To see this, consider (1) the chain rule for the partial derivative of a scalar function z with a vector x as argument, but involving the computation of an intermediate vector y:

$$\frac{\partial z(\mathbf{y})}{\partial x_j} = \sum_{k=1}^{n} \frac{\partial y_k}{\partial x_j}\frac{\partial z}{\partial y_k}, \qquad \frac{\partial z}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\frac{\partial z}{\partial \mathbf{y}} = \mathbf{D}\mathbf{d},$$

and (2) the chain rule for the same scalar function z(x), but involving the computation of an intermediate matrix Y:

$$\frac{\partial z(\mathbf{Y})}{\partial \mathbf{x}} = \sum_{l=1}^{L} \frac{\partial \mathbf{y}_l}{\partial \mathbf{x}}\frac{\partial z}{\partial \mathbf{y}_l} = \sum_{l=1}^{L} \mathbf{D}_l \mathbf{d}_l.$$

Derivatives of Functions of Vectors and Matrices

Here are some useful derivatives for functions of vectors and matrices. Petersen and Pedersen (2012) gives an even larger list.

$$\begin{aligned}
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{A}\mathbf{x} &= \mathbf{A}^T \\
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T\mathbf{x} &= 2\mathbf{x} \\
\frac{\partial}{\partial \mathbf{a}}\,\mathbf{a}^T\mathbf{x} &= \frac{\partial}{\partial \mathbf{a}}\,\mathbf{x}^T\mathbf{a} = \mathbf{x} \\
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T\mathbf{A}\mathbf{x} &= \mathbf{A}\mathbf{x} + \mathbf{A}^T\mathbf{x} \\
\frac{\partial}{\partial \mathbf{A}}\,\mathbf{y}^T\mathbf{A}\mathbf{x} &= \mathbf{y}\mathbf{x}^T \\
\frac{\partial}{\partial \mathbf{x}}\,(\mathbf{a}-\mathbf{x})^T(\mathbf{a}-\mathbf{x}) &= -2(\mathbf{a}-\mathbf{x})
\end{aligned}$$

Notice that the first identity above would be equal to simply A, had we defined the Jacobian to be the transpose of our definition above.

For a symmetric matrix C (e.g., an inverse covariance matrix),

$$\begin{aligned}
\frac{\partial}{\partial \mathbf{a}}\,(\mathbf{a}-\mathbf{b})^T\mathbf{C}(\mathbf{a}-\mathbf{b}) &= 2\mathbf{C}(\mathbf{a}-\mathbf{b}) \\
\frac{\partial}{\partial \mathbf{b}}\,(\mathbf{a}-\mathbf{b})^T\mathbf{C}(\mathbf{a}-\mathbf{b}) &= -2\mathbf{C}(\mathbf{a}-\mathbf{b}) \\
\frac{\partial}{\partial \mathbf{w}}\,(\mathbf{y}-\mathbf{A}\mathbf{w})^T\mathbf{C}(\mathbf{y}-\mathbf{A}\mathbf{w}) &= -2\mathbf{A}^T\mathbf{C}(\mathbf{y}-\mathbf{A}\mathbf{w})
\end{aligned}$$

Vector Taylor Series Expansion, Second-Order Methods, and Learning Rates

The method of gradient descent, the interpretation of learning rates, and more sophisticated second-order methods can be viewed through the lens of the Taylor series expansion of a function. The approach presented below is also known as Newton’s method.

The Taylor expansion of a function near point xo can be written

$$f(x) = f(x_o) + \frac{f'(x_o)}{1!}(x-x_o) + \frac{f''(x_o)}{2!}(x-x_o)^2 + \frac{f^{(3)}(x_o)}{3!}(x-x_o)^3 + \cdots$$

Using the approximation up to the second-order (squared) terms in x, taking the derivative, setting the result to zero, and solving for x, gives

$$\begin{aligned}
0 &= \frac{d}{dx}\left[f(x_o) + f'(x_o)(x-x_o) + \frac{f''(x_o)}{2}(x-x_o)^2\right] = f'(x_o) + f''(x_o)(x-x_o), \\
&\text{thus, solving for } \delta x \equiv (x-x_o), \\
\delta x &= -\frac{f'(x_o)}{f''(x_o)}, \quad\text{or}\quad x = x_o - \frac{f'(x_o)}{f''(x_o)}.
\end{aligned}$$

This generalizes to the vector version of a Taylor series for a scalar function with vector arguments, where

$$f(\boldsymbol{\theta}) = f(\boldsymbol{\theta}_o) + \mathbf{g}_o^T(\boldsymbol{\theta}-\boldsymbol{\theta}_o) + \tfrac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_o)^T\mathbf{H}_o(\boldsymbol{\theta}-\boldsymbol{\theta}_o) + \cdots, \qquad \mathbf{g}_o = \frac{df}{d\boldsymbol{\theta}_o}, \quad \mathbf{H}_o = \frac{d}{d\boldsymbol{\theta}_o}\frac{df}{d\boldsymbol{\theta}_o} = \frac{d^2 f}{d\boldsymbol{\theta}_o^2}.$$

Using the identities given in the previous section, taking the derivative with respect to the parameter vector θ, setting it to zero, and then solving for the point at which the derivative of the quadratic approximation to the function is zero, yields

$$\frac{df(\boldsymbol{\theta})}{d\boldsymbol{\theta}} = \mathbf{0} = \mathbf{g}_o + \mathbf{H}_o(\boldsymbol{\theta}-\boldsymbol{\theta}_o), \qquad \Delta\boldsymbol{\theta} \equiv (\boldsymbol{\theta}-\boldsymbol{\theta}_o), \qquad \Delta\boldsymbol{\theta} = -\mathbf{H}_o^{-1}\mathbf{g}_o.$$

This means that for the update $\boldsymbol{\theta}_{\text{new}} = \boldsymbol{\theta}_o - \mathbf{H}_o^{-1}\mathbf{g}_o$, the learning rate used for gradient descent can be thought of in terms of a simple diagonal matrix approximation to the inverse Hessian matrix in a second-order method. In other words, using a simple learning rate is analogous to making the approximation $\mathbf{H}_o^{-1} = \eta\mathbf{I}$.
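
A brief sketch (not from the text) of the difference between the two kinds of update, for an assumed quadratic objective with known Hessian: the Newton step reaches the minimizer in one update, while gradient descent with a scalar learning rate takes many small steps:

import numpy as np

H_true = np.array([[3.0, 0.5], [0.5, 1.0]])      # positive definite Hessian
theta_star = np.array([1.0, -2.0])               # minimizer of the quadratic

def gradient(theta):
    return H_true @ (theta - theta_star)

theta = np.zeros(2)
# Newton step: theta_new = theta - H^{-1} g (solve avoids an explicit inverse)
theta_newton = theta - np.linalg.solve(H_true, gradient(theta))
print(theta_newton)                              # [1, -2] in a single step

# gradient descent with a scalar learning rate, i.e. H^{-1} approximated by eta*I
eta = 0.1
for _ in range(200):
    theta = theta - eta * gradient(theta)
print(theta)                                     # also approaches [1, -2], but slowly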

Full-blown second-order methods take more effective steps at each iteration. However, they can be expensive because of the need to compute this quantity. For convex problems like logistic regression, a popular second-order method known as L-BFGS builds an approximation to the Hessian. Here, L stands for limited memory and BFGS stands for the inventors of the approach, Broyden–Fletcher–Goldfarb–Shanno. Another family of approaches is known as the conjugate gradient algorithms; these involve working with the linear system of equations associated with $\mathbf{H}_o\mathbf{x} = -\mathbf{g}_o$ when solving for $\mathbf{x} = \Delta\boldsymbol{\theta}$, as opposed to computing the inverse of $\mathbf{H}_o$.

One needs to keep in mind that when solving nonconvex problems (e.g., learning for multilayer neural networks) the Hessian is not guaranteed to be positive definite, which means that it may not even be invertible. Consequently, the use of heuristic adaptive learning rates and momentum terms remains popular and effective for neural network methods.

Eigenvectors, Eigenvalues, and Covariance Matrices

There is a strong connection between eigenvectors, eigenvalues and diagonalization of a covariance matrix, and the method of principal component analysis. If λ is a scalar eigenvalue of a matrix A, there exists a vector x called an eigenvector of A such that $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$. Define a matrix Φ to consist of eigenvectors in each column, and define Λ as a matrix with the corresponding eigenvalues on the diagonal; then the matrix equation $\mathbf{A}\boldsymbol{\Phi} = \boldsymbol{\Phi}\boldsymbol{\Lambda}$ defines the eigenvalues and eigenvectors of A.

Many numerical linear algebra software packages (e.g., Matlab) can determine solutions to this equation. If the eigenvectors in Φ are orthogonal, which they are for symmetric matrices, the inverse of Φ equals its transpose, which implies that we could equally well write $\boldsymbol{\Phi}^T\mathbf{A}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$. To find the eigenvectors of a covariance matrix, set A=Σ, the covariance matrix. This yields the definition of the eigenvectors of a covariance matrix Σ as the set of orthogonal vectors, stored in a matrix Φ and normalized to have unit length, such that $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$. Since Λ is a diagonal matrix of eigenvalues, the operation $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi}$ has been used to diagonalize the covariance matrix.
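
As a concrete illustration (not from the text), the following NumPy sketch diagonalizes an example covariance matrix; numpy.linalg.eigh is used because the matrix is symmetric:

import numpy as np

Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])                 # an example covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)           # eigh handles symmetric matrices
Lambda = np.diag(eigvals)

print(np.allclose(Phi.T @ Sigma @ Phi, Lambda))   # True: Phi^T Sigma Phi = Lambda
print(np.allclose(Phi @ Lambda @ Phi.T, Sigma))   # True: Sigma = Phi Lambda Phi^T
print(np.allclose(Phi.T @ Phi, np.eye(2)))        # True: Phi^{-1} = Phi^T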

While these results may seem esoteric at first glance, their use is widespread. For example, in computer vision an eigenanalysis-based principal component analysis for face recognition yields what are known as “eigenfaces.” The general technique is widely used in numerous other contexts, and classic, well-cited eigenanalysis-based papers appear in many diverse fields.

The Singular Value Decomposition

The singular value decomposition is a type of matrix factorization that is widely used in data mining and machine learning settings and is implemented as a core routine in many numerical linear algebra packages. It decomposes a matrix X into the product of three matrices such that $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, where U has orthogonal columns, S is a diagonal matrix containing the singular values (normally) sorted along the diagonal, and V also has orthogonal columns. By keeping only the k largest singular values, this factorization allows the data matrix to be reconstructed in a way that is optimal in a least squares sense for each value of k. For any given k, we could therefore write $\mathbf{X} \approx \mathbf{U}_k\mathbf{S}_k\mathbf{V}_k^T$. Fig. 9.10 illustrates how this works visually.

In our discussion earlier on eigendecompositions we developed an expression for diagonalizing a covariance matrix Σ using $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$, where Φ holds the eigenvectors and Λ is a diagonal matrix of eigenvalues. This amounts to seeking a decomposition of the covariance matrix that factorizes into $\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^T$. This reveals the relationship between principal component analysis and the singular value decomposition applied to data stored in the columns of matrix X. We will use the fact that the covariance matrix Σ for mean-centered data stored as vectors in the columns of X is simply $\boldsymbol{\Sigma} = \mathbf{X}\mathbf{X}^T$. Since orthogonal matrices have the property that $\mathbf{U}\mathbf{U}^T = \mathbf{I}$, through the following substitution we can see that the matrix U of left singular vectors of X corresponds to the eigenvectors of the covariance matrix. In other words, to factorize a covariance matrix into $\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^T$, mean center the data and perform a singular value decomposition on X. Then the covariance matrix is $\boldsymbol{\Sigma} = \mathbf{X}\mathbf{X}^T = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V}\mathbf{S}^T\mathbf{U}^T = \mathbf{U}\mathbf{S}^2\mathbf{U}^T$, so $\mathbf{U} = \boldsymbol{\Phi}$ and $\mathbf{S} = \boldsymbol{\Lambda}^{1/2}$; in other words, the so-called singular values are the square roots of the eigenvalues.
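
The relationship can be checked numerically. The following sketch (not from the text) mean centers a random data matrix with one example per column, computes its singular value decomposition, and compares the result with the eigendecomposition of XX^T:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
X = X - X.mean(axis=1, keepdims=True)          # mean center each row (dimension)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = X @ X.T                                # (unnormalized) covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)
# eigh returns eigenvalues in ascending order; flip to match the sorted singular values
eigvals, Phi = eigvals[::-1], Phi[:, ::-1]

print(np.allclose(s ** 2, eigvals))            # singular values squared = eigenvalues
print(np.allclose(np.abs(U), np.abs(Phi)))     # same vectors, up to sign and rounding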

A.2 Fundamental Elements of Probabilistic Methods

Expectations

The expectation of a discrete random variable X is

$$E[X] = \sum_{x} x\,P(X=x),$$

where the sum is over all possible values for X. The conditional expectation for discrete random variable X given random variable Y=y has a similar form

$$E[X\,|\,Y=y] = \sum_{x} x\,P(X=x\,|\,Y=y).$$

Given a probability density function p(x) for a continuous random variable X,

$$E[X] = \int x\,p(x)\,dx.$$

The empirical expectation of a continuous-valued variable X is obtained by placing a Dirac delta function on each empirical observation or example, and normalizing by the number of examples, to define p(x). The expected value of a matrix is defined as a matrix of expected values.

The expectation of a function of a continuous random variable X and a discrete random variable Y is

$$E[f(X,Y)] = \sum_{y} \int f(x,y)\,p(x, Y=y)\,dx.$$

Expectations of sums of random variables are equal to sums of expectations, or

$$E[X+Y] = E[X] + E[Y].$$

If there is a scaling factor s and a bias or constant c, then

$$E[sX+c] = sE[X] + c.$$

The variance is defined as

$$\begin{aligned}\mathrm{Var}[X] &= \sum_{x}(x-E[X])^2 P(X=x) = E[(X-E[X])(X-E[X])] \\ &= E[X^2 - 2XE[X] + (E[X])^2] = E[X^2] - 2E[X]E[X] + (E[X])^2 = E[X^2] - E[X]^2.\end{aligned}$$

The expectation of the product of continuous random variables X and Y with joint probability p(x,y) is given by

$$E[XY] = \int\!\!\int xy\,p(x,y)\,dx\,dy.$$

The covariance between X and Y is given by

$$\begin{aligned}\mathrm{Cov}[X,Y] &= E[(X-E[X])(Y-E[Y])] = \sum_{x}\sum_{y}(x-E[X])(y-E[Y])\,P(X=x,Y=y) \\ &= E[XY] - E[X]E[Y].\end{aligned}$$

Therefore, when $\mathrm{Cov}[X,Y]=0$ we have $E[XY]=E[X]E[Y]$, and X and Y are said to be uncorrelated. Clearly $\mathrm{Cov}[X,X]=\mathrm{Var}[X]$. The covariance matrix for a d-dimensional continuous random variable x is obtained from

$$\mathrm{Cov}[\mathbf{x}] = \begin{bmatrix} \mathrm{Cov}(x_1,x_1) & \cdots & \mathrm{Cov}(x_1,x_d) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_d,x_1) & \cdots & \mathrm{Cov}(x_d,x_d) \end{bmatrix}.$$

Conjugate Priors

In more fully Bayesian methods one treats both variables and parameters as random quantities. The use of prior distributions over parameters can provide simple and well justified ways to regularize model parameters and avoid overfitting. Applying the Bayesian modeling philosophy and techniques can lead to simple adjustments to traditional maximum likelihood estimates. In particular, the use of a conjugate prior distribution for a parameter in an appropriately defined probability model means that the posterior distribution over that parameter will remain in the same form as the prior. This makes it easy to adapt traditional maximum likelihood estimates for parameters using simple weighted averages of the maximum likelihood estimate and the relevant parameters of the conjugate prior. We will see how this works for the Bernoulli, categorical, and Gaussian distributions below. Other more sophisticated Bayesian manipulations are also simplified through the use of conjugacy.

Bernoulli, Binomial, and Beta Distributions

The Bernoulli probability distribution is defined for binary random variables. Suppose $x \in \{0,1\}$; the probability of x=1 is given by π and the probability of x=0 is given by 1−π. The probability distribution can be written in the following way

$$P(x;\pi) = \pi^x(1-\pi)^{1-x}.$$

The binomial distribution generalizes the Bernoulli distribution. It defines the probability for a certain number of successes in a sequence of binary experiments, where the outcome of each experiment is governed by a Bernoulli distribution. The probability of exactly k successes in n experiments under the binomial distribution is

$$P(k;n,\pi) = \binom{n}{k}\pi^k(1-\pi)^{n-k},$$

and is defined for k=0, 1, 2, …, n, where

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

is the binomial coefficient. Intuitively, the binomial coefficient is needed to account for the fact that the definition of this distribution ignores the order of the results of the experiments: the k results where x=1 could have occurred anywhere in the sequence of the n experiments. The binomial coefficient gives the number of different ways in which one could have obtained the k results where x=1. Intuitively, the term $\pi^k$ is the probability of exactly k results where x=1, and $(1-\pi)^{n-k}$ is the probability of having exactly n−k results where x=0. These two terms are valid for each of the possible ways in which the sequence of outcomes could have occurred, and we therefore simply multiply by the number of possibilities.

The Beta distribution is defined for a random variable π where $0 \le \pi \le 1$. It uses two shape parameters $\alpha, \beta > 0$ such that

$$P(\pi;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}, \tag{A.1}$$

where $B(\alpha,\beta)$ is the beta function and serves as a normalization constant that ensures that the function integrates to one. The Beta distribution is useful because it can be used as a conjugate prior distribution for the Bernoulli and binomial distributions. Its mean is

$$\pi_B = \frac{\alpha}{\alpha+\beta},$$

and it can be shown that if the maximum likelihood estimate for the Bernoulli distribution is given by $\pi_{ML}$, then the posterior mean $\pi^*$ of the Beta distribution is

$$\pi^* = w\pi_B + (1-w)\pi_{ML},$$

where

$$w = \frac{\alpha+\beta}{\alpha+\beta+n},$$

and n is the number of examples used to estimate $\pi_{ML}$. The use of the posterior mean value $\pi^*$ as the regularized or smoothed estimate, replacing $\pi_{ML}$ in a Bernoulli model, is therefore justified under Bayesian principles by the fact that the mean value of the posterior predictive distribution of a Beta-Bernoulli model is equivalent to plugging the posterior mean parameters of the Beta into the Bernoulli, i.e.,

$$p(x\,|\,\mathcal{D}) = \int_0^1 \mathrm{Bern}(x\,|\,\pi)\,\mathrm{Beta}(\pi\,|\,\mathcal{D})\,d\pi = \mathrm{Bern}(x;\pi^*).$$

This supports the intuitive notion of thinking of α and β as imaginary observations for x=1 and x=0, respectively, and justifies it in a Bayesian sense.
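To make this concrete, here is a small numeric sketch (not from the text) in which α and β act as imaginary counts; the prior parameters and data are arbitrary example values:

alpha, beta = 2.0, 5.0          # prior pseudo-counts for x=1 and x=0
n1, n = 7, 20                   # observed successes and total observations

pi_ml = n1 / n                              # maximum likelihood estimate
pi_b = alpha / (alpha + beta)               # prior (Beta) mean
w = (alpha + beta) / (alpha + beta + n)     # weight on the prior mean
pi_star = w * pi_b + (1 - w) * pi_ml

# the same value follows directly from the Beta posterior Beta(alpha+n1, beta+n-n1)
print(pi_star, (alpha + n1) / (alpha + beta + n))   # both print 0.333...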

Categorical, Multinomial, and Dirichlet Distributions

The categorical distribution is defined for discrete random variables with more than two states; it generalizes the Bernoulli distribution. For K categories one might define $A \in \{a_1, a_2, \ldots, a_K\}$ or $x \in \{1, 2, \ldots, K\}$; however, the order of the integers used to encode the categories is arbitrary. If the probability of x being in state or category k is given by $\pi_k$, and if we use a one hot encoding for a vector representation x in which all the elements of x are zero except for exactly one dimension that is equal to 1, representing the state or category of x, then the categorical distribution is

$$P(\mathbf{x};\boldsymbol{\pi}) = \prod_{k=1}^{K}\pi_k^{x_k}.$$

The multinomial distribution generalizes the categorical distribution. Given multiple independent observations of a discrete random variable with a fixed categorical probability πk for each class k, the multinomial distribution defines the probability of observing a particular number of instances of each category. If the vector x is defined as the number of times each category has been observed, then the multinomial distribution can be expressed as

$$P(\mathbf{x};n,\boldsymbol{\pi}) = \left(\frac{n!}{x_1!\cdots x_K!}\right)\prod_{k=1}^{K}\pi_k^{x_k}.$$

The Dirichlet distribution is defined for a random variable or parameter vector π such that $\pi_1,\ldots,\pi_K > 0$, $\pi_1,\ldots,\pi_K < 1$, and $\pi_1 + \pi_2 + \cdots + \pi_K = 1$, which is precisely the form of π used to define the categorical and multinomial distributions above. The Dirichlet distribution with parameters $\alpha_1,\ldots,\alpha_K > 0$, $K \ge 2$, is

$$P(\boldsymbol{\pi};\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{k=1}^{K}\pi_k^{\alpha_k-1},$$

where B(α), the multinomial beta function, serves as the normalization constant that ensures that the function integrates to one:

$$B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)},$$

where $\Gamma(\cdot)$ is the gamma function.

The Dirichlet distribution is useful because it can be used as a conjugate prior distribution for the categorical and multinomial distributions. Its mean (vector) is

$$\boldsymbol{\pi}_D = \frac{\boldsymbol{\alpha}}{\sum_{k=1}^{K}\alpha_k}.$$

And it generalizes the case of the Bernoulli distribution with a Beta prior. That is, it can be shown that if the traditional maximum likelihood estimate for the categorical distribution is given by $\boldsymbol{\pi}_{ML}$, then the posterior of a model consisting of a categorical likelihood and a Dirichlet prior has the form of a Dirichlet distribution with mean

$$\boldsymbol{\pi}^* = w\boldsymbol{\pi}_D + (1-w)\boldsymbol{\pi}_{ML},$$

where

$$w = \frac{\alpha_K}{\alpha_K + n}, \qquad \alpha_K = \sum_{k=1}^{K}\alpha_k,$$

and where n is the number of examples used to estimate $\boldsymbol{\pi}_{ML}$. The use of the posterior mean value $\boldsymbol{\pi}^*$ as the regularized or smoothed estimate to replace $\boldsymbol{\pi}_{ML}$ in a categorical probability model is therefore justified under Bayesian principles by the fact that the mean value of the posterior predictive distribution of a categorical model with a Dirichlet prior is equivalent to plugging the posterior mean parameters of the Dirichlet posterior into the categorical probability model, i.e.,

$$p(x\,|\,\mathcal{D}) = \int_{\boldsymbol{\pi}} \mathrm{Cat}(x\,|\,\boldsymbol{\pi})\,\mathrm{Dirichlet}(\boldsymbol{\pi}\,|\,\mathcal{D})\,d\boldsymbol{\pi} = \mathrm{Cat}(x;\boldsymbol{\pi}^*).$$

Again the intuitive notion of thinking of each of the elements αk of the parameter vector α for the Dirichlet as imaginary observations is justified under a Bayesian analysis.

Estimating the Parameters of a Discrete Distribution

Suppose we wish to estimate the parameters of a discrete probability distribution, of which the binary distribution is a special case. Let the probability of a variable being in category k be $\pi_k$, and write the parameters of the distribution as the length-K vector π. Encode each example using a one hot vector $\mathbf{x}_i$, i=1,…, N, which is all zero except for one dimension that corresponds to the observed category, where $x_{i,k}=1$. The probability of a dataset can be expressed as

$$P(\mathbf{x}_1,\ldots,\mathbf{x}_N;\boldsymbol{\pi}) = \prod_{i=1}^{N}\prod_{k=1}^{K}\pi_k^{x_{i,k}}.$$

If nk is the number of times that each class k in the data has been observed, the log-likelihood of the data is

$$\log P(n_1,\ldots,n_K;\boldsymbol{\pi}) = \sum_{k=1}^{K} n_k \log \pi_k.$$

To ensure that the parameter vector defines a valid probability, the log-likelihood is augmented with a term involving a Lagrange multiplier λ that enforces the constraint that the probabilities sum to one:

$$L = \sum_{k=1}^{K} n_k\log\pi_k + \lambda\left[1 - \sum_{k=1}^{K}\pi_k\right].$$

Taking the derivative of this function with respect to λ and setting the result to zero tells us that the sum over the probabilities in our model should be 1 (as desired). We then take the derivative of the function with respect to each parameter and set it to zero, which gives

$$\frac{\partial L}{\partial \pi_k} = 0 \;\;\Rightarrow\;\; n_k = \lambda\pi_k.$$

We can solve for λ by summing both sides over k:

$$\sum_{k=1}^{K} n_k = \lambda\sum_{k=1}^{K}\pi_k \;\;\Rightarrow\;\; \lambda = \sum_{k=1}^{K} n_k = N.$$

Therefore we can determine that the gradient of the augmented objective function is zero when

$$\pi_k = \frac{n_k}{N}.$$

This simple result should be in line with your intuition about how to estimate probabilities.

We discussed above how specifying a Dirichlet prior for the parameters can regularize the estimation problem and compute a smoothed probability $\pi_k^*$. The regularization can equivalently be viewed as imaginary data or counts $\alpha_k$ for each class k, to give an estimate

$$\pi_k^* = \frac{n_k + \alpha_k}{N + \alpha_K}, \qquad \alpha_K = \sum_{k=1}^{K}\alpha_k,$$

which can also be written

$$\pi_k^* = \left[\frac{\alpha_K}{N+\alpha_K}\right]\left(\frac{\alpha_k}{\alpha_K}\right) + \left[\frac{N}{N+\alpha_K}\right]\left(\frac{n_k}{N}\right).$$

This also follows from the analysis above that expressed the smoothed probability vector $\boldsymbol{\pi}^*$ as a weighted combination of the prior probability vector $\boldsymbol{\pi}_D$ and the maximum likelihood estimate $\boldsymbol{\pi}_{ML}$, $\boldsymbol{\pi}^* = w\boldsymbol{\pi}_D + (1-w)\boldsymbol{\pi}_{ML}$.
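
A short sketch (not from the text) of both estimates for example counts; the counts and the Dirichlet pseudo-counts are arbitrary:

import numpy as np

counts = np.array([5, 0, 3, 12])        # n_k: observed counts for K=4 categories
alpha = np.array([1.0, 1.0, 1.0, 1.0])  # Dirichlet pseudo-counts ("imaginary data")

N = counts.sum()
alpha_K = alpha.sum()

pi_ml = counts / N                               # pi_k = n_k / N
pi_star = (counts + alpha) / (N + alpha_K)       # smoothed estimate

# equivalently, a weighted combination of the prior mean and the MLE
w = alpha_K / (N + alpha_K)
print(np.allclose(pi_star, w * (alpha / alpha_K) + (1 - w) * pi_ml))  # True
print(pi_ml, pi_star)    # smoothing gives the zero-count category nonzero probability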

The Gaussian Distribution

The one-dimensional Gaussian probability distribution has the following form:

$$P(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],$$

where the parameters of the model are its mean μ and variance σ² (the standard deviation σ is simply the square root of the variance). Given N examples $x_i$, i=1,…, N, the maximum likelihood estimates of these parameters are

$$\mu = \frac{1}{N}\sum_{i=1}^{N}x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2.$$

When estimating the variance, the equation above is sometimes modified to use N–1 in place of N in the denominator to obtain an unbiased estimate, giving the standard deviation as

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\mu)^2},$$

especially with sample sizes less than 10. This is known as the (corrected) sample standard deviation.

The Gaussian distribution can be generalized from one to two dimensions, or indeed to any number of dimensions. Consider a two-dimensional model consisting of independent Gaussian distributions for each dimension, which is equivalent to a model with a diagonal covariance matrix when written using matrix notation. We can transform from scalar to matrix notation for a two-dimensional Gaussian distribution:

$$\begin{aligned}
P(x_1,x_2) &= \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\!\left[-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}\right]\frac{1}{\sqrt{2\pi}\,\sigma_2}\exp\!\left[-\frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right] \\
&= (2\pi)^{-1}(\sigma_1^2\sigma_2^2)^{-1/2}\exp\!\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\begin{bmatrix}\sigma_1^2 & 0 \\ 0 & \sigma_2^2\end{bmatrix}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} \\
&= (2\pi)^{-1}|\boldsymbol{\Sigma}|^{-1/2}\exp\!\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\},
\end{aligned}$$

where the covariance matrix of the model is given by Σ, the vector x=[x1 x2]T, and the mean vector μ=[μ1 μ2]T. This progression of equations is true because the inverse of a diagonal matrix is simply a diagonal matrix consisting of one over each of the original diagonal elements, which explains how the scalar notation converts to the matrix notation for an inverse covariance matrix. The covariance matrix is the matrix with this entry on row i and column j:

$$\Sigma_{ij} = \mathrm{cov}(x_i,x_j) = E[(x_i-\mu_i)(x_j-\mu_j)],$$

where $E[\cdot]$ refers to the expected value and $\mu_i = E[x_i]$. The mean can be computed in vector form:

$$\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i.$$

The equation for estimating a covariance matrix is

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T.$$

In general the multivariate Gaussian distribution can be written

$$P(x_1,x_2,\ldots,x_d) = (2\pi)^{-d/2}|\boldsymbol{\Sigma}|^{-1/2}\exp\!\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}.$$

When a variable is to be modeled with a Gaussian distribution with mean μ and covariance matrix Σ, it is common to write P(x)=N(x; μ, Σ). Notice the semicolon: this implies that the mean and covariance will be treated as parameters. In contrast, the “|” (or “given”) symbol is used when the parameters are treated as variables and their uncertainty is to be modeled. Treating parameters as random variables is popular in Bayesian techniques such as latent Dirichlet allocation.
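
A brief sketch (not from the text) of the maximum likelihood estimates and of evaluating the multivariate Gaussian density, using randomly generated example data:

import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[2.0, 0.6], [0.6, 1.0]], size=500)

mu = X.mean(axis=0)                              # (1/N) sum_i x_i
Sigma = (X - mu).T @ (X - mu) / len(X)           # (1/N) sum_i (x_i - mu)(x_i - mu)^T

def gaussian_pdf(x, mu, Sigma):
    # evaluates the d-dimensional Gaussian density N(x; mu, Sigma)
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(mu, Sigma)                                 # close to the generating parameters
print(gaussian_pdf(np.array([1.0, -1.0]), mu, Sigma))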

Useful Properties of Linear Gaussian Models

Consider a Gaussian random variable x with mean μ and covariance matrix A, $p(\mathbf{x}) = N(\mathbf{x};\boldsymbol{\mu},\mathbf{A})$, and a random variable y whose conditional distribution given x is Gaussian with mean Wx+b and covariance matrix B, $p(\mathbf{y}\,|\,\mathbf{x}) = N(\mathbf{y};\mathbf{W}\mathbf{x}+\mathbf{b},\mathbf{B})$. The marginal distribution of y and conditional distribution of x given y can be written

$$p(\mathbf{y}) = N(\mathbf{y};\,\mathbf{W}\boldsymbol{\mu}+\mathbf{b},\,\mathbf{B}+\mathbf{W}\mathbf{A}\mathbf{W}^T), \qquad p(\mathbf{x}\,|\,\mathbf{y}) = N\!\big(\mathbf{x};\,\mathbf{C}[\mathbf{W}^T\mathbf{B}^{-1}(\mathbf{y}-\mathbf{b})+\mathbf{A}^{-1}\boldsymbol{\mu}],\,\mathbf{C}\big),$$

respectively, where $\mathbf{C} = (\mathbf{A}^{-1}+\mathbf{W}^T\mathbf{B}^{-1}\mathbf{W})^{-1}$.
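
These expressions can be checked by simulation. The following Monte Carlo sketch (not from the text) samples from the model with arbitrary example values of μ, A, W, b, and B and compares the empirical mean and covariance of y with Wμ+b and B+WAW^T:

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 0.0])
A = np.array([[1.0, 0.3], [0.3, 0.5]])      # covariance of x
W = np.array([[2.0, 0.0], [1.0, -1.0]])
b = np.array([0.5, 0.5])
B = np.array([[0.2, 0.0], [0.0, 0.1]])      # conditional covariance of y given x

x = rng.multivariate_normal(mu, A, size=200000)
y = x @ W.T + b + rng.multivariate_normal(np.zeros(2), B, size=200000)

print(y.mean(axis=0), W @ mu + b)                 # empirical mean vs W mu + b
print(np.cov(y.T, bias=True), B + W @ A @ W.T)    # empirical covariance vs B + W A W^T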

Probabilistic PCA and the Eigenvectors of a Covariance Matrix

When explaining principal component analysis in Section 9.6 we discussed the idea of diagonalizing a covariance matrix Σ and formulated this in terms of finding a matrix of eigenvectors Φ such that $\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi} = \boldsymbol{\Lambda}$, a diagonal matrix. The same objective could be formulated as finding a factorization of the covariance matrix such that $\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^T$. Recall that in our presentation of probabilistic PCA in Chapter 9 we saw that the marginal probability for P(x) under principal component analysis involves a covariance matrix given by $\boldsymbol{\Sigma} = \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$. Therefore, when $\sigma^2 \to 0$ we can see that if $\mathbf{W} = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{1/2}$ we would have precisely the same W that one could obtain from matrix factorization methods based on eigendecomposition. Importantly, for $\sigma^2 > 0$ it can be shown that maximum likelihood learning will produce Ws that are not in general orthogonal (Tipping and Bishop, 1999a, 1999b); however, some more recent work has shown how to impose orthogonality constraints during a maximum likelihood–based optimization procedure.

The Exponential Family of Distributions

The exponential family of distributions includes Gaussian, Bernoulli, Binomial, Beta, Gamma, Categorical, Multinomial, Dirichlet, Chi-squared, Exponential and Poisson, among many others. In addition to their commonly used forms, these distributions can all be written in the standardized exponential family form that makes them easy to work with algebraically:

$$p(x) = h(x)\exp\!\left[\boldsymbol{\theta}^T\mathbf{T}(x) - A(\boldsymbol{\theta})\right],$$

where θ is a vector of natural parameters, T(x) is a vector of sufficient statistics, A(θ) is known as the cumulant generating function, and h(x) is an additional function of x. As an example, for the 1D Gaussian distribution these quantities are $\boldsymbol{\theta} = [\mu/\sigma^2 \;\; -1/(2\sigma^2)]^T$, $\mathbf{T}(x) = [x \;\; x^2]^T$, $h(x) = 1/\sqrt{2\pi}$, and $A(\boldsymbol{\theta}) = \mu^2/(2\sigma^2) + \ln|\sigma|$.
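
A small sketch (not from the text) confirming that this exponential-family form reproduces the ordinary 1D Gaussian density for example parameter values:

import numpy as np

mu, sigma = 1.5, 0.8
x = np.linspace(-3, 5, 7)

theta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])   # natural parameters
T = np.vstack([x, x**2])                                    # sufficient statistics
h = 1.0 / np.sqrt(2 * np.pi)
A = mu**2 / (2 * sigma**2) + np.log(sigma)                  # cumulant generating function

p_expfam = h * np.exp(theta @ T - A)
p_gauss = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.allclose(p_expfam, p_gauss))                        # True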

Variational Methods and the EM Algorithm

With a complex probability model for which the posterior distribution cannot be computed exactly, a method called variational EM can be used. This involves manipulating approximations to the model’s true posterior distribution during an EM optimization procedure. The following variational analysis also helps to show why and how the EM algorithm involving exact posterior distributions works.

Before we begin, when using variational methods with approximate distributions it is helpful to make a distinction between the parameters used to build an approximation to the true posterior distribution and the parameters of the original model. Consider a probability model with a set of hidden variables H and a set of observed variables X. The observed values are given by $\tilde{X}$. Let $p = p(H\,|\,\tilde{X};\theta)$ be the model's exact posterior distribution, and $q = q(H\,|\,\tilde{X};\Phi)$ be a variational approximation, with a set Φ of variational parameters.

To understand how variational methods are used in practice, we first examine the well-known “variational bound.” This is created using two tricks. The first is to divide and multiply by the same quantity; the second is to apply an inequality known as Jensen’s inequality. These allow the construction of a variational lower bound L(q) on the log marginal likelihood:

$$\begin{aligned}
\log p(\tilde{X};\theta) &= \log\sum_{H} p(\tilde{X},H;\theta) = \log\sum_{H} q(H\,|\,\tilde{X};\Phi)\,\frac{p(\tilde{X},H;\theta)}{q(H\,|\,\tilde{X};\Phi)} \\
&\ge \sum_{H} q(H\,|\,\tilde{X};\Phi)\log\frac{p(\tilde{X},H;\theta)}{q(H\,|\,\tilde{X};\Phi)} = E\big[\log P(\tilde{X},H;\theta)\big]_q + H(q) = \mathcal{L}(q).
\end{aligned}$$

Here, H(q) is the entropy of q, which is

$$H(q) = -\sum_{H} q(H\,|\,\tilde{X};\Phi)\log q(H\,|\,\tilde{X};\Phi).$$

The bound L(q) becomes an equality when q=p. In the case of “exact” EM, this confirms that each M-step will increase the likelihood of the data. However, to make the lower bound tight again in preparation for the next M-step, the new exact posterior must be recomputed with the updated parameters as a part of the subsequent E-step.

When q is merely an approximation to p, the relationship between the marginal log-likelihood and the expected log-likelihood under distribution q can be written with an equality as opposed to an inequality:

$$\log P(\tilde{X};\theta) = E\big[\log P(\tilde{X},H;\theta)\big]_q + H(q) + D_{KL}(q\,\|\,p) = \mathcal{L}(q) + D_{KL}(q\,\|\,p).$$

$D_{KL}(q\,\|\,p)$ is known as the Kullback–Leibler (KL) divergence, a measure of the distance between distributions q and p. It is not a true distance in the mathematical sense, but rather a quantity that is always greater than or equal to zero and becomes zero only when q=p. Here it is given by

$$D_{KL}(q\,\|\,p) = \sum_{H} q(H\,|\,\tilde{X};\Phi)\log\frac{q(H\,|\,\tilde{X};\Phi)}{p(H\,|\,\tilde{X};\theta)}.$$

The difference between the log marginal likelihood and the variational bound is given by the KL divergence between the approximate q and the true p. This means that if q is approximate, the bound can be tightened by improving the quality of the approximation q to the true posterior p. So, as we also saw above, when q is not an approximation but equals p exactly, $D_{KL}(q\,\|\,p) = 0$ and

$$\log P(\tilde{X};\theta) = E\big[\log P(\tilde{X},H;\theta)\big]_q + H(q).$$

Variational inference techniques are often used to improve the quality of an approximate posterior distribution within an EM algorithm, and the term “variational EM” refers to this general method. However, the result of a variational inference procedure is sometimes useful in itself. A key feature of variational methods arises from the existence of the variational bound and the fact that algorithms can be formulated that iteratively bring q closer to p in the sense of the KL divergence.

The mean-field approach is one of the simplest variational methods. It minimizes the KL divergence between an approximation, which consists of giving each variable its own separate variational distribution (and parameters), and the true posterior distribution. This is known as a "fully factored variational approximation" and could be written

$$q(H\,|\,\tilde{X};\Phi) = \prod_{j} q_j(h_j\,|\,\tilde{X};\phi_j).$$

Given some initial parameters for the separate distributions for each variable qj=qj(hj), one proceeds to update each variable iteratively, given expectations of the model under the current variational approximation for the other variables. These updates take this general form:

$$q_j(h_j\,|\,\tilde{X};\phi_j) = \frac{1}{Z}\exp\!\Big(E\big[\log P(X,H;\theta)\big]_{\prod_{i\ne j} q_i(h_i)}\Big),$$

where the expectation is performed using the approximate qs for all variables hi other than hj, and Z is a normalization constant obtained by summing over the numerator for all values of hj.

Early work on variational methods for graphical models is well represented in Jordan, Ghahramani, Jaakkola, and Saul (1999). If distributions are placed over parameters as well as hidden variables, variational Bayesian methods and variational Bayesian EM can be used to perform more fully Bayesian learning (Ghahramani and Beal, 2001). Winn and Bishop (2005) gives a good comparison of belief propagation and variational inference methods when viewed as message passing algorithms. The textbooks of Bishop (2006) and Koller and Friedman (2009) provide further detail and more advanced machine learning techniques based on the variational perspective.
