12. Dimensional Reduction and Latent Variable Models

12.1 Introduction

Now that you have the tools for exploring graphical models, we’ll cover a few useful ones. We’ll start with factor analysis, which finds application in the social sciences and in recommender systems. We’ll move on to a related model, principal components analysis, and explain how it’s useful for solving the collinearity problem in multiple regression. Next, we’ll cover ICA, which is good for separating signals that have been blended together, and demonstrate it on some psychometric data. Finally, we’ll use latent Dirichlet allocation to discover topics in a collection of text documents.

One thing all of these models have in common is that they’re latent variable models. That means in addition to the measured variables, there are some unobserved variables that underlie the data. You’ll understand what this means more deeply in the context of these models.

We’ll only really scratch the surface of these models. You can find a lot more detail in Murphy’s Machine Learning: A Probabilistic Perspective [24]. Before we go into a few models, let’s explore an important concept: a “prior” for a variable.

12.2 Priors

Let’s talk about making a measurement of a click-through rate on a website. When you show a link to another section of the website, it’s called an impression of that link. When a user receives an impression, they can either click the link or not click it. A click will be considered a “success” and a noisy indication that the user likes the content of the page. A nonclick will be considered a “failure” and a noisy indication that the user doesn’t like the content. The click-through rate is the probability that a person will click the link when they receive an impression, p = P(C = 1 | I = 1). You can estimate it by computing p̂ = clicks/impressions.

If you serve one impression of a link and the user doesn’t click it, your estimate p̂ will be 0/1 = 0. You’d guess that every following user will definitely not click the link! Clearly, that is a pathological result and can cause problems. What if your recommendation system was designed to remove links that performed below a certain level? You’d have some real trouble keeping content on the site.

Fortunately, there’s a less pathological way to learn the click-through rate. You can place a prior on the click rate. To do this, you look at the distribution of all of the past values of the click rates (even better, fit a distribution to them with MCMC!). Then, having no data on the new link’s click rate, you at least know that it’s drawn from that past distribution, as shown in Figure 12.1.

Figure 12.1 The distribution of click-through rates for past hyperlinks

This distribution represents your prior knowledge of the click-through rate, in the sense that it’s what you know before you collect any data. You know here, for example, that you’ve never seen a link with a click rate as high as p = 0.5. It’s extremely unlikely that the current link performs that well.
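
If you don’t want to set up MCMC, a quick alternative is to fit a beta distribution to the historical click rates by maximum likelihood with scipy. This is a minimal sketch under that assumption; the historical rates below are made up purely for illustration.

import numpy as np
from scipy import stats

# Hypothetical click-through rates observed on past links.
historical_rates = np.array([0.08, 0.12, 0.09, 0.11, 0.07,
                             0.10, 0.13, 0.09, 0.06, 0.14])

# Fit a beta distribution by maximum likelihood, pinning the
# support to [0, 1] so only alpha and beta are estimated.
alpha, beta, _, _ = stats.beta.fit(historical_rates, floc=0, fscale=1)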

Now, as you collect data, you want to update your prior with the new information. The resulting distribution is called the posterior density for p. Suppose the true value is p = 0.2. After 20 impressions, you get a posterior that looks like Figure 12.2. You can see the distribution is narrower, so you’re more confident that the true value lies in a smaller range. The distribution has also shifted to the right, closer to the true value of p = 0.2. This represents the state of your knowledge after combining what you’ve learned from the new data with what you knew from the past data!

Figure 12.2 Your confidence in the click rate of the current item, given your knowledge of past click rates as well as your new data

Mathematically, the way this works is that you say the prior, P(p), takes on some distribution that you’ve fit using past data. In this case, it’s a beta distribution with parameters chosen to give an average value of around 0.1, with a little variance around it. You can denote this by p ∼ Beta(α, β), which you can read as “p is drawn from a beta distribution with parameters α and β.”

Next, you need a model for the data. You can say that an impression is an opportunity for a click, and the click is a “success.” That makes the clicks a binomial random variable, with success probability p and I trials: one for each impression. You can say that it takes the distribution P(C | p, I) and that C | p, I ∼ Bin(I, p): knowing p and I, C is drawn from a binomial distribution with parameters I and p.

Let’s assume impressions are a fixed parameter, I. You can say the distribution for p given I and C (our data) is

P(p | I, C) = P(C | p, I) P(p) / P(C) = P(C | p, I) P(p) / ∫ P(C | p, I) P(p) dp

(by Bayes’ theorem and the chain rule for conditional probability). You can see that the data, C and I, is used to update the prior, P(p), by multiplication to get P(p | I, C).

It’s a fun exercise to derive this posterior and see what distribution it simplifies to!
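
To see the update in action without deriving anything, here’s a minimal numerical sketch: it evaluates the numerator of Bayes’ theorem on a grid of p values and normalizes. The prior parameters and the new data here are made up for illustration.

import numpy as np
from scipy import stats

alpha, beta = 10.0, 90.0        # illustrative prior, mean around 0.1
clicks, impressions = 4, 20     # illustrative new data

# Unnormalized posterior: likelihood times prior, on a grid of p values.
p_grid = np.linspace(0, 1, 1001)
unnormalized = (stats.binom.pmf(clicks, impressions, p_grid)
                * stats.beta.pdf(p_grid, alpha, beta))

# Normalize so the posterior integrates to (approximately) one over [0, 1].
posterior = unnormalized / (unnormalized.sum() * (p_grid[1] - p_grid[0]))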

12.3 Factor Analysis

Factor analysis tries to model N observed k-dimensional vectors, x_i, by describing each data point with a smaller set of f < k unmeasured (latent) variables. You write the data points as being drawn from a distribution, as follows:

p(x_i | z_i, θ) = N(W z_i + μ, Ψ),    (12.1)

where θ represents all of the parameters, θ = (W, μ, Ψ). W is a k by f matrix, where f is the number of latent variables, μ is a global mean for the x_i, and Ψ is the k by k covariance matrix.

The matrix W describes how each factor in z contributes to the values of the components of x_i. It’s said to describe how much the factors load onto the components of x_i and so is called the factor loading matrix. Given a vector z of the values of the latent variables that correspond to a data point x, W transforms z to the (mean-centered) expected value of x.

The point of this is to use a model that is simpler than the data, so the major simplifying assumption you’ll make is that the matrix Ψ is diagonal. Note that this doesn’t mean that the covariance matrix for x_i is diagonal! It’s the conditional covariance matrix, given z_i, that is diagonal. This means that z_i accounts for the covariance structure in x_i.

Importantly, the latent factors z_i characterize each data point: they’re a condensed representation of the information in the data. This is where factor analysis is useful. You can map from high-dimensional data down to much lower-dimensional data and explain most of the variance using much less information.

You use a normal prior for the z variables in factor analysis, z_i ∼ N(μ_0, Σ_0). This makes it easy to calculate the marginal distribution of x_i. You’ll denote the PDF of the normal distribution with mean μ and covariance matrix Σ, evaluated at x, by N(x; μ, Σ). Then, you can find the marginal distribution of x_i as follows:

p(x_i | θ) = ∫ p(x_i | z_i, θ) p(z_i | θ) dz_i
           = ∫ N(x_i; W z_i + μ, Ψ) N(z_i; μ_0, Σ_0) dz_i    (12.2)
           = N(x_i; W μ_0 + μ, Ψ + W Σ_0 W^T)

You can actually get rid of the W μ_0 term by absorbing it into the fitted parameter μ. You can also turn the Σ_0 term into the identity, since you’re free to absorb Σ_0 into W by redefining W as W Σ_0^(1/2). This lets you fix W Σ_0 W^T = W W^T.

From this analysis, you can see that you’re modeling the covariance matrix of the x_i using the lower-rank matrix W and the diagonal matrix Ψ. You can write the approximate covariance matrix for the x_i as follows:

Cov(x_i) ≈ W W^T + Ψ    (12.3)

Factor analysis has been applied often in the social sciences and finance. It can be good for reducing a complicated problem into a simpler, more interpretable one. An example is where x_i is the difference between the true returns on an asset and the expected returns. Then, the factors z_i are risk factors, and the weights W determine the asset’s sensitivity to those risk factors. Some of these factors might include inflation risk and market risk [25].
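
If you want to try the model out, scikit-learn provides a maximum-likelihood implementation in sklearn.decomposition.FactorAnalysis. Here’s a minimal sketch on synthetic data; the data-generating choices (two factors, ten measurements, the noise scale) are arbitrary and just for illustration.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)

# Synthetic data: two hidden factors drive ten noisy measurements.
true_factors = rng.normal(size=(500, 2))
true_loadings = rng.normal(size=(2, 10))
X = true_factors.dot(true_loadings) + rng.normal(scale=0.1, size=(500, 10))

model = FactorAnalysis(n_components=2)
Z = model.fit_transform(X)          # estimated latent factors, z_i
W = model.components_               # estimated factor loading matrix
psi = model.noise_variance_         # estimated diagonal of Psi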

12.4 Principal Components Analysis

Principal components analysis (PCA) is a nice algorithm for dimensional reduction. It’s actually just a special case of factor analysis: if you constrain Ψ = σ²I, let W be orthonormal, and let σ² → 0, then you have PCA. If you let σ² stay nonzero, then you have probabilistic PCA.

PCA is useful because it projects the data onto the principal components of the data set. The principal components are the eigenvectors of the covariance matrix. The first principal component is the eigenvector corresponding to the largest eigenvalue, the second component corresponds to the second largest, and so on.

The principal component projections have a nice property: they’re uncorrelated. This means the projected data has no covariance between components, so you can use PCA as a preprocessing step for regression to avoid the collinearity problem. The principal components are arranged such that the first eigenvector (principal component) accounts for the most variance in the data set, the second one the second most, and so on. The interpretation is that you can look at the mapping of your data set onto just a few principal components and still capture most of the variance (see, for example, [26], pages 485–6).
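
To make the connection to eigenvectors concrete, here’s a bare-bones sketch of PCA via an eigendecomposition of the covariance matrix. It’s for illustration only; library implementations such as sklearn’s typically use the SVD instead for numerical stability.

import numpy as np

def pca_project(X, n_components):
    """Project X onto the top principal components of its covariance matrix."""
    X_centered = X - X.mean(axis=0)
    covariance = np.cov(X_centered, rowvar=False)
    # eigh returns eigenvalues in ascending order; take the largest ones.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return X_centered.dot(eigenvectors[:, order])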

It’s easy to generate some multivariate normal data for an example.

import numpy as np
import pandas as pd

# 5,000 draws from a correlated 2-D gaussian.
X = pd.DataFrame(np.random.multivariate_normal([2, 2],
                                                [[1, .5], [.5, 1]],
                                                size=5000),
                 columns=['$x_1$', '$x_2$'])

You can fit PCA using sklearn’s implementation. Here, we just want to see the principal components of the data, so we’ll use as many components as there are dimensions. Usually, you’ll want to use fewer.

from sklearn.decomposition import PCA

model = PCA(n_components=2)
model = model.fit(X)

You can see the principal components in model.components_. Plotting these on top of the data results in Figure 12.3.

Figure 12.3 The two principal components of a 2-D gaussian data set

You can see that when you find z_i by projecting the x_i onto these components, you should have no covariance. If you look at the original data, you find the correlation matrix in Figure 12.4. After transforming, you find the correlations in Figure 12.5.

Figure 12.4 The correlation matrix for the x_i

Figure 12.5 The correlation matrix for the z_i

pd.DataFrame(model.transform(X), columns=['$z_1$', '$z_2$']).corr()

You can see that the correlations are now gone, so you’ve removed the collinearity in the data!

12.4.1 Complexity

The time complexity for PCA is O(min(N³, C³)), where N is the number of samples and C is the number of features [27]. Assuming there are roughly as many samples as features, that’s just O(N³) [28, 29].

12.4.2 Memory Considerations

You should have enough memory available to factorize your C × N matrix and find the eigenvalue decomposition.

12.4.3 Tools

This is a widely implemented algorithm. sklearn makes an implementation available at sklearn.decomposition.PCA. You can also find PCA implemented in C++ by the mlpack library.

12.5 Independent Component Analysis

When a data set is produced by a number of independent sources all jumbled together, independent component analysis (ICA) offers a technique to separate them for analysis. The largest signals can then be modeled, ignoring smaller signals as noise. Similarly, if there is a great deal of noise superimposed on a small signal, they can be separated such that the small signal isn’t eclipsed.

ICA is a model that is similar to factor analysis. In this case, instead of using a Gaussian prior for the z_i, you use any non-Gaussian prior. It turns out that the factor analysis solution is only determined up to a rotation of the factors! If you want a unique set of factors, you have to choose a non-Gaussian prior.

Let’s apply this technique to try to find the big five personality traits in some online data from http://personality-testing.info/.

There are about 19,000 responses in this data set to 50 questions. Each response is on a five-point scale. You can see the correlation matrix for this data in Figure 12.6. You’ll notice that there are blocks along the diagonal. This is a well-designed questionnaire, built by selecting and testing questions: the test was developed using factor analysis methods to find which questions gave the best measurements of the latent factors. These questions, the x_i data, are then used to measure the factors, the z_i data, which are interpreted as personality traits. The questions that measure each factor are given in Figure 12.7.

Figure 12.6 The correlation matrix for the big-five personality trait data. Notice the blocks of high and low correlation, forming checkered patterns.

Figure 12.7 These are the questions underlying the big five personality traits. There are ten questions per trait, and they can correlate both positively and negatively with the underlying trait.

You can run the model on this data set, using sklearn’s FastICA implementation, which will generate Figure 12.8.

from sklearn.decomposition import FastICA
from seaborn import heatmap  # assuming seaborn's heatmap is the one used here

# X and questions come from loading the big-five response data (not shown).
model = FastICA(n_components=5)
model = model.fit(X[questions])
heatmap(model.components_, cmap='RdBu')

You can see in Figure 12.8 that the factors are along the y-axis, and the questions are along the x-axis. When a question is relevant for a factor, its factor loading is strongly positive or strongly negative.

Figure 12.8 The matrix W for the big-five personality data

Finally, you can translate the sets of question responses into scores on the latent factors using the transformation in the model.

Z = model.transform(X[questions])

You can histogram the columns of the Z matrix to see the distributions of individuals’ scores (e.g., extraversion in Figure 12.9)!

Figure 12.9 The distribution of individuals’ extraversion scores
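
To reproduce a plot like Figure 12.9, a minimal matplotlib sketch looks like the following. Which column of Z corresponds to extraversion depends on your fit, so the column index here is an assumption.

import matplotlib.pyplot as plt

# Histogram the scores on one latent factor (assumed here to be extraversion).
plt.hist(Z[:, 0], bins=50)
plt.xlabel('Extraversion score')
plt.ylabel('Number of respondents')
plt.show()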

12.5.1 Assumptions

The main assumption of this technique is the independence of the signal sources. Another important consideration is having at least as many observed mixtures as source signals, though this isn’t always strictly necessary. Finally, a signal being extracted can’t be any more complex than the mixture it was extracted from (see, for example, pages 13–14 of [30]).

12.5.2 Complexity

There are fast algorithms for running ICA, including the fixed-point FastICA algorithm [31], which runs quickly in practice, and a nonparametric kernel density estimation approach that runs in O(N log N) time [32].

12.5.3 Memory Considerations

You should be able to fit at least C × N elements in memory.

12.5.4 Tools

ICA is another of those algorithms that’s implemented all over the place. FastICA is one of the more popular implementations; you can find it in scikit-learn, and there’s also a C++ implementation. You’ll find even more at http://research.ics.aalto.fi/ica/fastica/.

12.6 Latent Dirichlet Allocation

Topic modeling, or latent Dirichlet allocation (LDA), is a good illustration of the power of Bayesian methods. The problem is that you have a collection of text documents that you’d like to organize into categories but no good way to do it. There are too many documents to read them all, but you’re fine putting them into an algorithm to do the work.

You distill the documents down into a bag-of-words representation, often keeping just one part of speech (e.g., nouns). Different nouns are associated at different levels with the various topics. For example, “football” almost certainly comes from a sports topic, while “yard” could refer to distance gained in a game of football or to someone’s property; it might belong to both the “sports” and “homeowner” topics. Now, you just need a procedure to generate this text.

We’ll describe a process that maps to a graphical model. The process will generate these bags of words. Then, you can adjust the parameters describing the process to fit the data set and discover the latent topics associated with your documents. First, we should define some terms; we need to formalize the concept of a topic.

We’ll describe a topic formally as a discrete probability distribution over the vocabulary. If you know a text is about one specific topic, this distribution represents the word distribution of that document (although in practice, documents tend to be combinations of topics). In the sports topic, for example, you might find the word football occurring far more frequently than in other topics. It still might not be a common word overall, so its probability may be relatively low, but it will be much larger than it would be in other topics. In practice, sorting the words in a topic by their probability of being chosen is a good way to understand the topics you’ve discovered.

You also need a formal definition for a document. We mentioned that we’ll represent documents as bags of words. We’ll represent the topic membership of the document as a distribution over topics. If a document has a lot of weight on one topic, then any particular word from that document will likely be drawn from that topic. Now we’re ready to describe our generative process.

The process will go like this:

  • Draw the word distributions that define each topic.

  • Draw a topic distribution for each document.

  • For each word position in each document, draw a topic from the document’s topic distribution and then draw a word from the topic.

In that way, you generate the whole set of documents. With this process, you can fit your documents to discover the topic distributions for each document that generate the text, as well as the word distributions for those topics. First, you have to turn the whole process into a graphical model. To do that, you have to define priors as well, so you can draw the word distributions that define your topics and the topic distributions that describe your documents.

Each of these distributions is categorical: it’s a normalized probability distribution over a collection of items (either topics or words). You need a probability distribution you can use to draw these distributions from! Let’s think of a simpler case first: a Bernoulli distribution, or unfair coin flip.

The parameter for the Bernoulli distribution is a probability of a success. What you want to do for your word distributions is draw a probability of choosing a word. You can think of it as a higher-dimensional coin flip. To draw the coin flip probability, you need a distribution whose domain is between zero and one, and the beta distribution is a good candidate.

The beta distribution has a higher-dimensional generalization: the Dirichlet distribution. Each draw from a Dirichlet distribution is a normalized vector of probabilities. You can use Dirichlet distributions as the priors for your topic distributions (documents) and word distributions (topics).
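
You can check this with a quick numpy draw; the concentration parameters here are arbitrary.

import numpy as np

# One draw from a 5-dimensional Dirichlet: a probability vector that sums to 1.
probabilities = np.random.dirichlet([1.0, 1.0, 1.0, 1.0, 1.0])
print(probabilities, probabilities.sum())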

Typically, people choose uniform priors for the Dirichlet distributions and leave them with relatively high variance. That allows you to have unlikely words and topics but biases slightly toward everything being equally likely.

With the priors defined, you can formalize the process a little more. For M documents, N topics, and V words in the vocabulary, you can describe it as follows (a short simulation sketch follows the list):

  • Draw a word distribution, t_i (a topic), from Dir(α) for each i ∈ {1, ..., N}.

  • Draw a topic distribution, d_j (a document), from Dir(β) for each j ∈ {1, ..., M}.

  • For the kth word position in document j, draw a topic t ∈ {1, ..., N} from d_j, and then draw the word from topic t.
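
To make the generative process concrete, here’s a small simulation sketch. It only shows how documents are assumed to arise, not how the model is fit, and the sizes, document lengths, and concentration parameters are arbitrary choices for illustration.

import numpy as np

rng = np.random.RandomState(0)
M, N, V = 4, 3, 20            # documents, topics, vocabulary size
alpha, beta = 0.5, 0.5        # symmetric Dirichlet concentrations

# Each topic is a distribution over the vocabulary.
topics = rng.dirichlet(alpha * np.ones(V), size=N)

documents = []
for j in range(M):
    # Each document gets its own distribution over topics.
    topic_weights = rng.dirichlet(beta * np.ones(N))
    words = []
    for _ in range(50):  # 50 word positions per document
        t = rng.choice(N, p=topic_weights)        # pick a topic for this position
        words.append(rng.choice(V, p=topics[t]))  # pick a word from that topic
    documents.append(words)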

Methods for fitting LDA are beyond our scope; see [33] for the details.

There are some important rules of thumb to get topic modeling to work well. We’ll summarize them as follows:

  • Tokenize, stem, and n-gram your text into bags of words.

  • Keep just the nouns from the text.

  • Remove any word that occurs fewer than five times.

  • Short documents (e.g., under ten words) will often have too much uncertainty in their topics to be useful.

Now, let’s check out gensim [34] to see how to fit a topic model! We’ll use the 20-newsgroups data set in sklearn as an example. We’ll use nltk to do the stemming and part-of-speech tagging and otherwise use gensim to do the heavy lifting. You may need to download some models for nltk; it will prompt you with error messages if so.

First, let’s see what the part of speech tagger’s output looks like.

import nltk

phrase = "The quick gray fox jumps over the lazy dog"
text = nltk.word_tokenize(phrase)
nltk.pos_tag(text)

That produces the following output:

[('The', 'DT'),
 ('quick', 'JJ'),
 ('gray', 'JJ'),
 ('fox', 'NN'),
 ('jumps', 'NNS'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]

You can look up parts of speech with queries like this:

nltk.help.upenn_tagset('NN')
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...

You want just the nouns, so we’ll keep the “NN” tags. Proper nouns have the “NNP” tag, and you’ll want those too. See the nltk documentation for a full listing of tags to see which others you might want to include. This code will do the trick:

# Keep only common nouns (NN) and proper nouns (NNP).
desired_tags = ['NN', 'NNP']
nouns_only = []
for document in newsgroups_train['data']:
    tokenized = nltk.word_tokenize(document)
    tagged = nltk.pos_tag(tokenized)
    nouns_only.append([word for word, tag in tagged
                       if tag in desired_tags])

Now, you want to stem the words. You’ll use a language-specific stemmer, since those will do a better job.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Replace each noun with its stem, in place.
for i, bag_of_words in enumerate(nouns_only):
    for j, word in enumerate(bag_of_words):
        nouns_only[i][j] = stemmer.stem(word)

Now, you want to count the words and keep only those with five or more occurrences.

from collections import Counter

# Count how many times each word occurs across the whole corpus.
word_counts = Counter()
for bag_of_words in nouns_only:
    for word in bag_of_words:
        word_counts[word] += 1

# Keep only words that occur at least five times.
for i, bag_of_words in enumerate(nouns_only):
    nouns_only[i] = [word for word in bag_of_words
                     if word_counts[word] >= 5]

If you count the distinct terms before and after stemming, you find 135,348 terms before and 110,858 after; stemming gets rid of about 25,000 terms. Finally, you need to map your documents into the bag-of-words representation understood by gensim. It expects each document as a list of (word id, count) tuples, where the id is an integer index for the word and the count is the number of times the word occurs in the document. You can build the word-index mapping and then use it when you create the tuples.

# Map integer ids to words and back.
dictionary = {i: word for i, word in enumerate(word_counts.keys())}
word_index = {v: k for k, v in dictionary.items()}

# Convert each document to gensim's list of (word id, count) tuples.
for i, bag_of_words in enumerate(nouns_only):
    counts = Counter(bag_of_words)
    nouns_only[i] = [(word_index[word], count)
                     for word, count in counts.items()]

Now, you can finally run the model. If you pass in the word mapping, there’s a nice method to print the top terms in the topics you’ve discovered.

from gensim import models

model = models.LdaModel(nouns_only,
                        id2word=dictionary,
                        num_topics=20)
model.show_topics()

That produces the output shown here:

 1 [(12,
 2   '0.031*"god" + 0.030*">" + 0.013*"jesus" + 0.010*"x" + 0.009*"@"
 3    + 0.008*"law" + 0.006*"encrypt" + 0.006*"time"
 4    + 0.006*"christian" + 0.006*"subject"'),
 5  (19,
 6   '0.011*"@" + 0.011*"stephanopoulo" + 0.010*"subject"
 7    + 0.009*"univers" + 0.008*"organ" + 0.005*"ripem"
 8    + 0.005*"inform" + 0.005*"greec" + 0.005*"church"
 9    + 0.005*"greek"'),
10  (16,
11   '0.022*"team" + 0.017*"@" + 0.016*"game" + 0.011*">"
12    + 0.010*"season" + 0.010*"hockey" + 0.009*"year" + 0.009*"subject"
13    + 0.009*"organ" + 0.008*"nhl"'),
14  (18,
15   '0.056*"@" + 0.045*">" + 0.016*"subject" + 0.015*"organ"
16    + 0.011*"re" + 0.009*"articl" + 0.007*"<" + 0.007*"univers"
17    + 0.006*"nntp-posting-host" + 0.004*"comput"'),
18  (1,
19   '0.075*"*" + 0.022*"@" + 0.016*"sale" + 0.013*"subject"
20    + 0.013*"organ" + 0.011*"univers" + 0.008*"nntp-posting-host"
21    + 0.006*"offer" + 0.005*"distribut" + 0.005*"tape"'),
22  (11,
23   '0.040*"q" + 0.021*"presid" + 0.014*"mr." + 0.006*"packag"
24    + 0.006*"event" + 0.006*">" + 0.005*"handler" + 0.005*"@"
25    + 0.004*"mack" + 0.004*"whaley"'),
26  (13,
27   '0.037*"@" + 0.020*"drive" + 0.018*"subject" + 0.017*"organ"
28    + 0.011*"card" + 0.010*"univers" + 0.010*"problem" + 0.009*"disk"
29    + 0.009*"system" + 0.008*"nntp-posting-host"'),
30  (6,
31   '0.065*">" + 0.055*"@" + 0.015*"organ" + 0.014*"subject"
32    + 0.013*"re" + 0.012*"articl" + 0.007*"|" + 0.006*"univers"
33    + 0.006*"<" + 0.006*"nntp-posting-host"'),
34  (7,
35   '0.053*">" + 0.039*"@" + 0.011*"subject" + 0.010*"organ"
36    + 0.010*"re" + 0.009*"articl" + 0.006*"univers" + 0.006*"]"
37    + 0.005*"<" + 0.004*"year"'),
38  (17,
39   '0.160*"|" + 0.028*"@" + 0.015*"subject" + 0.014*"organ"
40    + 0.013*"/" + 0.008*"\" + 0.008*"nntp-posting-host" + 0.008*">"
41    + 0.007*"+" + 0.007*"univers"')]

You can see a few things in these results. First, there is a lot of extra junk in the text that we haven’t filtered out yet. Before even tokenizing, you should use regular expressions to strip out the nonword characters, punctuation, and markup. You can also see some common terms like subject and nntp-posting-host that are likely to occur in any topic. These are frequent terms that you should either filter out or suppress with term weighting. Gensim has some nice tutorials on its website, and we recommend going there to see how to use term frequency-inverse document frequency (tf-idf) weighting to suppress common terms without having to remove them from the data.
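
As a rough sketch of that idea, gensim’s TfidfModel can reweight the bag-of-words corpus built earlier. Note that LDA formally expects integer counts, so treat feeding it reweighted documents as an experiment rather than the standard recipe.

from gensim import models

# Learn tf-idf weights from the (word id, count) corpus built earlier...
tfidf = models.TfidfModel(nouns_only)
# ...and reweight each document before trying a topic model on it.
weighted_corpus = tfidf[nouns_only]
model = models.LdaModel(weighted_corpus, id2word=dictionary, num_topics=20)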

You can also see that there is something discovered in the data despite the noise. The first topic has god, jesus, and christian in its top terms, so it likely corresponds to a religion newsgroup. You could check which newsgroups weight heavily on this topic to see for yourself.

A quick way to remove bad characters is to filter out everything that isn’t a letter or a number using regular expressions. You can use Python’s re package to do this, replacing the matches with spaces; the extra spaces disappear when the text is tokenized.

import re

filtered = [re.sub('[^a-zA-Z0-9]+', ' ', document) for document
            in newsgroups_train['data']]

If you run through the same procedure again after filtering, you get the following:

 1 [(3,
 2   '0.002*"polygon" + 0.002*"simm" + 0.002*"vram" + 0.002*"lc"
 3    + 0.001*"feustel" + 0.001*"dog" + 0.001*"coprocessor"
 4    + 0.001*"nick" + 0.001*"lafayett" + 0.001*"csiro"'),
 5  (9,
 6   '0.002*"serdar" + 0.002*"argic" + 0.001*"brandei"
 7    + 0.001*"turk" + 0.001*"he" + 0.001*"hab" + 0.001*"islam"
 8    + 0.001*"tyre" + 0.001*"zuma" + 0.001*"ge"'),
 9  (5,
10   '0.004*"msg" + 0.003*"god" + 0.002*"religion" + 0.002*"fbi"
11    + 0.002*"christian" + 0.002*"cult" + 0.002*"atheist"
12    + 0.002*"greec" + 0.001*"life" + 0.001*"robi"'),
13  (11,
14   '0.004*"pitt" + 0.004*"gordon" + 0.003*"bank"
15    + 0.002*"pittsburgh" + 0.002*"cadr" + 0.002*"geb"
16    + 0.002*"duo" + 0.001*"skeptic" + 0.001*"chastiti"
17    + 0.001*"talent"'),
18  (8,
19   '0.003*"access" + 0.003*"gun" + 0.002*"cs" + 0.002*"edu"
20    + 0.002*"x" + 0.002*"dos" + 0.002*"univers"
21    + 0.002*"colorado" + 0.002*"control" + 0.002*"printer"'),
22  (15,
23   '0.002*"athen" + 0.002*"georgia" + 0.002*"covington"
24    + 0.002*"drum" + 0.001*"aisun3" + 0.001*"rsa"
25    + 0.001*"arromde" + 0.001*"ai" + 0.001*"ham"
26    + 0.001*"missouri"'),
27  (16,
28   '0.002*"ranck" + 0.002*"magnus" + 0.002*"midi"
29    + 0.002*"alomar" + 0.002*"ohio" + 0.001*"skidmor"
30    + 0.001*"epa" + 0.001*"diablo" + 0.001*"viper"
31    + 0.001*"jbh55289"'),
32  (19,
33   '0.002*"islam" + 0.002*"stratus" + 0.002*"com"
34    + 0.002*"mot" + 0.002*"comet" + 0.001*"virginia"
35    + 0.001*"convex" + 0.001*"car" + 0.001*"jaeger"
36    + 0.001*"gov"'),
37  (0,
38   '0.005*"henri" + 0.003*"dyer" + 0.002*"zoo"
39    + 0.002*"spencer" + 0.002*"zoolog" + 0.002*"spdcc"
40    + 0.002*"prize" + 0.002*"outlet" + 0.001*"tempest"
41    + 0.001*"dresden"'),
42  (14,
43   '0.006*"god" + 0.005*"jesus" + 0.004*"church"
44    + 0.002*"christ" + 0.002*"sin" + 0.002*"bibl"
45    + 0.002*"templ" + 0.002*"mari" + 0.002*"cathol"
46    + 0.002*"jsc"')]

These results are pretty obviously better but still aren’t perfect. There are many parameters to tune with algorithms like these, so this should put you in a good place to start experimenting.

We should mention one powerful aspect of gensim. It can operate on a streaming iterable of documents instead of a batch held in memory, which lets you work with very large data sets without loading them all at once. As its tutorials suggest, gensim is appropriate for modeling the topics within Wikipedia and other large corpora.
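
For example, any object whose __iter__ yields one bag-of-words document at a time works as a corpus, so a sketch like the following streams documents from disk. The one-document-per-line file format and the class name are assumptions for illustration.

from collections import Counter

class StreamingCorpus:
    """Stream one (word id, count) document at a time from a text file."""

    def __init__(self, path, word_index):
        self.path = path              # assumed format: one document per line
        self.word_index = word_index  # word -> integer id mapping from above

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                counts = Counter(line.split())
                yield [(self.word_index[word], count)
                       for word, count in counts.items()
                       if word in self.word_index]

# models.LdaModel(StreamingCorpus('documents.txt', word_index), ...) then
# works the same way as passing an in-memory list of documents.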

12.7 Conclusion

In this chapter, you saw how to use Bayesian networks and other latent variable methods to reduce the dimensionality of data and discover latent structure. We walked through some examples that are useful for analyzing survey responses and other categorical data, and we went through a detailed example of topic modeling. Ideally, you’re now familiar enough with latent variable models to be able to use them confidently!
