Matrix decomposition is a fundamental tool in linear algebra for understanding the action of a matrix, establishing its suitability to solve a problem, and for solving linear systems more efficiently and effectively. We have encountered an important decomposition already, the eigendecomposition for symmetric matrices (see Section 7.5). The topic of this chapter, the singular value decomposition (SVD), is a tool for more general, even nonsquare matrices. Figure 16.1 demonstrates one application of SVD, image compression.
This chapter allows us to revisit several themes from past chapters: eigenvalues and eigenvectors, the condition number, the least squares solution to an overdetermined system, and more! It provides a good review of some important ideas in linear algebra.
Let A be a 2 × 2 matrix, nonsingular for now. Let v1 and v2 be two unit vectors that are perpendicular to each other, thus they are orthonormal. This means that V = [v1 v2] is an orthogonal matrix: V−1 = VT. In general, A will not map two orthonormal vectors v1, v2 to two orthonormal image vectors u1, u2. However, it is possible for A to map two particular vectors v1 and v2 to two orthogonal image vectors σ1u1 and σ2u2. The two vectors u1 and u2 are assumed to be of unit length; since they are also orthogonal, they are orthonormal.
We formalize this as
\[ A v_1 = \sigma_1 u_1, \quad A v_2 = \sigma_2 u_2, \tag{16.1} \]
with an orthogonal matrix U = [u1 u2] and a diagonal matrix
\[ \Sigma = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}. \]
If Avi = σiui, then each ui is parallel to Avi, and A preserves the orthogonality of the vi.
We now conclude from (16.1) that AV = UΣ, and multiplying on the right by V−1 = VT gives
\[ A = U \Sigma V^T. \tag{16.2} \]
This is the singular value decomposition, SVD for short, of A. The diagonal elements of Σ are called the singular values of A. Let’s now find out how to determine U, Σ, V.
In Section 7.6, we established that symmetric positive definite matrices, such as ATA, have real and positive eigenvalues and their eigenvectors are orthogonal. Considering the SVD in (16.2), we can write
\[ A^T A = V \Sigma^T U^T U \Sigma V^T = V \Lambda' V^T, \tag{16.3} \]
where
\[ \Lambda' = \Sigma^T \Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}. \]
Equation (16.3) states the following: The symmetric positive definite matrix ATA has eigenvalues that are the diagonal entries of Λ′ and eigenvectors as columns of V, which are called the right singular vectors of A. This is the eigendecomposition of ATA.
The symmetric positive definite matrix AAT also has an eigendecomposition,
\[ A A^T = U \Sigma V^T V \Sigma^T U^T = U \Lambda' U^T, \tag{16.4} \]
observing that ΣTΣ = ΣΣT. Equation (16.4) states that the symmetric positive definite matrix AAT has eigenvalues that are the diagonal entries of Λ′ and eigenvectors as the columns of U, and they are called the left singular vectors of A.
It should be no surprise that Λ′ appears in both (16.3) and (16.4), since ATA and AAT share the same (nonzero) eigenvalues.
Now we understand a bit more about the elements of the SVD in (16.2): the singular values, σi, of A are the square roots of the eigenvalues of ATA and AAT, that is,
\[ \sigma_i = \sqrt{\lambda_i}. \]
The columns of V are the eigenvectors of ATA and the columns of U are the eigenvectors of AAT. Observe from (16.1) that we can compute ui = Avi/σi.
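The recipe just described (eigenvalues and eigenvectors of ATA give Σ and V, then ui = Avi/σi gives U) can be sketched in a few lines of Python for a 2 × 2 matrix. The closed-form symmetric eigensolver and all function names below are our own illustration, not from the text.

```python
import math

def sym_eig_2x2(a, b, c):
    """Eigenvalues (descending) and orthonormal eigenvectors of [[a, b], [b, c]]."""
    mean = (a + c) / 2
    disc = math.hypot((a - c) / 2, b)
    l1, l2 = mean + disc, mean - disc
    if abs(b) > 1e-12:
        v1 = (b, l1 - a)                      # eigenvector for the larger eigenvalue
    else:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    n = math.hypot(*v1)
    v1 = (v1[0] / n, v1[1] / n)
    v2 = (-v1[1], v1[0])                      # symmetric matrix: eigenvectors are orthogonal
    return (l1, l2), (v1, v2)

def svd_2x2(A):
    """SVD of a nonsingular 2x2 matrix A, built from the eigendecomposition of A^T A."""
    (a11, a12), (a21, a22) = A
    # entries of the symmetric matrix A^T A = [[a, b], [b, c]]
    a = a11 * a11 + a21 * a21
    b = a11 * a12 + a21 * a22
    c = a12 * a12 + a22 * a22
    (l1, l2), (v1, v2) = sym_eig_2x2(a, b, c)
    s1, s2 = math.sqrt(l1), math.sqrt(l2)     # singular values sigma_i = sqrt(lambda_i)
    # u_i = A v_i / sigma_i
    u1 = ((a11 * v1[0] + a12 * v1[1]) / s1, (a21 * v1[0] + a22 * v1[1]) / s1)
    u2 = ((a11 * v2[0] + a12 * v2[1]) / s2, (a21 * v2[0] + a22 * v2[1]) / s2)
    return (u1, u2), (s1, s2), (v1, v2)

(u1, u2), (s1, s2), (v1, v2) = svd_2x2(((3.0, 1.0), (0.0, 2.0)))
```

Reconstructing σ1u1v1T + σ2u2v2T from these factors recovers the sample matrix to machine precision.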
Let’s start with a very simple example:
\[ A = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}, \]
a symmetric, positive definite matrix that scales in the e1-direction. Then
\[ A^T A = A A^T = \begin{bmatrix} 9 & 0 \\ 0 & 1 \end{bmatrix}, \]
and we can easily calculate the eigenvalues of ATA as λ1 = 9 and λ2 = 1. This means that σ1 = 3 and σ2 = 1. The eigenvectors of AAT and ATA are identical and happen to be the columns of the identity matrix for this simple example, so U = V = I. Now we can form the SVD of A, A = UΣVT:
\[ A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]
This says that the action of A on a vector x, that is Ax, simply amounts to a scaling in the e1-direction, which we observed to begin with!
Notice that the eigenvalues of A are identical to the singular values. In fact, because this matrix is positive definite, the SVD is identical to the eigendecomposition.
Throughout Chapter 5 we examined the action of a 2 × 2 matrix using an illustration of the circular Phoenix mapping to an elliptical Phoenix. This action ellipse can now be described more precisely: the semi-major axis has length σ1 and the semi-minor axis has length σ2. Figure 16.2 illustrates this point for Examples 16.1 and 16.2. In that figure, you see: the semi-axes, the map of [1 0]T (thick point), and the map of [0 1]T (thin point).
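The claim that the semi-axes of the action ellipse have lengths σ1 and σ2 can be checked numerically: map many unit-circle points through A and look at the longest and shortest image vectors. A small sketch, assuming the scaling matrix of Example 16.1 is diag(3, 1), consistent with σ1 = 3 and σ2 = 1 stated there:

```python
import math

def semi_axes(A, samples=3600):
    """Map unit-circle points through A and return the longest/shortest image lengths."""
    lengths = []
    for k in range(samples):
        t = 2 * math.pi * k / samples
        x, y = math.cos(t), math.sin(t)
        image = (A[0][0] * x + A[0][1] * y, A[1][0] * x + A[1][1] * y)
        lengths.append(math.hypot(*image))
    return max(lengths), min(lengths)

# the scaling matrix of Example 16.1: sigma_1 = 3, sigma_2 = 1
big, small = semi_axes([[3.0, 0.0], [0.0, 1.0]])
```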
This example is a little more interesting, as the matrix is now a shear,
\[ A = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}. \]
We compute
\[ A^T A = \begin{bmatrix} 1 & 2 \\ 2 & 5 \end{bmatrix} \quad \text{and} \quad A A^T = \begin{bmatrix} 5 & 2 \\ 2 & 1 \end{bmatrix}, \]
and we observe that these two matrices are no longer identical, but they are both symmetric. As they are 2 × 2 matrices, we can easily calculate the eigenvalues as λ1 = 3 + 2√2 ≈ 5.83 and λ2 = 3 − 2√2 ≈ 0.17. (Remember: ATA and AAT share the same nonzero eigenvalues.) These eigenvalues result in singular values σ1 = 2.41 and σ2 = 0.41. The eigenvectors of ATA are the orthonormal column vectors of
\[ V = \begin{bmatrix} 0.38 & -0.92 \\ 0.92 & 0.38 \end{bmatrix}, \]
and the eigenvectors of AAT are the orthonormal column vectors of
\[ U = \begin{bmatrix} 0.92 & -0.38 \\ 0.38 & 0.92 \end{bmatrix}. \]
Now we can form the SVD of A, A = UΣVT:
\[ A = \begin{bmatrix} 0.92 & -0.38 \\ 0.38 & 0.92 \end{bmatrix} \begin{bmatrix} 2.41 & 0 \\ 0 & 0.41 \end{bmatrix} \begin{bmatrix} 0.38 & 0.92 \\ -0.92 & 0.38 \end{bmatrix}. \]
Figure 16.3 will help us break down the action of A in terms of the SVD. It is now clear that V and U are rotation or reflection matrices and Σ scales, deforming the circle into an ellipse.
Notice that the eigenvalues of A are λ1 = λ2 = 1, making the point that, in general, the singular values are not the eigenvalues!
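This separation of eigenvalues from singular values is easy to check numerically. The text's shear matrix is not reproduced above, so the sketch below assumes a shear entry of 2, chosen because it reproduces the quoted values σ1 = 2.41 and σ2 = 0.41; the singular values come from the eigenvalues of ATA, never from the eigenvalues of A itself:

```python
import math

# assumed shear (the off-diagonal entry 2 reproduces sigma_1 = 2.41, sigma_2 = 0.41)
A = [[1.0, 2.0],
     [0.0, 1.0]]

# entries of A^T A = [[1, 2], [2, 5]]
a = A[0][0] ** 2 + A[1][0] ** 2
b = A[0][0] * A[0][1] + A[1][0] * A[1][1]
c = A[0][1] ** 2 + A[1][1] ** 2

# eigenvalues of A^T A are 3 +/- 2*sqrt(2); singular values are their square roots
mean, disc = (a + c) / 2, math.hypot((a - c) / 2, b)
sigma1, sigma2 = math.sqrt(mean + disc), math.sqrt(mean - disc)

# the eigenvalues of the triangular matrix A itself are its diagonal entries: 1 and 1
```

Note that σ1σ2 = 1, matching |det A| = 1, even though neither singular value equals an eigenvalue of A.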
Now we come full circle and look at what we have solved in terms of our original question that was encapsulated by (16.1): What orthonormal vectors vi are mapped to orthogonal vectors σiui? The SVD provides a solution to this question by providing V, U, and Σ. Furthermore, note that for this nonsingular case, the columns of V form a basis for the row space of A and the columns of U form a basis for the column space of A.
It should be clear that the SVD is not limited to invertible 2 × 2 matrices, so let’s look at the SVD more generally.
The SVD development of the previous section assumed that A was square and invertible. This had the effect that ATA had nonzero eigenvalues and well-defined eigenvectors. However, everything still works if A is neither square nor invertible. Just a few more aspects of the decomposition come into play.
For the general case, A will be a rectangular matrix with m rows and n columns, mapping $\mathbb{R}^n$ to $\mathbb{R}^m$. As a result of this freedom in the dimensions, the matrices of the SVD (16.2) must be modified. Illustrated in Figure 16.4 is each scenario, m < n, m = n, or m > n. The matrix dimensions are as follows: A is m × n, U is m × m, Σ is m × n, and VT is n × n. The matrix Λ′ in (16.3) is n × n and in (16.4) is m × m, but they still hold the same nonzero eigenvalues because the rank of a matrix cannot exceed min{m, n}.
Again we ask, what orthonormal vectors vi are mapped by A to orthogonal vectors σiui, where the ui are orthonormal? This is encapsulated in (16.1). In the general case as well, the matrices U and V form bases; however, as we are considering rectangular and singular matrices, the rank r of A plays a part in the interpretation of the SVD. The following are the main SVD properties; for a detailed exposition, see Strang [16].
- The first r columns v1, ..., vr of V form an orthonormal basis for the row space of A.
- The remaining columns vr+1, ..., vn span the null space of A.
- The first r columns u1, ..., ur of U form an orthonormal basis for the column space of A.
- The remaining columns ur+1, ..., um span the null space of AT.
Examples will make this clearer.
Let A be given by
The first step is to form ATA and AAT and find their eigenvalues and (normalized) eigenvectors, which make up the columns of an orthogonal matrix.
The rank of A is 2, thus there are two singular values, σ1 and σ2.
The SVD of A = UΣVT:
Figure 16.5 illustrates the elements of the SVD and the action of A.
Because m > n, u3 is in the null space of AT, that is ATu3 = 0.
For a matrix of different dimensions, we pick
The first step is to form ATA and AAT and find their eigenvalues and (normalized) eigenvectors, which make up the columns of an orthogonal matrix.
The matrix A is rank 2, thus there are two singular values, σ1 and σ2.
The SVD of A = UΣVT:
Figure 16.6 illustrates the elements of the SVD and the action of A.
Because m < n, v3 is in the null space of A, that is, Av3 = 0.
Let’s look at a fairly simple example, a rank deficient matrix,
\[ A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \]
which is a projection into the [e1, e2]-plane. Because A is symmetric and idempotent, A = ATA = AAT. It is easy to see that A = UΣVT with U = V = I and Σ = A.
Since the rank is two, the first two columns of U and V form an orthonormal basis for the column and row spaces of A, respectively. The e3 vector is projected to the zero vector by A, and thus this vector spans the null space of A and AT.
Generalizing the 2 × 2 case, the σi are the lengths of the semi-axes of an ellipsoid. As before, the semi-major axis is length σ1 and the length of the semi-minor axis is equal to the smallest singular value.
The SVD is an important tool for dealing with rank deficient matrices.
This section is titled “Steps” rather than “Algorithm” to emphasize that the description here is simply a review of our introduction to the SVD. A robust algorithm that is efficient in terms of computing time and storage will be found in an advanced numerical methods text.
Let’s summarize the steps for finding the SVD of an m × n matrix A.
1. Form the symmetric matrix ATA.
2. Compute the eigenvalues λi of ATA and order them so that λ1 ≥ λ2 ≥ ....
3. Set σi = √λi; these singular values are the diagonal entries of the m × n matrix Σ.
4. Compute the normalized eigenvectors vi of ATA; they form the columns of V.
5. The normalized eigenvectors of AAT form the columns of U.
You have now found the singular value decomposition of A, which is A = UΣVT.
A note on U: instead of finding the columns as the eigenvectors of AAT, one can compute ui, i = 1, ..., r, as ui = Avi/σi. If m > n, then the remaining ui are found from the null space of AT.
The only “hard” task in this is finding the eigenvalues of ATA. But there are several highly efficient algorithms for this task, taking advantage of the fact that ATA is symmetric. Many of these algorithms will return the corresponding eigenvectors as well.
As we discussed in the context of least squares in Section 13.1, forming ATA can result in an ill-posed problem because the condition number of this matrix is the square of the condition number of A. Thus, numerical methods will avoid direct computation of this matrix by employing a method such as Householder.
As a practical application, we will use the SVD to compute the determinant of a matrix A. We observe that in (16.2), the matrices U and V, being orthogonal, have determinants equal to ±1. Thus,
\[ |\det A| = \det \Sigma = \sigma_1 \sigma_2 \cdots \sigma_n. \tag{16.5} \]
If a 2D triangle has area φ, it will have area σ1σ2φ after being transformed by a 2D linear map with singular values σ1, σ2. Similarly, if a 3D object has volume φ, it will have volume σ1σ2σ3φ after being transformed by a linear map with singular values σ1, σ2, σ3.
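A quick numerical check that the product of the singular values equals |det A|, for a sample 2 × 2 matrix of our own choosing; the singular values again come from the eigenvalues of ATA:

```python
import math

# sample matrix (our own choice)
A = [[2.0, 1.0],
     [1.0, 3.0]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]    # ad - bc

# singular values from the eigenvalues of A^T A
a = A[0][0] ** 2 + A[1][0] ** 2
b = A[0][0] * A[0][1] + A[1][0] * A[1][1]
c = A[0][1] ** 2 + A[1][1] ** 2
mean, disc = (a + c) / 2, math.hypot((a - c) / 2, b)
product_of_sigmas = math.sqrt(mean + disc) * math.sqrt(mean - disc)
```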
Of course one can compute determinants without using singular values. Recall from Section 15.1 the characteristic polynomial of A,
\[ p(\lambda) = \det(A - \lambda I) = (\lambda_1 - \lambda)(\lambda_2 - \lambda) \cdots (\lambda_n - \lambda). \]
Evaluating p at λ = 0, we have
\[ \det A = \lambda_1 \lambda_2 \cdots \lambda_n, \tag{16.6} \]
where the λi are A’s eigenvalues.
The inverse of a matrix, introduced in Sections 5.9 and 12.4, is mainly a theoretical tool for analyzing the solution to a linear system. Additionally, the inverse is limited to square, nonsingular matrices. What might the inverse of a more general type of matrix be? The answer is in the form of a generalized inverse, or the so-called pseudoinverse, and it is denoted as A† (“A dagger”). The SVD is a very nice tool for finding the pseudoinverse, and it is suited for practical use as we shall see in Section 16.6.
Let’s start with a special matrix, an m × n diagonal matrix Σ with diagonal elements σi. The pseudoinverse, Σ†, has diagonal elements given by
\[ \sigma_i^\dagger = \begin{cases} 1/\sigma_i & \text{if } \sigma_i \neq 0, \\ 0 & \text{if } \sigma_i = 0, \end{cases} \]
and its dimension is n × m. If the rank of Σ is r, then the product Σ†Σ holds the r × r identity matrix, and all other elements are zero.
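The rule translates directly into code: transpose the shape and reciprocate the nonzero diagonal entries. A minimal sketch with plain Python lists (the helper names are ours):

```python
def pinv_diag(Sigma):
    """Pseudoinverse of an m x n diagonal matrix stored as nested lists."""
    m, n = len(Sigma), len(Sigma[0])
    P = [[0.0] * m for _ in range(n)]          # Sigma-dagger is n x m
    for i in range(min(m, n)):
        if Sigma[i][i] != 0.0:
            P[i][i] = 1.0 / Sigma[i][i]        # reciprocate nonzero sigma_i
    return P

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Sigma = [[3.0, 0.0],
         [0.0, 2.0],
         [0.0, 0.0]]                           # 3 x 2, rank 2
SdS = matmul(pinv_diag(Sigma), Sigma)          # the 2 x 2 identity
```

Since this Σ has full column rank r = 2, the product Σ†Σ is the 2 × 2 identity, as described above.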
We can use this very simple expression for the pseudoinverse of Σ to express the pseudoinverse for a general m × n matrix A using its SVD,
\[ A^\dagger = V \Sigma^\dagger U^T, \tag{16.7} \]
recalling that U and V are orthogonal matrices.
If A is square and invertible, then A† = A−1. Otherwise, A† still has some properties of an inverse:
\[ A A^\dagger A = A, \tag{16.8} \]
\[ A^\dagger A A^\dagger = A^\dagger. \tag{16.9} \]
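These two inverse-like properties, AA†A = A and A†AA† = A†, can be verified directly in code. The sketch below assumes the projection matrix diag(1, 1, 0) from the rank-deficient example above; its SVD is I Σ I with Σ equal to the matrix itself, so its pseudoinverse is the matrix itself:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# projection into the [e1, e2]-plane; its pseudoinverse is the matrix itself
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0]]
A_dag = A

prop1 = matmul(matmul(A, A_dag), A)            # expect A
prop2 = matmul(matmul(A_dag, A), A_dag)        # expect A_dag
```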
Let’s find the pseudoinverse of the matrix from Example 16.3,
We find
then (16.7) results in
And we check that (16.8) holds:
and also (16.9):
The matrix from Example 16.1 is square and nonsingular, therefore the pseudoinverse is equal to the inverse. Just by visual inspection, the inverse of this scaling matrix is
\[ A^\dagger = A^{-1} = \begin{bmatrix} 1/3 & 0 \\ 0 & 1 \end{bmatrix}, \]
which undoes the scaling by 3 in the e1-direction.
This generalization of the inverse of a matrix is often called the Moore-Penrose generalized inverse. Least squares approximation is a primary application of this pseudoinverse.
The pseudoinverse allows for a concise approximate solution, specifically the least squares solution, to an overdetermined linear system. (A detailed introduction to least squares may be found in Section 12.7.)
In an overdetermined linear system
\[ A x = b, \]
A is a rectangular matrix with dimension m × n, m > n. This means that the linear system has m equations in n unknowns, and it is almost always inconsistent because it is unlikely that b lives in the subspace V defined by the columns of A. The least squares solution finds the orthogonal projection of b into V, which we will call b′. Thus the solution to Ax = b′ produces the vector closest to b that lives in V. This leads us to the normal equations
\[ A^T A x = A^T b, \tag{16.10} \]
whose solution minimizes
\[ \|A x - b\|^2. \tag{16.11} \]
In our introduction to condition numbers in Section 13.4, we discussed that this system can be ill-posed because the condition number of ATA is the square of that of A. To avoid forming ATA, in Section 13.1 we proposed the Householder algorithm as a method to find the least squares solution.
As announced at the onset of this section, the SVD and pseudoinverse provide a numerically stable method for finding the least squares solution to an overdetermined linear system. We find an approximate solution rather easily:
\[ x = A^\dagger b. \tag{16.12} \]
But why is this the least squares solution?
Again, we want to find x to minimize (16.11). Let’s frame the linear system in terms of the SVD of A and take advantage of the orthogonality of U: multiplication by UT does not change the length of a vector, so
\[ \|A x - b\| = \|U \Sigma V^T x - b\| = \|\Sigma V^T x - U^T b\|. \]
This new framing of the problem exposes the substitutions
\[ y = V^T x \quad \text{and} \quad z = U^T b, \]
and leaves us with an easier diagonal least squares problem to solve. The steps involved are as follows.
1. Compute the m × 1 vector z = UTb.
2. Compute the n × 1 vector y = Σ†z.
3. Compute the solution x = Vy.
This is the least squares solution to the m × n problem Σy = z. The least squares solution requires minimizing the length of v = Σy − z, which has the simple form
\[ v = \begin{bmatrix} \sigma_1 y_1 - z_1 \\ \vdots \\ \sigma_r y_r - z_r \\ -z_{r+1} \\ \vdots \\ -z_m \end{bmatrix}, \]
where the rank of Σ is r.
It is easy to see that the y that will minimize v is yi = zi/σi for i = 1, ..., r, hence y = Σ†z.
To summarize, the least squares solution is reduced to a diagonal least squares problem (Σy = z), which requires only simple matrix-vector multiplications. The calculations in reverse order include
\[ x = V y = V \Sigma^\dagger z = V \Sigma^\dagger U^T b = A^\dagger b. \]
We have rediscovered the pseudoinverse, and we have come back to (16.12), while verifying that this is indeed the least squares solution.
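The chain x = VΣ†UTb can be traced end to end on a small synthetic 3 × 2 system. The orthogonal factors and singular values below are invented for illustration; the check at the end is that x satisfies the normal equations (16.10):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# invented SVD factors for a 3 x 2 matrix A = U Sigma V^T
c, s = 0.6, 0.8
U = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]   # orthogonal 3 x 3
V = [[0.8, -0.6], [0.6, 0.8]]                      # orthogonal 2 x 2
Sigma = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Sigma_dag = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0]]     # reciprocated, transposed shape

A = matmul(matmul(U, Sigma), transpose(V))
b = [[1.0], [2.0], [3.0]]

z = matmul(transpose(U), b)                        # z = U^T b
y = matmul(Sigma_dag, z)                           # y = Sigma-dagger z
x = matmul(V, y)                                   # x = V y = A-dagger b

# the pseudoinverse solution satisfies the normal equations A^T A x = A^T b
lhs = matmul(matmul(transpose(A), A), x)
rhs = matmul(transpose(A), b)
```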
Let’s revisit the least squares problem that we solved using the normal equations in Example 12.13 and the Householder method in Example 13.3. The 7 × 2 overdetermined linear system, Ax = b, is
The best fit line coefficients are found in four steps.
Compute the SVD, A = UΣVT. (The matrix dimensions are as follows: U is 7 × 7, Σ is 7 × 2, V is 2 × 2.)
Compute the 7 × 1 vector z = UTb.
Compute the 2 × 1 vector y = Σ†z.
Compute x = Vy,
resulting in the same best fit line, x2 = −0.23x1 + 34.8, as we found via the normal equations and the Householder method.
The normal equations (16.10) give a best approximation
\[ x = (A^T A)^{-1} A^T b \tag{16.13} \]
to the original problem Ax = b. This approximation was developed by considering a new right-hand side vector b′ in the subspace of A, called V. If we substitute the expression for x in (16.13) into Ax = b′, we have an expression for b′ in relation to b,
\[ b' = A (A^T A)^{-1} A^T b = A A^\dagger b. \]
Geometrically, we can see that AA† is a projection because the goal is to project b into the subspace V. A projection must be idempotent as well, and a property of the pseudoinverse in (16.8) ensures this.
This application of the SVD gave us the opportunity to bring together several important linear algebra topics!
Suppose the m × n matrix A has k = min(m, n) singular values σi and, as before, σ1 ≥ σ2 ≥ ... ≥ σk. Using the SVD, it is possible to write A as a sum of k rank one matrices:
\[ A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_k u_k v_k^T. \tag{16.14} \]
This is analogous to what we did in (7.13) for the eigendecomposition.
We can use (16.14) for image compression. An image is comprised of a grid of colored pixels.1 Figure 16.7 (left) is a very simple example; it has only 4 × 4 pixels. Each grayscale is associated with a number, thus this grid can be thought of as a matrix. The singular values for this matrix are σi = 7.1, 3.8, 1.3, 0.3. Let’s refer to the images from left to right as I0, I1, I2, I3. The original image is I0. The matrix
\[ \sigma_1 u_1 v_1^T \]
results in image I1 and the matrix
\[ \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T \]
results in image I2. Notice that the original image is nearly replicated incorporating only half the singular values. This is because σ1 and σ2 are large in comparison to σ3 and σ4. Image I3 is created from $\sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \sigma_3 u_3 v_3^T$. Image I4, which is not displayed, is identical to I0.
Adding the terms σiuiviT corresponding to small σi changes an image only slightly. Thus omitting the terms corresponding to small σk amounts to compressing the original image; there is no severe quality loss for small σk. Furthermore, if some σi are zero, compression can clearly be achieved. This is the case in Figure 16.1, an 8 × 8 matrix with σi = 6.2, 1.7, 1.49, 0, ..., 0. The figure illustrates images corresponding to each nonzero σi. The last image is identical to the input, making it clear that the five remaining σi = 0 are unimportant to image quality.
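The effect of truncating the sum (16.14) can be quantified. In this sketch the "image" is built from two hand-picked orthonormal rank-one terms (not the matrices from Figure 16.7); dropping the term with the small singular value leaves an error whose Frobenius norm is exactly that singular value:

```python
import math

# hand-picked orthonormal u's and v's, and singular values 4 and 1
sigma1, sigma2 = 4.0, 1.0
u1 = [0.5, 0.5, 0.5, 0.5]
u2 = [0.5, -0.5, 0.5, -0.5]
v1 = [0.5, 0.5, -0.5, -0.5]
v2 = [0.5, -0.5, -0.5, 0.5]

def outer(s, u, v):
    """The rank-one matrix s * u * v^T."""
    return [[s * ui * vj for vj in v] for ui in u]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

A = add(outer(sigma1, u1, v1), outer(sigma2, u2, v2))   # the "full image" I0
I1 = outer(sigma1, u1, v1)                              # keep only the largest term

# Frobenius norm of the discarded part A - I1 = sigma2 * u2 * v2^T
err = math.sqrt(sum((x - y) ** 2
                    for ra, rb in zip(A, I1) for x, y in zip(ra, rb)))
```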
Figure 16.8 illustrates a 2D scatter plot: each circle represents a coordinate pair (point) in the [e1, e2]-system. For example, we might want to determine if gross domestic product (GDP) and the poverty rate (PR) for countries in the World Trade Organization are related. We would then record [GDP PR]T as coordinate pairs. How might we reveal trends in this data set?
Let our data set be given as x1, ..., xn, a set of 2D points such that x1 + ... + xn = 0. This condition simply means that the origin is the centroid of the points. The data set in Figure 16.8 has been translated to produce Figure 16.9 (left). Let d be a unit vector. Then the projection of xi onto a line through the origin containing d is a vector with (signed) length xi · d. Let l(d) be the sum of the squares of all these lengths:
\[ l(d) = (x_1 \cdot d)^2 + \cdots + (x_n \cdot d)^2. \]
Imagine rotating d around the origin. For each position of d, we compute the value l(d). For the longest line in the left part of Figure 16.9, the value of l(d) is large in comparison to d orthogonal to it, demonstrating that the value of l(d) indicates the variation in the point set from the line generated by d. Higher variation results in larger l(d).
Let’s arrange the data xi in a matrix
\[ X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}. \]
Then we rewrite l(d) as
\[ l(d) = \|X d\|^2 = d^T X^T X d. \tag{16.15} \]
We further abbreviate C = XTX and note that C is a symmetric, positive definite 2 × 2 matrix. Hence (16.15) describes a quadratic form just as we discussed in Section 7.6.
For which d is l(d) maximal? The answer is simple: for d being the eigenvector corresponding to C’s largest eigenvalue. Similarly, l(d) is minimal for d being the eigenvector corresponding to C’s smallest eigenvalue. Recall that these eigenvectors form the major and minor axis of the action ellipse of C, and the thick lines in Figure 16.9 (left) are precisely these axes. In the right part of Figure 16.9, the quadratic form for the data set is shown along with the data and action ellipse axes. We see that the dominant eigenvector corresponds to highest variance in the data and this is reflected in the quadratic form as well. If λ1 = λ2, then there is no preferred direction in the point set and the quadratic form is spherical. We are guaranteed that the eigenvectors will be orthogonal because C is a symmetric matrix.
Looking more closely at C, we see its very simple form,
\[ C = \begin{bmatrix} \sum_i x_{i,1}^2 & \sum_i x_{i,1} x_{i,2} \\ \sum_i x_{i,1} x_{i,2} & \sum_i x_{i,2}^2 \end{bmatrix}. \]
If each element of C is divided by n, then this is called the covariance matrix. This matrix is a summary of the variation in each coordinate and between coordinates. Dividing by n will result in scaled eigenvalues; the eigenvectors will not change.
The eigenvectors provide a convenient local coordinate frame for the data set. This isn’t a new idea: it is exactly the principle of the eigendecomposition. This frame is commonly called the principal axes. Now we can construct an orthogonal transformation of the data, expressing them in terms of a new, more meaningful coordinate system. Let the matrix V = [v1 v2] hold the normalized eigenvectors as column vectors, where v1 is the dominant eigenvector. The orthogonal transformation of the data X that aligns v1 with e1 and v2 with e2 is simply
\[ \hat{X} = X V. \]
Figure 16.10 (top) illustrates the result of this transformation. Revisit Example 7.6: this is precisely the transformation we used to align the contour ellipse to the coordinate axes. (The transformation is written a bit differently here to accommodate the point set organized as transposed vectors.)
In summary: the data coordinates are now in terms of the trend lines, defined by the eigenvectors of the covariance matrix, and the coordinates directly measure the distance from each trend line. The greatest variance corresponds to the first coordinate in this principal components coordinate system. This leads us to the name of this method: principal components analysis (PCA).
So far, PCA has worked with all components of the given data. However, it can also be used for data compression by reducing dimensionality. Instead of constructing V to hold all eigenvectors, we may use only the most significant, so suppose V = [v1]. This transformation produces the middle image shown in Figure 16.10. If instead V = [v2], then the result is the bottom image; for clarity, we chose to display these points on the e2-axis, but this is arbitrary. Comparing these results: there is greater spread of the data in the middle image, which corresponds to a trend line with higher variance.
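The whole PCA pipeline (center the data, form C = XTX, find the eigenvectors, rotate into the principal components system) fits in a short Python sketch; the 2D data points below are made up for illustration:

```python
import math

# made-up 2D data with an obvious upward trend
pts = [(2.0, 1.2), (4.0, 2.1), (6.0, 2.9), (8.0, 4.1), (10.0, 4.7)]

# center the data so the centroid is the origin
mx = sum(x for x, _ in pts) / len(pts)
my = sum(y for _, y in pts) / len(pts)
X = [(x - mx, y - my) for x, y in pts]

# C = X^T X, a symmetric 2 x 2 matrix [[a, b], [b, c]]
a = sum(x * x for x, _ in X)
b = sum(x * y for x, y in X)
c = sum(y * y for _, y in X)

# dominant eigenvector v1 of C (closed form for 2 x 2; b != 0 for this data)
mean, disc = (a + c) / 2, math.hypot((a - c) / 2, b)
l1 = mean + disc
n = math.hypot(b, l1 - a)
v1 = (b / n, (l1 - a) / n)
v2 = (-v1[1], v1[0])

# coordinates in the principal components system: rows of X times [v1 v2]
Y = [(x * v1[0] + y * v1[1], x * v2[0] + y * v2[1]) for x, y in X]
```

The sum of squared first coordinates of Y equals the dominant eigenvalue l1, confirming that the first principal component captures the direction of highest variance.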
Here we focused on 2D data, but the real power of PCA comes with higher dimensional data for which it is very difficult to visualize and understand relationships between dimensions. PCA makes it possible to identify insignificant dimensions and eliminate them.
Find the SVD for
What is the eigendecomposition of matrix A in Exercise 1?
For what type of matrix are the eigenvalues the same as the singular values?
The action of a 2 × 2 linear map can be described by the mapping of the unit circle to an ellipse. Figure 4.3 illustrates such an ellipse. What are the lengths of the semi-axes? What are the singular values of the corresponding matrix,
Find the SVD for the matrix
Let
and C = ATA. Is one of the eigenvalues of C negative?
For the matrix
show that both (16.5) and (16.6) yield the same result for the absolute value of the determinant of A.
For the matrix
show that both (16.5) and (16.6) yield the same result for the determinant of A.
For the matrix
show that both (16.5) and (16.6) yield the same result for the absolute value of the determinant of A.
What is the pseudoinverse of the matrix from Exercise 5?
What is the pseudoinverse of the matrix
What is the pseudoinverse of the matrix
Note that this matrix is actually a vector.
What is the pseudoinverse of
What is the least squares solution to the linear system Av = b given by:
Use the pseudoinverse and the enumerated steps in Section 16.6. The SVD of A may be found in Example 16.3.
What is the least squares solution to the linear system Av = b given by:
Use the pseudoinverse and the enumerated steps in Section 16.6.
What is the least squares solution to the linear system Av = b given by:
Use the pseudoinverse and the enumerated steps in Section 16.6.
For the following data set X, apply PCA using all eigenvectors. Give the covariance matrix and the final components in the principal components coordinate system.
Make a sketch of the data; this will help with finding the solution.
For the data set in Exercise 17, apply PCA using the dominant eigenvector only.
1We use grayscales here.