Using SVD for Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a tool used primarily for dimensionality reduction. We can use it to examine a dataset and find which dimensions and linear subspaces are the most salient. While there are several ways to implement this, we will show you how to perform PCA using the SVD.

We'll do this as follows: we will work with a dataset that exists in 10 dimensions. We will start by creating two vectors that are heavily weighted in their first entries and zero elsewhere:

vals = [np.float32([10,0,0,0,0,0,0,0,0,0]), np.float32([0,10,0,0,0,0,0,0,0,0])]
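If you are starting this example in a fresh script or IPython session rather than continuing directly from the previous cuSolver example, you will also need the usual imports and initialization; a minimal sketch looks like this:

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from skcuda import linalg

# initialize the libraries that skcuda.linalg relies on
linalg.init()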

We will then add 9,000 additional vectors: 6,000 of these will be the same as the first two vectors, only with a little added random white noise, and the remaining 3,000 will just be random white noise:

for i in range(3000):
    vals.append(vals[0] + 0.001*np.random.randn(10))
    vals.append(vals[1] + 0.001*np.random.randn(10))
    vals.append(0.001*np.random.randn(10))

We will now cast the vals list to a float32 NumPy array. We then take the mean across the rows (that is, the mean of each column) and subtract this vector from every row; centering the data in this way is a necessary step for PCA. Finally, we transpose the matrix, since cuSolver requires that an input matrix have no more rows than columns:

vals = np.float32(vals)
vals = vals - np.mean(vals, axis=0)
v_gpu = gpuarray.to_gpu(vals.T.copy())

We will now run cuSolver, just like we did previously, and copy the output values off of the GPU:

U_d, s_d, V_d = linalg.svd(v_gpu, lib='cusolver')

u = U_d.get()
s = s_d.get()
v = V_d.get()

Now we are ready to begin our investigative work. Let's open up IPython and take a closer look at u and s. First, let's look at s; its values are actually the square roots of the principal values, so we will square them and then take a look.
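Something like the following will do this in the session (the exact figures will vary from run to run because of the random noise):

# square the singular values to inspect the principal values
print(s**2)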

You will notice that the first two principal values are of the order 10^5, while the remaining components are of the order 10^-3. This tells us that there is really only a two-dimensional subspace that is relevant to this data at all, which shouldn't be surprising. These are the first and second values, which will correspond to the first and second principal components, that is, the corresponding vectors. Let's take a look at these vectors, which will be stored in u.
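Assuming the same layout as NumPy's own SVD routines, with the left-singular vectors stored as the columns of u, we can print the first two columns like this (again, the exact values will depend on the random noise):

# the first two principal components, assuming they are the columns of u
print(u[:,0])
print(u[:,1])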

You will notice that these two vectors are very heavily weighted in their first two entries, which are of the order 10^-1; the remaining entries are all of the order 10^-6 or lower and are comparatively irrelevant. This is what we should have expected, considering how biased we made our data toward the first two entries. That, in a nutshell, is the idea behind PCA.
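If we wanted to use these components for actual dimensionality reduction, one minimal sketch (assuming, as above, that the principal components are the first two columns of u) would be to project the centered data onto them:

# project each centered 10-dimensional row of vals onto the first two
# principal components, giving a 2-dimensional representation of each point
reduced = vals.dot(u[:,:2])
print(reduced.shape)   # (9002, 2)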
