PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA algorithm can thus be used to project feature vectors into a lower-dimensional space, and an ML model can then be trained on the reduced feature vectors. The following example shows how to project 6D feature vectors onto four principal components. Suppose you have feature vectors as follows:

import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.dense(3.5, 2.0, 5.0, 6.3, 5.60, 2.4),
  Vectors.dense(4.40, 0.10, 3.0, 9.0, 7.0, 8.75),
  Vectors.dense(3.20, 2.40, 0.0, 6.0, 7.4, 3.34))

Now let's create a DataFrame from it, as follows:

val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)

The preceding code produces a feature DataFrame containing 6D feature vectors for PCA:

Figure 18: Creating a feature DataFrame (6-dimensional feature vectors) for PCA

Now let's instantiate the PCA model by setting necessary parameters as follows:

import org.apache.spark.ml.feature.PCA

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(4)
  .fit(df)

To distinguish the output from the input column, we set the output column to pcaFeatures using the setOutputCol() method. Then, we set the target dimensionality of the PCA, that is, the number of principal components, using setK(). Finally, we fit the model on the DataFrame. Note that the fitted PCA model includes an explainedVariance member, which reports the proportion of variance captured by each principal component; a model saved by an older Spark version that did not compute this value will load with an empty vector for explainedVariance. Now let's show the resulting features:
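For example, you can inspect the explained variance and the principal components matrix directly, since both are members of the fitted model:

// Proportion of variance explained by each of the four principal components
println(pca.explainedVariance)

// The principal components matrix: rows correspond to original features, columns to components
println(pca.pc)
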

val result = pca.transform(df).select("pcaFeatures") 
result.show(false)

The preceding code produces a feature DataFrame containing the 4D principal components as feature vectors:


Figure 19: Four-dimensional principal components (PCA features)
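
As mentioned at the beginning of this section, an ML model can now be trained on the reduced feature vectors. The following is a minimal sketch, assuming an unsupervised setting since our toy data carries no labels; it clusters the 4D principal components with Spark's KMeans (k = 2 and the seed are arbitrary choices for illustration):

import org.apache.spark.ml.clustering.KMeans

// Cluster the 4D principal components; k = 2 is an arbitrary choice for this sketch
val kmeans = new KMeans()
  .setFeaturesCol("pcaFeatures")
  .setK(2)
  .setSeed(12345L)

val kmeansModel = kmeans.fit(result)

// Each row gets a cluster assignment in the "prediction" column
kmeansModel.transform(result).show(false)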