How to do it...

  1. The Cleveland Heart Disease database is a published dataset widely used by ML researchers. The dataset contains more than a dozen fields, and experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, and 4) from absence (value 0) of heart disease, recorded in the goal (num) column, which is the 14th column.

  2. The Cleveland Heart Disease dataset is available at http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data.

  3. The dataset contains the following 14 attributes, which appear as the header of the table below: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and num.

For a detailed explanation of the individual attributes, refer to http://archive.ics.uci.edu/ml/datasets/Heart+Disease
  4. The dataset will look like the following:

| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|-----|
| 63  | 1   | 1  | 145      | 233  | 1   | 2       | 150     | 0     | 2.3     | 3     | 0  | 6    | 0   |
| 67  | 1   | 4  | 160      | 286  | 0   | 2       | 108     | 1     | 1.5     | 2     | 3  | 3    | 2   |
| 67  | 1   | 4  | 120      | 229  | 0   | 2       | 129     | 1     | 2.6     | 2     | 2  | 7    | 1   |
| 37  | 1   | 3  | 130      | 250  | 0   | 0       | 187     | 0     | 3.5     | 3     | 0  | 3    | 0   |
| 41  | 0   | 2  | 130      | 204  | 0   | 2       | 172     | 0     | 1.4     | 1     | 0  | 3    | 0   |
| 56  | 1   | 2  | 120      | 236  | 0   | 0       | 178     | 0     | 0.8     | 1     | 0  | 3    | 0   |
| 62  | 0   | 4  | 140      | 268  | 0   | 2       | 160     | 0     | 3.6     | 3     | 2  | 3    | 3   |
| 57  | 0   | 4  | 120      | 354  | 0   | 0       | 163     | 1     | 0.6     | 1     | 0  | 3    | 0   |
| 63  | 1   | 4  | 130      | 254  | 0   | 2       | 147     | 0     | 1.4     | 2     | 1  | 7    | 2   |
| 53  | 1   | 4  | 140      | 203  | 1   | 2       | 155     | 1     | 3.1     | 3     | 0  | 7    | 1   |
| 57  | 1   | 4  | 140      | 192  | 0   | 0       | 148     | 0     | 0.4     | 2     | 0  | 6    | 0   |
| 56  | 0   | 2  | 140      | 294  | 0   | 2       | 153     | 0     | 1.3     | 2     | 0  | 3    | 0   |
| 56  | 1   | 3  | 130      | 256  | 1   | 2       | 142     | 1     | 0.6     | 2     | 1  | 6    | 2   |
| 44  | 1   | 2  | 120      | 263  | 0   | 0       | 173     | 0     | 0       | 1     | 0  | 7    | 0   |
| 52  | 1   | 3  | 172      | 199  | 1   | 0       | 162     | 0     | 0.5     | 1     | 0  | 7    | 0   |
| 57  | 1   | 3  | 150      | 168  | 0   | 0       | 174     | 0     | 1.6     | 1     | 0  | 3    | 0   |
| ... | ... | ... | ...     | ...  | ... | ...     | ...     | ...   | ...     | ...   | ...| ...  | ... |

  5. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included (a sample sbt build sketch follows).
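If you manage the project with sbt, a build definition along the following lines pulls in the required Spark artifacts. This is only a sketch under assumptions: the Scala and Spark versions shown are placeholders and should be matched to your own installation.

// Hypothetical build.sbt sketch; adjust the versions to your Spark distribution.
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.8",
  "org.apache.spark" %% "spark-sql"   % "2.4.8",
  "org.apache.spark" %% "spark-mllib" % "2.4.8"
)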
  6. Set up the package location where the program will reside:

package spark.ml.cookbook.chapter11

  7. Import the necessary packages for the Spark session:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
  8. Create Spark's configuration and Spark session so we can have access to the cluster:
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder
.master("local[*]")
.appName("MyPCA")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()
  9. We read in the original raw data file and count the records:
val dataFile = "../data/sparkml2/chapter11/processed.cleveland.data"
val rawdata = spark.sparkContext.textFile(dataFile).map(_.trim)
println(rawdata.count())

In the console, we get the following:

303
  10. We pre-process the dataset, as shown in the following code:
val data = rawdata.filter(text => !(text.isEmpty || text.indexOf("?") > -1))
.map { line =>
val values = line.split(',').map(_.toDouble)

Vectors.dense(values)
}

println(data.count())

data.take(2).foreach(println)

In the preceding code, we filter out records with missing values (marked with ?) and use Spark's DenseVector to hold each row. After filtering the missing data, we get the following count in the console:

297

The two records printed by data.take(2) will look like the following:

[63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0.0]
[67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2.0]
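As an aside, the same cleanup can be sketched with the DataFrame CSV reader, treating ? as null and dropping incomplete rows. This is not the recipe's code path; the column names and the reuse of dataFile below are assumptions carried over from the steps above.

// Alternative sketch (assumption): load the same file with the CSV reader,
// map "?" to null, and drop any row that contains a null.
val cols = Seq("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
  "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
val csvDF = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .option("nullValue", "?")
  .csv(dataFile)
  .toDF(cols: _*)
  .na.drop()
println(csvDF.count())  // should also report 297 complete records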
  11. We create a DataFrame from the data RDD, and create a PCA object for the computation:
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(4)
.fit(df)
  12. The parameters for the PCA model are shown in the preceding code. We set K to 4; K is the number of top principal components to keep after the dimensionality reduction.
  13. An alternative is also available via the MLlib RowMatrix API: mat.computePrincipalComponents(4). Here, 4 again denotes the number of top principal components kept after the dimensionality reduction; see the sketch after this item.
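A minimal sketch of that RowMatrix route is shown below, reusing the data RDD of dense vectors built earlier; the conversion to the older mllib vector type and the variable names are assumptions for illustration.

// Convert ml.linalg vectors to mllib.linalg vectors so RowMatrix can consume them.
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(data.map(v => OldVectors.dense(v.toArray)))
val pc = mat.computePrincipalComponents(4)   // 14 x 4 matrix of principal components
val projected = mat.multiply(pc)             // rows projected onto the top 4 components
projected.rows.take(2).foreach(println)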
  14. We use the transform function to do the computation and show the result in the console:
val pcaDF = pca.transform(df)
val result = pcaDF.select("pcaFeatures")
result.show(false)

The pcaFeatures column is displayed on the console. What you are seeing are the four new PCA components (PC1, PC2, PC3, and PC4), which can be substituted for the original 14 features. We have successfully mapped the high-dimensional space (14 dimensions) to a lower-dimensional space (four dimensions). A quick check of how much variance these four components retain is sketched below.
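The fitted PCAModel exposes an explainedVariance vector (available since Spark 2.0), which reports the proportion of variance captured by each component; the short check below is a sketch, and the exact numbers will depend on the data.

// pca is the fitted PCAModel from the earlier step; explainedVariance holds the
// proportion of variance captured by each of the K = 4 components.
val explained = pca.explainedVariance
println(explained)
println(s"Total variance retained by 4 components: ${explained.toArray.sum}")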

  15. You can also track the job from the Spark driver's web UI at http://localhost:4040/jobs.
  16. We then close the program by stopping the Spark session:
spark.stop()