How to do it...

  1. The Cleveland Heart Disease database is a published dataset widely used by ML researchers. The dataset contains more than a dozen fields, and experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, and 4) from absence (value 0) of heart disease, recorded in the goal (num) column, which is the 14th column.

  2. The Cleveland Heart Disease dataset is available at http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data.

  3. The dataset contains the following 14 attributes, which appear as the header of the table below: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and num.

For a detailed explanation of the individual attributes, refer to http://archive.ics.uci.edu/ml/datasets/Heart+Disease
  4. The dataset will look like the following:

| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|-----|
| 63  | 1   | 1  | 145      | 233  | 1   | 2       | 150     | 0     | 2.3     | 3     | 0  | 6    | 0   |
| 67  | 1   | 4  | 160      | 286  | 0   | 2       | 108     | 1     | 1.5     | 2     | 3  | 3    | 2   |
| 67  | 1   | 4  | 120      | 229  | 0   | 2       | 129     | 1     | 2.6     | 2     | 2  | 7    | 1   |
| 37  | 1   | 3  | 130      | 250  | 0   | 0       | 187     | 0     | 3.5     | 3     | 0  | 3    | 0   |
| 41  | 0   | 2  | 130      | 204  | 0   | 2       | 172     | 0     | 1.4     | 1     | 0  | 3    | 0   |
| 56  | 1   | 2  | 120      | 236  | 0   | 0       | 178     | 0     | 0.8     | 1     | 0  | 3    | 0   |
| 62  | 0   | 4  | 140      | 268  | 0   | 2       | 160     | 0     | 3.6     | 3     | 2  | 3    | 3   |
| 57  | 0   | 4  | 120      | 354  | 0   | 0       | 163     | 1     | 0.6     | 1     | 0  | 3    | 0   |
| 63  | 1   | 4  | 130      | 254  | 0   | 2       | 147     | 0     | 1.4     | 2     | 1  | 7    | 2   |
| 53  | 1   | 4  | 140      | 203  | 1   | 2       | 155     | 1     | 3.1     | 3     | 0  | 7    | 1   |
| 57  | 1   | 4  | 140      | 192  | 0   | 0       | 148     | 0     | 0.4     | 2     | 0  | 6    | 0   |
| 56  | 0   | 2  | 140      | 294  | 0   | 2       | 153     | 0     | 1.3     | 2     | 0  | 3    | 0   |
| 56  | 1   | 3  | 130      | 256  | 1   | 2       | 142     | 1     | 0.6     | 2     | 1  | 6    | 2   |
| 44  | 1   | 2  | 120      | 263  | 0   | 0       | 173     | 0     | 0       | 1     | 0  | 7    | 0   |
| 52  | 1   | 3  | 172      | 199  | 1   | 0       | 162     | 0     | 0.5     | 1     | 0  | 7    | 0   |
| 57  | 1   | 3  | 150      | 168  | 0   | 0       | 174     | 0     | 1.6     | 1     | 0  | 3    | 0   |
| ... | ... | ... | ...     | ...  | ... | ...     | ...     | ...   | ...     | ...   | ...| ...  | ... |

  5. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included (a sample sbt build sketch follows).
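If you manage the project with sbt, a build definition along the following lines pulls in the required Spark artifacts. This is only a sketch under assumptions: the Scala and Spark versions shown are placeholders and should be matched to your own installation.

// Hypothetical build.sbt sketch; adjust the versions to your Spark distribution.
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.8",
  "org.apache.spark" %% "spark-sql"   % "2.4.8",
  "org.apache.spark" %% "spark-mllib" % "2.4.8"
)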
  6. Set up the package location where the program will reside:

package spark.ml.cookbook.chapter11

  7. Import the necessary packages for the Spark session:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
  8. Create Spark's configuration and Spark session so we can have access to the cluster:
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder
.master("local[*]")
.appName("MyPCA")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()
  9. We read in the original raw data file and count the records:
val dataFile = "../data/sparkml2/chapter11/processed.cleveland.data"
val rawdata = spark.sparkContext.textFile(dataFile).map(_.trim)
println(rawdata.count())

In the console, we get the following:

303
  10. We pre-process the dataset, as shown in the following code:
val data = rawdata.filter(text => !(text.isEmpty || text.indexOf("?") > -1))
.map { line =>
val values = line.split(',').map(_.toDouble)

Vectors.dense(values)
}

println(data.count())

data.take(2).foreach(println)

In the preceding code, we filter out records with missing values (marked with ?) and use Spark's DenseVector to hold each row. After filtering the missing data, we get the following count in the console:

297

The two records printed by data.take(2) will look like the following:

[63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0.0]
[67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2.0]
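As an aside, the same cleanup can be sketched with the DataFrame CSV reader, treating ? as null and dropping incomplete rows. This is not the recipe's code path; the column names and the reuse of dataFile below are assumptions carried over from the steps above.

// Alternative sketch (assumption): load the same file with the CSV reader,
// map "?" to null, and drop any row that contains a null.
val cols = Seq("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
  "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
val csvDF = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .option("nullValue", "?")
  .csv(dataFile)
  .toDF(cols: _*)
  .na.drop()
println(csvDF.count())  // should also report 297 complete records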
  11. We create a DataFrame from the data RDD, and create a PCA object for the computation:
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(4)
.fit(df)
  12. The parameters for the PCA model are shown in the preceding code. We set K to 4; K is the number of top principal components to keep after the dimensionality reduction.
  13. An alternative is also available via the MLlib RowMatrix API: mat.computePrincipalComponents(4). Here, 4 again denotes the number of top principal components kept after the dimensionality reduction; see the sketch after this item.
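A minimal sketch of that RowMatrix route is shown below, reusing the data RDD of dense vectors built earlier; the conversion to the older mllib vector type and the variable names are assumptions for illustration.

// Convert ml.linalg vectors to mllib.linalg vectors so RowMatrix can consume them.
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(data.map(v => OldVectors.dense(v.toArray)))
val pc = mat.computePrincipalComponents(4)   // 14 x 4 matrix of principal components
val projected = mat.multiply(pc)             // rows projected onto the top 4 components
projected.rows.take(2).foreach(println)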
  14. We use the transform function to do the computation and show the result in the console:
val pcaDF = pca.transform(df)
val result = pcaDF.select("pcaFeatures")
result.show(false)

The pcaFeatures column is displayed on the console. What you are seeing are the four new PCA components (PC1, PC2, PC3, and PC4), which can be substituted for the original 14 features. We have successfully mapped the high-dimensional space (14 dimensions) to a lower-dimensional space (four dimensions). A quick check of how much variance these four components retain is sketched below.
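The fitted PCAModel exposes an explainedVariance vector (available since Spark 2.0), which reports the proportion of variance captured by each component; the short check below is a sketch, and the exact numbers will depend on the data.

// pca is the fitted PCAModel from the earlier step; explainedVariance holds the
// proportion of variance captured by each of the K = 4 components.
val explained = pca.explainedVariance
println(explained)
println(s"Total variance retained by 4 components: ${explained.toArray.sum}")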

  15. You can also track the job from the Spark driver's web UI at http://localhost:4040/jobs.
  16. We then close the program by stopping the Spark session:
spark.stop()