PCA with the Iris dataset – manual example

The Iris dataset consists of 150 rows and four columns. Each row (observation) represents a single flower, while each column (feature) holds one of four quantitative measurements of that flower. The dataset is typically used to fit a classifier that predicts, given the four features, which of three species of iris a flower belongs to: setosa, versicolor, or virginica.

This dataset is so common in machine learning instruction that scikit-learn ships with a built-in function, load_iris, for loading it:

  1. Let's first import the module and then extract the dataset into a variable called iris:
# import the Iris dataset from scikit-learn
from sklearn.datasets import load_iris
# import our plotting module
import matplotlib.pyplot as plt
%matplotlib inline

# load the Iris dataset
iris = load_iris()
  2. Now, let's store the extracted data matrix and response variable in two new variables, iris_X and iris_y, respectively:
# create X and y variables to hold features and response column
iris_X, iris_y = iris.data, iris.target
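As a quick sanity check on the description above (150 observations, four features, three classes), the shapes and class balance can be inspected directly. This is a minimal sketch rather than part of the original walkthrough; the name class_counts is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
iris_X, iris_y = iris.data, iris.target

# 150 observations with 4 features each, and one integer label per observation
print(iris_X.shape)   # (150, 4)
print(iris_y.shape)   # (150,)

# number of observations for each class label (0, 1, 2)
class_counts = np.bincount(iris_y)  # illustrative name, not from the original text
print(class_counts)   # [50 50 50]
```

The three classes are perfectly balanced, so no single species dominates the plots that follow.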
  3. Let's take a look at the names of the flowers that we are trying to predict:
# the names of the flower we are trying to predict.
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='|S10')
  4. Along with the names of the flowers, we can also look at the names of the features that we are utilizing to make these predictions:
# Names of the features
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
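Because PCA is driven by how much each feature varies, it may help to pair the feature names with some simple per-column statistics. The following is a hedged sketch, not part of the original example; feature_means and feature_vars are illustrative names, and the sample variance (ddof=1) is an assumption about which variance estimate is wanted.

```python
from sklearn.datasets import load_iris

iris = load_iris()
iris_X = iris.data

# per-feature mean and sample variance (one value per column)
feature_means = iris_X.mean(axis=0)
feature_vars = iris_X.var(axis=0, ddof=1)

# print each feature name next to its statistics
for name, mean, var in zip(iris.feature_names, feature_means, feature_vars):
    print(f"{name:<20} mean={mean:.2f}  variance={var:.2f}")
```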
  5. To get a sense of what our data looks like, let's write some code that will display the data points for two of the four features:
# for labelling
label_dict = {i: k for i, k in enumerate(iris.target_names)}
# {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

def plot(X, y, title, x_label, y_label):
    ax = plt.subplot(111)
    # draw one scatter series per class, each with its own marker and color
    for label, marker, color in zip(
            range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
        plt.scatter(x=X[:, 0].real[y == label],
                    y=X[:, 1].real[y == label],
                    marker=marker,
                    color=color,
                    alpha=0.5,
                    label=label_dict[label])

    plt.xlabel(x_label)
    plt.ylabel(y_label)

    leg = plt.legend(loc='upper right', fancybox=True)
    leg.get_frame().set_alpha(0.5)
    plt.title(title)

plot(iris_X, iris_y, "Original Iris Data", "sepal length (cm)", "sepal width (cm)")

The preceding code produces a scatter plot of sepal length against sepal width, with each of the three species drawn in its own color.

Let us now perform a PCA of the iris dataset in order to obtain our principal components. Recall that this happens in four steps.
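The four steps are not restated in this excerpt. Assuming they are the usual ones (build the covariance matrix, compute its eigenvalues and eigenvectors, keep the top k eigenvectors, and project the data onto them), a minimal NumPy sketch might look like the following. The variable names, the choice of k = 2, the use of np.linalg.eigh, and the centering of the data before projection are all assumptions made for illustration, not necessarily what the rest of the walkthrough does.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_X = load_iris().data

# Step 1: covariance matrix of the four features (4 x 4)
mean_vector = iris_X.mean(axis=0)
centered_X = iris_X - mean_vector
cov_mat = np.cov(centered_X.T)

# Step 2: eigenvalues and eigenvectors of the covariance matrix.
# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so we re-sort them in descending order.
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 3: keep the top k eigenvectors (k = 2 is an assumed choice)
k = 2
explained_ratio = eig_vals / eig_vals.sum()
top_k_vecs = eig_vecs[:, :k]

# Step 4: project the centered data onto the kept eigenvectors
iris_2d = centered_X @ top_k_vecs

print(explained_ratio)  # share of total variance along each component
print(iris_2d.shape)    # (150, 2)
```

Each column of eig_vecs is one principal component, and explained_ratio shows how much of the total variance lies along it, which is one common way to decide how many components to keep.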
