PCA with the Iris dataset – manual example

The iris dataset consists of 150 rows and four columns. Each row (observation) represents a single flower, while each column (feature) represents one of four quantitative characteristics of that flower. The dataset is typically used to fit a classifier that predicts which of three types of iris a flower belongs to, given the four features. Each flower is labeled as a setosa, a versicolor, or a virginica.

This dataset is so common in machine learning instruction that scikit-learn ships with a built-in function, load_iris, for loading it.

Let's first import the function and then load the dataset into a variable called iris:

# import the Iris dataset from scikit-learn
from sklearn.datasets import load_iris
# import our plotting module
import matplotlib.pyplot as plt
# Jupyter/IPython magic that renders plots inline; omit it in a plain Python script
%matplotlib inline

# load the Iris dataset
iris = load_iris()

Now, let's store the extracted data matrix and response variable in two new variables, iris_X and iris_y, respectively:

# create X and y variables to hold features and response column
iris_X, iris_y = iris.data, iris.target
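As a quick sanity check on the description above, the following minimal sketch confirms the dimensions of the data and counts how many flowers fall into each class. The name class_counts is introduced here for illustration only and is not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_iris

# reload the data so this snippet runs on its own
iris = load_iris()
iris_X, iris_y = iris.data, iris.target

# 150 observations, each described by four features
print(iris_X.shape)  # (150, 4)
print(iris_y.shape)  # (150,)

# the response holds the integer labels 0, 1, and 2, one per flower type;
# class_counts (a hypothetical name) holds the number of flowers per label
class_counts = np.bincount(iris_y)
print(dict(zip(iris.target_names, class_counts)))
```

The integer labels in iris_y index into iris.target_names, which is why the plotting code later in this section can build a label-to-name dictionary with enumerate.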

Let's take a look at the names of the flowers that we are trying to predict:

# the names of the flower we are trying to predict.
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='|S10')

Along with the names of the flowers, we can also look at the names of the features that we are utilizing to make these predictions:

# Names of the features
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

To get a sense of what our data looks like, let's write some code that displays the data points for two of the four features (the first two columns, sepal length and sepal width):

# map each numeric class label to its flower name, for the legend
label_dict = {i: k for i, k in enumerate(iris.target_names)}
# {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

def plot(X, y, title, x_label, y_label):
    # scatter the first two columns of X, one marker and color per class
    ax = plt.subplot(111)
    for label, marker, color in zip(
            range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
        # .real drops any imaginary part (a no-op for real-valued data)
        plt.scatter(x=X[:, 0].real[y == label],
                    y=X[:, 1].real[y == label],
                    color=color,
                    marker=marker,
                    alpha=0.5,
                    label=label_dict[label])

    plt.xlabel(x_label)
    plt.ylabel(y_label)

    leg = plt.legend(loc='upper right', fancybox=True)
    leg.get_frame().set_alpha(0.5)
    plt.title(title)

plot(iris_X, iris_y, "Original Iris Data", "sepal length (cm)", "sepal width (cm)")

The preceding code produces a scatter plot of sepal length against sepal width, with each of the three iris types drawn in its own color.

Let us now perform a PCA of the iris dataset in order to obtain our principal components. Recall that this happens in four steps.
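As a rough preview, here is a minimal NumPy sketch of a manual PCA. It assumes the four steps are the usual ones: build the covariance matrix of the features, compute its eigenvalues and eigenvectors, keep the top k eigenvectors, and project the data onto them. The choice of k = 2, the use of np.linalg.eigh (which suits symmetric matrices and returns real values), the mean-centering before projection, and all variable names are assumptions made for this illustration, not necessarily the exact approach taken in the walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_X = load_iris().data

# Step 1: covariance matrix of the four features (np.cov expects variables in rows)
mean_vector = iris_X.mean(axis=0)
cov_mat = np.cov(iris_X.T)  # shape (4, 4)

# Step 2: eigenvalues and eigenvectors of the covariance matrix
# eigh returns eigenvalues in ascending order, so re-sort them in descending order
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 3: keep the top k eigenvectors (k = 2 is an illustrative choice)
k = 2
top_k_vecs = eig_vecs[:, :k]
explained_ratio = eig_vals / eig_vals.sum()
print(explained_ratio)  # share of total variance captured by each component

# Step 4: project the centered data onto the kept eigenvectors
iris_projected = (iris_X - mean_vector) @ top_k_vecs
print(iris_projected.shape)  # (150, 2)
```

The columns of iris_projected are uncorrelated, and their variances equal the two largest eigenvalues, which is what makes the principal components a compact summary of the original four features.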
