PCA with the Iris dataset – manual example

The iris dataset consists of 150 rows and four columns. Each row/observation represents a single flower, while the columns/features represent four different quantitative characteristics of that flower. The dataset is typically used to fit a classifier that predicts which of three types of iris a flower is, given the four features. Each flower is labeled as a setosa, a versicolor, or a virginica.

This dataset is so common in machine learning instruction that scikit-learn has a built-in function for loading it:

  1. Let's first import the module and then extract the dataset into a variable called iris:
# import the Iris dataset from scikit-learn
from sklearn.datasets import load_iris
# import our plotting module
import matplotlib.pyplot as plt
%matplotlib inline

# load the Iris dataset
iris = load_iris()
  2. Now, let's store the extracted data matrix and response variable in two new variables, iris_X and iris_y, respectively:
# create X and y variables to hold features and response column
iris_X, iris_y = iris.data, iris.target
  3. Let's take a look at the names of the flowers that we are trying to predict:
# the names of the flowers we are trying to predict
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='|S10')
  4. Along with the names of the flowers, we can also look at the names of the features that we use to make these predictions:
# names of the features
iris.feature_names

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
  5. To get a sense of what our data looks like, let's write some code that will display the data points of two of the four features:
# for labelling
label_dict = {i: k for i, k in enumerate(iris.target_names)}
# {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

def plot(X, y, title, x_label, y_label):
    ax = plt.subplot(111)
    # one marker/color pair per class label (0, 1, 2)
    for label, marker, color in zip(
            range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
        # scatter the first two columns of X for the rows in this class
        plt.scatter(x=X[:, 0].real[y == label],
                    y=X[:, 1].real[y == label],
                    color=color,
                    marker=marker,
                    label=label_dict[label])

    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    leg = plt.legend(loc='upper right', fancybox=True)

plot(iris_X, iris_y, "Original Iris Data", "sepal length (cm)", "sepal width (cm)")

The following is the output of the preceding code:

Let us now perform a PCA of the iris dataset in order to obtain our principal components. Recall that this happens in four steps.
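As a preview, here is a minimal sketch of a manual PCA in NumPy. It assumes the four steps are the usual ones: build the covariance matrix, compute its eigenvalues and eigenvectors, keep the top k eigenvectors, and project the data onto them. The variable names (cov_mat, top_k, iris_2d), the choice of k = 2, the use of np.linalg.eigh, and the mean-centering before projection are choices made for this illustration and are not taken from the text.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_X = load_iris().data  # data matrix, shape (150, 4)

# Step 1: covariance matrix of the four features, shape (4, 4)
mean_vector = iris_X.mean(axis=0)
cov_mat = np.cov(iris_X.T)

# Step 2: eigenvalues and eigenvectors of the covariance matrix.
# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so we re-sort them from largest to smallest.
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 3: keep the top k eigenvectors (columns), here k = 2 (illustrative)
k = 2
top_k = eig_vecs[:, :k]
explained = eig_vals / eig_vals.sum()  # share of variance per component

# Step 4: project the mean-centered data onto the kept eigenvectors
iris_2d = (iris_X - mean_vector) @ top_k

print(iris_2d.shape)
print(explained)
```

The resulting iris_2d array has two columns, so it can be passed to the plot function defined earlier in place of iris_X to view the data along its first two principal components.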

