The iris dataset consists of 150 rows and four columns. Each row (observation) represents a single flower, while each column (feature) records one of four quantitative measurements of that flower. The goal is to fit a classifier that predicts, from these four features, which of three types of iris a flower is: setosa, versicolor, or virginica.
This dataset is so common in machine learning instruction that scikit-learn ships with a built-in function for loading it:
- Let's first import the loader function and then load the dataset into a variable called iris:
# import the Iris dataset from scikit-learn
from sklearn.datasets import load_iris
# import our plotting module
import matplotlib.pyplot as plt
%matplotlib inline
# load the Iris dataset
iris = load_iris()
- Now, let's store the data matrix and the response variable in two new variables, iris_X and iris_y, respectively:
# create X and y variables to hold features and response column
iris_X, iris_y = iris.data, iris.target
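As a quick sanity check, the following minimal sketch confirms the dimensions described earlier and counts how many flowers belong to each class. The name class_counts is introduced here only for illustration and is not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
iris_X, iris_y = iris.data, iris.target

# 150 observations, four features each
print(iris_X.shape)  # (150, 4)
# one integer class label per observation
print(iris_y.shape)  # (150,)

# class_counts is an illustrative name: number of flowers per class label
class_counts = np.bincount(iris_y)
print(class_counts)  # [50 50 50]
```

The three classes are perfectly balanced, with 50 flowers each.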
- Let's take a look at the names of the flowers that we are trying to predict:
# the names of the flower we are trying to predict.
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='|S10')
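The response values are stored as the integer codes 0, 1, and 2, and each code is a position in target_names. A minimal sketch of decoding them back into flower names is shown here; the variable iris_y_names is an assumption made for illustration and does not appear elsewhere in this walkthrough.

```python
from sklearn.datasets import load_iris

iris = load_iris()
iris_y = iris.target

# index the array of names with the integer codes to get one name per flower
iris_y_names = iris.target_names[iris_y]

# the rows are ordered by class, so the first rows are setosa
# and the last rows are virginica
print(iris_y[:3], iris_y_names[:3])
print(iris_y[-3:], iris_y_names[-3:])
```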
- Along with the names of the flowers, we can also look at the names of the features that we are utilizing to make these predictions:
# Names of the features
iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
- To get a sense of what our data looks like, let's write some code that plots the data points for two of the four features:
# for labelling
label_dict = {i: k for i, k in enumerate(iris.target_names)}
# {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
def plot(X, y, title, x_label, y_label):
    # scatter the first two columns of X, one marker and color per class
    ax = plt.subplot(111)
    for label, marker, color in zip(
            range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
        plt.scatter(x=X[:, 0].real[y == label],
                    y=X[:, 1].real[y == label],
                    marker=marker,
                    color=color,
                    alpha=0.5,
                    label=label_dict[label])
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    leg = plt.legend(loc='upper right', fancybox=True)
    leg.get_frame().set_alpha(0.5)
    plt.title(title)
plot(iris_X, iris_y, "Original Iris Data", "sepal length (cm)", "sepal width (cm)")
The following is the output of the preceding code:
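The plot function above always draws columns 0 and 1 of whatever matrix it is given, so it shows only the two sepal measurements. To look at a different pair, one option is a small variant that takes the column indices as arguments. The following is only a sketch: plot_feature_pair is a hypothetical helper written for this illustration, not a function used elsewhere in this walkthrough.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
iris_X, iris_y = iris.data, iris.target

def plot_feature_pair(X, y, i, j, title):
    # hypothetical helper: scatter column i against column j,
    # using one marker and color per class
    fig, ax = plt.subplots()
    for label, marker, color in zip(
            range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
        mask = y == label
        ax.scatter(X[mask, i], X[mask, j],
                   marker=marker, color=color, alpha=0.5,
                   label=iris.target_names[label])
    # axis labels come straight from the dataset's feature names
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
    ax.set_title(title)
    ax.legend(loc='upper left', fancybox=True)
    return ax

# columns 2 and 3 hold the petal length and petal width
ax = plot_feature_pair(iris_X, iris_y, 2, 3, "Iris Data: Petal Features")
```

Returning the axes object lets the caller adjust the figure further, for example by changing the axis limits.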
Let us now perform a PCA of the iris dataset in order to obtain our principal components. Recall that this happens in four steps.