The Iris dataset

We are going to apply logistic regression to the iris dataset. This is a classic dataset containing information about flowers from three closely related species: setosa, virginica, and versicolor. The species is going to be our dependent variable, that is, its three values are the classes we want to predict. We have 50 individuals of each species, and for each individual the dataset contains four variables that we are going to use as the independent variables (or features): petal length, petal width, sepal length, and sepal width. In case you are wondering, sepals are modified leaves whose function is generally related to protecting the flowers in a bud. We can load a data frame with the iris dataset by doing the following:

import pandas as pd

iris = pd.read_csv('../data/iris.csv')
iris.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
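To make the split between independent and dependent variables concrete, here is a minimal sketch that separates the four feature columns from the species column. It assumes nothing beyond the column names shown above; the small `sample` data frame is just a stand-in built from the five rows printed by iris.head(), and the names `sample`, `feature_names`, `X`, `y`, and `class_counts` are illustrative choices, not something the dataset defines.

```python
import pandas as pd

# Stand-in for the full CSV: only the five rows printed by iris.head() above.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

X = sample[feature_names]  # independent variables (features)
y = sample["species"]      # dependent variable (the classes to predict)

# Count how many rows belong to each class.
class_counts = y.value_counts()

print(X.shape)
print(class_counts)
```

Running the same lines on the full iris data frame instead of `sample` should report 150 rows and 50 cases for each of the three species, matching the description above.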

Now, we will plot the three species versus sepal_length using the stripplot function from seaborn:

import seaborn as sns

sns.stripplot(x="species", y="sepal_length", data=iris, jitter=True)

Figure 4.2

In Figure 4.2, the y axis is continuous while the x axis is categorical. The dispersion (or jitter) of the points along the x axis has no meaning at all; it is just a trick we add with the jitter argument to avoid having all the points collapsed onto a single line. Try setting the jitter argument to False to see what I mean. The only thing that matters when reading the x axis is which class each point belongs to: setosa, versicolor, or virginica. You may also try other plots for this data, such as violin plots, which are also available as one-liners with seaborn.
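To see why the horizontal spread carries no information, here is a rough sketch of the idea behind jitter. This is not seaborn's actual implementation, only an illustration of the trick: each point keeps its y value and gets a small random horizontal offset around the position of its category. The values are the five sepal lengths shown in the table above; `jitter_width` and the seed are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the offsets are reproducible

# sepal_length for the five rows shown by iris.head(); all of them are setosa
sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Every point of the same class shares one position on the categorical axis.
x_category = np.zeros_like(sepal_length)

# Hypothetical half-width of the horizontal noise band.
jitter_width = 0.1

# Jitter: add uniform noise to x only; the y values are left untouched.
x_jittered = x_category + rng.uniform(-jitter_width, jitter_width,
                                      size=sepal_length.size)

for x, y in zip(x_jittered, sepal_length):
    print(f"x = {x:+.3f}, y = {y}")
```

Without the noise term, all five points would sit at x = 0 and overlap whenever their y values are close, which is exactly the collapsed line you get with jitter set to False.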

Another way to inspect the data is by doing a scatter matrix with the pairplot function. Since we have four features in the iris dataset, we get scatter plots arranged in a 4x4 grid. The grid is symmetrical, with the upper and lower triangles showing the same information (with the axes swapped). The scatter plots on the main diagonal would correspond to a variable against itself; given that such plots are not informative at all, we have replaced them with a KDE (kernel density estimate) for each feature. Inside each subplot, the three species (or classes) are represented with different colors, the same ones used in Figure 4.2:

sns.pairplot(iris, hue='species', diag_kind='kde')

Figure 4.3

Before continuing, take some time to study Figure 4.3 and try to get familiar with the iris dataset and how the features and classes are related.
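When studying the grid, it may help to know how many panels actually carry distinct information. The following sketch just does the counting for four features; the variable names are illustrative, and the feature names are the column names of the data frame shown earlier.

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

n_panels = len(features) ** 2   # every feature against every feature
n_diagonal = len(features)      # a feature against itself (shown as a KDE)

# Each unordered pair appears twice in the grid, once per triangle.
unique_pairs = list(combinations(features, 2))

print(n_panels, n_diagonal, len(unique_pairs))
for x_name, y_name in unique_pairs:
    print(f"{x_name} vs {y_name}")
```

So, of the 16 panels, 4 are the diagonal and the remaining 12 reduce to 6 distinct feature pairs, each drawn twice with the axes swapped. Recent versions of seaborn also accept a corner=True argument in pairplot, which draws only the lower triangle.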
