In the previous chapter, you learned about techniques to explore data and perform feature engineering with NumPy, Pandas, and Scikit-learn. In this chapter, you will learn to use Matplotlib to visualize data. Data visualization helps you understand the characteristics of the features and the relationships between them during the data exploration phase, and it becomes particularly important when you are dealing with very large datasets that have several hundred features.
You can download the code files for this chapter from www.wiley.com/go/machinelearningawscloud
or from GitHub using the following URL:
Matplotlib is a plotting library for Python that can generate numerous types of plots and lets you customize them. It was created by John Hunter to provide Python users with a plotting library with capabilities similar to MATLAB, which is widely used for data visualization in the scientific community. The Matplotlib package has an extensive codebase and was designed to address the needs of a variety of users and provide capabilities at different levels of abstraction. Some users may want to simply present Matplotlib with some data in an array and ask for a specific type of plot (such as a scatter plot) to be created using as few commands as possible; these users may not want to control detailed attributes of the plot (such as positioning, scaling, line style, color). On the other hand, some users may want the ability to control every single attribute of a plot, down to the level of individual pixels.
For most common plotting tasks, you will use the pyplot
module within Matplotlib, which provides the highest level of abstraction. The pyplot
module implements a functional interface, powered by a state-machine design—you use functions to set up attributes of the plotting engine such as colors and fonts, and these apply to all subsequent plots until you issue commands to change them. Beneath the pyplot
level of abstraction is the object-oriented interface, which offers more flexibility.
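The difference between the two interfaces is easiest to see side by side. The following is a minimal sketch rather than code from the chapter's code files; the data values and titles are made up for illustration. The first half relies on pyplot's notion of a current figure, while the second half calls methods on explicit figure and axes objects.

```python
import matplotlib.pyplot as plt

# illustrative data (an assumption, not from the chapter's datasets)
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# pyplot (state-machine) interface: each call acts on the current figure
plt.figure(figsize=(5, 5))
plt.plot(x, y)
plt.title('pyplot interface')

# object-oriented interface: call methods on explicit objects
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(x, y)
ax.set_title('object-oriented interface')
```

Both halves draw the same curve; the object-oriented form simply makes it explicit which figure and axes each call affects.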
The conventional alias for the pyplot
module is plt
, the alias for the Matplotlib package is mpl
, and the alias for the Seaborn package is sns
. The following statements demonstrate how to import Matplotlib and Seaborn in a Python project:
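Using the conventional aliases, the imports might look like the following sketch. The try/except guard around Seaborn is an assumption added here so that the snippet also runs where Seaborn is not installed; a plain import seaborn as sns is the usual form.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# Seaborn is a separate package; this sketch tolerates its absence
try:
    import seaborn as sns
except ImportError:
    sns = None
```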
The reason to import both the pyplot
submodule and the matplotlib
package is to allow you to use functions from both the higher level of abstraction provided by pyplot
and the lower-level object-oriented API exposed by Matplotlib. If you are using Matplotlib in a Jupyter Notebook, you may also need to add the %matplotlib inline
statement before drawing any figures to ensure that plots render within the cells of the notebook.
Before looking at the components of a Matplotlib figure, let's examine the code to plot a simple curve using the pyplot
module. The following snippet plots the function y1 = 4x² + 2x + 2 and the function y2 = 3x + 4 for values of x between 1 and 7. Figure 3.1 depicts the figure generated by these statements.
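A minimal sketch of such a snippet, reconstructed from the description that follows, is shown here. The use of np.linspace() and the choice of 50 sample points are assumptions; the remaining calls are the ones discussed in the text.

```python
import numpy as np
import matplotlib.pyplot as plt

# x values between 1 and 7 (the number of samples, 50, is an assumption)
x = np.linspace(1, 7, 50)
y1 = 4 * x**2 + 2 * x + 2
y2 = 3 * x + 4

# create a 7 x 7 inch figure and draw both curves on it
plt.figure(figsize=(7, 7))
plt.plot(x, y1)
plt.plot(x, y2)

# label the axes and render the figure
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()
```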
The code snippet uses NumPy functions to generate ndarrays x
, y1
, and y2
. A Matplotlib figure object with dimensions 7 × 7 inches is then created using the statement plt.figure(figsize=(7,7))
. The first curve, y1 = 4x² + 2x + 2, is plotted using the plt.plot(x, y1)
statement; the inputs to the plot function are the arrays of x- and y-coordinates of the points to be plotted. The plt.xlabel('X values')
and plt.ylabel('Y values')
statements are for aesthetic purposes and add labels along the x- and y-axes of the plot, respectively. Finally, the plt.show()
statement is used to render the figure.
Regardless of the type of plot you make with Matplotlib or whether you use the high-level pyplot
interface or the lower-level object-oriented interface, there are common components and terminology associated with plots. This section presents an overview of these concepts. Figure 3.2 depicts the parts of a plot.
A figure object can be thought of as the entire diagram, with all the lines and text. You can think of it as the container; everything you plot using Matplotlib must belong to a figure. Figures are generally created using a pyplot
function, even if you then want to manipulate the figure with the lower-level object-oriented API. Use the following pyplot
command to create an empty figure:
plt.figure()
If you want the figure to have specific dimensions, you can provide a tuple with the width and height of the figure in inches:
plt.figure(figsize=(10,5))
Besides the figsize
attribute, the pyplot
figure function can accept various other attributes that you can use to customize aspects of the figure at the point of creation. You can find out more about these attributes at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html
.
The pyplot
figure function creates a figure and makes it the active figure on which subsequent drawing operations will have effect. The figure is not displayed until there is something drawn to it. If you want to use the object-oriented API to control aspects of the figure after it has been created, you need to store a reference to the figure object in your code and then use the object-oriented API with this reference. The following code snippet creates a figure using the pyplot figure()
function and then uses the object-oriented API to change the background color of the figure:
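A minimal sketch of this pattern follows. The variable name fig and the color 'lightgray' are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt

# create the figure with pyplot and keep a reference to the Figure object
fig = plt.figure(figsize=(7, 7))

# use the object-oriented API on the stored reference
# to change the background color of the figure
fig.set_facecolor('lightgray')
```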
An axes object is what you would normally consider as a plot. It is the actual graph with various characteristics commonly associated with plots, such as data points, markers, colors, scales, etc. A figure object can contain multiple axes, which in effect are multiple subplots within a larger diagram. The following snippet uses pyplot
to create a figure with a 2 × 2 grid of axes (subplots), and stores references to both the figure and the axes objects. Each axes object has its own title, which is different from the figure title. The object-oriented API is used to set the title for the figure and the four subplots within the figure. Figure 3.3 depicts the figure generated by executing the code snippet.
# create a figure with a 2 x 2 grid of axes objects
figure, axes_list = plt.subplots(2,2, figsize=(9,9))
# title for the figure object
figure.suptitle('This is the title of the figure')
# title for each axes object
axes_list[0,0].set_title('Subplot 0')
axes_list[0,1].set_title('Subplot 1')
axes_list[1,0].set_title('Subplot 2')
axes_list[1,1].set_title('Subplot 3')
The Matplotlib figure
and axes
classes are the primary entry points for working with the lower-level object-oriented API.
The axis object represents a dimension within a subplot. Two-dimensional plots have two axis objects: one for the horizontal direction and the other for the vertical direction. The axes
class, which is part of the object-oriented interface, provides several methods to modify attributes of the underlying x- and y-axis.
The axis label is displayed beneath (or beside) each axis of the plot. Axis labels can be configured by calling the set_xlabel()
and set_ylabel()
methods on an axes object. The following snippet demonstrates the use of these methods:
figure, axes = plt.subplots(figsize=(10,10))
axes.set_xlabel('Variable 1')
axes.set_ylabel('Variable 2')
If you do not want to use the object-oriented API exposed by the axes object, you can use the xlabel()
and ylabel()
functions from the pyplot
high-level interface that set the axis labels for the active plot. The use of these functions is demonstrated in the following snippet:
plt.figure(figsize=(7,7))
plt.xlabel('X values')
plt.ylabel('Y values')
A grid is a set of horizontal and vertical lines inside the plot area that helps in reading values. The axes object provides a method called grid()
that can be used to customize the appearance of the grid. You can find out more about the grid()
method of the axes
class at https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.grid.html
. The pyplot
module also provides a grid()
function that is identical to the similarly named method of the axes
class.
The following snippet demonstrates the use of the grid()
method of the axes
class:
figure, axes = plt.subplots(figsize=(10,10))
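# the first argument turns the grid on; it is passed positionally because its
# keyword name differs between Matplotlib releases ('b' in older, 'visible' in newer)
axes.grid(True, which='both', linestyle='--', linewidth=1)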
Figure 3.4 depicts two plots side by side, one with a grid and one without.
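A side-by-side comparison of this kind could be produced with a sketch along the following lines. The subplot titles and figure size are assumptions made for illustration, and the sketch relies on the grid being switched off in Matplotlib's standard style.

```python
import matplotlib.pyplot as plt

# one row, two columns of axes objects
figure, axes_list = plt.subplots(1, 2, figsize=(12, 6))

# left subplot: dashed grid lines
axes_list[0].grid(True, linestyle='--', linewidth=1)
axes_list[0].set_title('With grid')

# right subplot: grid left untouched (off in the standard style)
axes_list[1].set_title('Without grid')
```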
The title is a string that is displayed on top of a plot. Both the figure object and the axes objects within the figure can have titles. Titles are displayed, by default, at the top of the figure or axes object, center aligned. The figure title can be changed using the suptitle()
method of the figure
class, or the suptitle()
function of the pyplot
module. These functions are identical to each other in syntax and function. You can find more information on the pyplot suptitle()
function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.suptitle.html#matplotlib.pyplot.suptitle
.
The following snippet demonstrates the use of the suptitle()
method of the figure
class, and the suptitle()
function of the pyplot
module:
# create a figure with one axes object.
# Call the suptitle() method on the figure instance.
figure, axes = plt.subplots(figsize=(10,10))
figure.suptitle('Figure Title')
# create a new figure with one axes object
# use the pyplot suptitle() function
plt.figure(figsize=(7,7))
plt.suptitle('Figure Title')
The title of the axes object can be changed by calling the set_title()
method on the axes object. You can find more information on using the set_title()
method at https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_title.html#matplotlib.axes.Axes.set_title
.
The pyplot
module provides a convenience function called title()
, which operates on the active axes object and has the same signature as the set_title()
method of the axes
class. The following snippet demonstrates the use of the set_title()
method of the axes
class and the title()
function of the pyplot
module:
# create a figure with one axes object.
# Call the set_title() method on the axes instance.
figure, axes = plt.subplots(figsize=(10,10))
axes.set_title('Axes Title')
# create a new figure with one axes object
# use the pyplot title() function
plt.figure(figsize=(7,7))
plt.title('Axes Title')
In the previous section you learned the aspects of a typical Matplotlib plot, and that there are often different ways to configure a plot. The high-level pyplot
API provides a functional interface and the lower-level Matplotlib API operates using an object-oriented interface. In this section, you will learn to create common types of plots using Matplotlib.
A histogram is commonly used to visualize the distribution of a numeric variable. Histograms are not applicable when dealing with a categorical variable. The following snippet uses functions from the pyplot
module to generate a histogram of the Age
attribute of the popular Titanic dataset. The resulting figure is depicted in Figure 3.5.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# load the contents of a file into a pandas Dataframe
input_file = './datasets/titanic_dataset/original/train.csv'
df_titanic = pd.read_csv(input_file)
# set the index
df_titanic.set_index("PassengerId", inplace=True)
# use pyplot functions to plot a histogram of the 'Age' attribute
fig = plt.figure(figsize=(7,7))
plt.xlabel('Passenger Age')
plt.ylabel('Count')
plt.grid()
n, bins, patches = plt.hist(df_titanic['Age'], histtype='bar',
color='#0dc28d', align='mid',
rwidth=0.90, bins=7)
The pyplot hist()
function is used to create a histogram. The function takes several arguments. Some of the arguments that have been used in the preceding snippet are histtype='bar'
, which signifies that you want a standard histogram; rwidth=0.9
, which leaves space between the bars; and bins=7
, which is the number of bars. You can find a full list of parameters supported by the hist()
function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist
.
The appearance of the histogram is influenced by the binning strategy, which in turn is controlled by the value you pass into the bins
parameter. If you pass an integer (as in the preceding snippet), Matplotlib will create the specified number of equal-width bins and use the boundaries (edges) of the bins to determine the number of values in each bin. You can optionally specify the bin edges instead of the number of bins. In most cases you simply specify the number of bins.
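The two ways of specifying bins can be compared with a small sketch. The ages array is a hypothetical stand-in for the Age column, and the age bands are arbitrary values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical ages standing in for the Titanic 'Age' column
ages = np.array([2, 8, 15, 22, 22, 25, 29, 31, 35, 38, 40, 47, 54, 61, 70])

fig, axes_list = plt.subplots(1, 2, figsize=(10, 5))

# an integer asks for that many equal-width bins
n_equal, edges_equal, _ = axes_list[0].hist(ages, bins=4, rwidth=0.90)

# a sequence is interpreted as the bin edges themselves
age_bands = [0, 18, 40, 65, 100]
n_bands, edges_bands, _ = axes_list[1].hist(ages, bins=age_bands, rwidth=0.90)
```

Each bin includes its left edge and excludes its right edge, except for the last bin, which includes both.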
The pyplot hist()
function returns a tuple, (n, bins, patches)
. The first element of the tuple, n
, is an array with the counts for each bin. The second element, bins
, is an array of floating-point values that contains the bin edges, and the third element, patches
, is an array of rectangle objects that represent the low-level drawing primitives used by Matplotlib to make the bars. You can use the print()
function to inspect the contents of the tuple:
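A minimal sketch of such an inspection follows. The ages array is a hypothetical stand-in for df_titanic['Age'] so that the sketch runs without the dataset file.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical ages standing in for the Titanic 'Age' column
ages = np.array([2, 8, 15, 22, 22, 25, 29, 31, 35, 38, 40, 47, 54, 61, 70])

n, bins, patches = plt.hist(ages, histtype='bar', rwidth=0.90, bins=7)

print(n)             # counts for each of the 7 bins
print(bins)          # 8 bin edges (one more than the number of bins)
print(len(patches))  # number of rectangle objects, one per bar
```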
The following snippet uses pyplot
functions to create a figure with four subplots (axes objects) and uses the object-oriented API to create histograms of the same data in each of the subplots, but with different numbers of bins. The result of this snippet is depicted in Figure 3.6.
# use pyplot functions and matplotlib object-oriented API
# to plot multiple histograms of the same data, with different
# binning strategies.
fig, axes_list = plt.subplots(2,2, figsize=(11,11))
# plot a histogram with 3 bins
axes_list[0,0].set_xlabel('Passenger Age')
axes_list[0,0].set_ylabel('Count')
axes_list[0,0].grid()
n1, bins1, patches1 = axes_list[0,0].hist(df_titanic['Age'], histtype='bar',
color='#0dc28d', align='mid',
rwidth=0.90, bins=3)
# plot a histogram with 10 bins
axes_list[0,1].set_xlabel('Passenger Age')
axes_list[0,1].set_ylabel('Count')
axes_list[0,1].grid()
n2, bins2, patches2 = axes_list[0,1].hist(df_titanic['Age'], histtype='bar',
color='#0dc28d', align='mid',
rwidth=0.90, bins=10)
# plot a histogram with 30 bins
axes_list[1,0].set_xlabel('Passenger Age')
axes_list[1,0].set_ylabel('Count')
axes_list[1,0].grid()
n3, bins3, patches3 = axes_list[1,0].hist(df_titanic['Age'], histtype='bar',
color='#0dc28d', align='mid',
rwidth=0.90, bins=30)
# plot a histogram with 100 bins
axes_list[1,1].set_xlabel('Passenger Age')
axes_list[1,1].set_ylabel('Count')
axes_list[1,1].grid()
n4, bins4, patches4 = axes_list[1,1].hist(df_titanic['Age'], histtype='bar',
color='#0dc28d', align='mid',
rwidth=0.90, bins=100)
As you can infer, the binning strategy significantly affects the appearance of the histogram and the inferences that you can make from the histogram. There is no set rule for the number of bins that must be used; data scientists often try several binning strategies to reveal characteristics of the data that were not previously visible. A common rule of thumb is to set the number of bins to the square root of the number of values as a starting point, and then adjust as necessary. A commonly used approach in statistics to select the bin width for histograms was proposed in 1981 by David Freedman and Persi Diaconis and is known as the Freedman-Diaconis rule. The general idea is to set the bin width to 2 × IQR / n^(1/3), where IQR is the interquartile range of the values and n is the number of observations. Using this equation to compute the bin width, you can divide the range of values by the bin width to work out the number of bins. You can get more information on this rule from the original paper published in 1981 titled “On the histogram as a density estimator.” You can access a copy of the paper at https://statistics.stanford.edu/sites/g/files/sbiybj6031/f/EFS%20NSF%20159.pdf
.
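Both rules of thumb can be worked through with NumPy. The following is a sketch on synthetic data; the normally distributed sample and all variable names are assumptions made for illustration.

```python
import numpy as np

# synthetic sample standing in for a numeric attribute (an assumption)
rng = np.random.default_rng(0)
values = rng.normal(loc=30, scale=14, size=700)

# square-root rule: number of bins = sqrt(number of values), rounded up
num_bins_sqrt = int(np.ceil(np.sqrt(len(values))))

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(values, [75, 25])
iqr = q75 - q25
bin_width = 2 * iqr / len(values) ** (1 / 3)

# number of bins = range of the values divided by the bin width, rounded up
num_bins_fd = int(np.ceil((values.max() - values.min()) / bin_width))

print(num_bins_sqrt, bin_width, num_bins_fd)
```

NumPy and Matplotlib also accept bins='fd' in np.histogram_bin_edges() and hist(), which applies the Freedman-Diaconis estimator without computing the width by hand.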
The Pandas dataframe object also provides limited plotting capabilities. These capabilities are built on top of Matplotlib, but in some situations, you may find the Pandas plotting functions simpler to use. The following snippet shows how you could create a simple histogram using the Pandas dataframe plot function:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
# load the contents of a file into a pandas Dataframe
input_file = './datasets/titanic_dataset/original/train.csv'
df_titanic = pd.read_csv(input_file)
# set the index
df_titanic.set_index("PassengerId", inplace=True)
fig = plt.figure(figsize=(7,7))
plt.xlabel('Passenger Age')
plt.ylabel('Count')
df_titanic['Age'].plot.hist(color='#0dc28d', align='mid',
rwidth=0.90, bins=7, grid=True)
Bar charts are commonly used when you are dealing with a categorical variable. Each bar in a bar chart represents some information about a categorical attribute, such as a count, mean, or other measure. Bar charts can be used with both nominal and ordinal categorical data. When plotting a bar chart for nominal categorical data it is common practice to order the bars so that the height (or length) of the bars increases in an orderly fashion.
This is possible because a category that contains nominal data does not have any inherent order between the values, and hence you can freely order the placement of the bars to create a visually pleasing figure. Continuing with the use of the Titanic dataset from the previous section, the following snippet creates a bar plot of the Embarked
attribute. The resulting bar chart is depicted in Figure 3.7.
# use pyplot functions to plot a bar chart of the 'Embarked' attribute
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
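# count each embarkation point (including missing values) and take the tick
# labels from the index of the counts so that labels and bars stay aligned
counts = df_titanic['Embarked'].value_counts(dropna=False)
values = counts.index.astype(str)
x_positions = np.arange(len(values))
plt.bar(x_positions, counts, align='center')
plt.xticks(x_positions, values)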
Plotting bar charts is significantly simpler with the Pandas dataframe plot function. The following snippet demonstrates how you can create the same bar chart using Pandas functions:
# use Pandas dataframe functions to plot a bar chart of the 'Embarked' attribute
fig = plt.figure(figsize=(7,7))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
df_titanic['Embarked'].value_counts(dropna=False).plot.bar(grid=True)
If you want to show information about different subgroups within each category, a grouped bar chart can be used. A grouped bar chart, as its name suggests, is a chart with groups of bars clustered together. Each bar group provides information on one category, and the length of bars within the group provides information on the individual subgroups within the category.
For example, a grouped bar chart could be used to visualize the distribution of the number of individuals who survived and the number who did not, for each embarkation point. The following snippet creates a grouped bar chart for the Embarked
attribute with two bars in each group. The resulting bar chart is depicted in Figure 3.8.
# a grouped bar chart for the Embarked attribute with
# two bars per group.
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
x_positions = np.arange(len(values))
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
bar_width = 0.35
plt.bar(x_positions, embarked_counts_survived, bar_width, color='#009bdb', label='Survived')
plt.bar(x_positions + bar_width, embarked_counts_not_survived, bar_width, color='#00974d', label='Not Survived')
# place each tick midway between the two bars of its group
plt.xticks(x_positions + bar_width / 2, values)
plt.legend()
plt.show()
The following snippet shows how to use the Pandas plotting functions to draw the same grouped bar chart:
# a grouped bar chart for the Embarked attribute with
# two bars per group.
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_survived.name = 'Survived'
embarked_counts_not_survived.name = 'Not Survived'
df = pd.concat([embarked_counts_survived, embarked_counts_not_survived], axis=1)
fig, axes = plt.subplots(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
df.plot.bar(grid=True, ax=axes, color=['#009bdb', '#00974d'])
A stacked bar chart provides another way to visualize the same information. Instead of having multiple bars in clustered groups, a stacked bar chart uses one bar per categorical value and splits the bar to depict the distribution of subgroups within the category.
The following snippet creates a stacked bar chart for the Embarked
attribute showing the distribution of survivors from each embarkation point. The resulting bar chart is depicted in Figure 3.9.
# a stacked bar chart for the Embarked attribute
# showing the number of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
x_positions = np.arange(len(values))
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
plt.bar(x_positions, embarked_counts_survived, color='#009bdb', label='Survived')
plt.bar(x_positions, embarked_counts_not_survived, color='#00974d', label='Not Survived', bottom=embarked_counts_survived)
plt.xticks(x_positions, values)
plt.legend()
plt.show()
Creating a stacked bar chart with Pandas plotting functions is simply a matter of adding the stacked=True
attribute to the argument list of the dataframe.plot.bar()
function. The following snippet uses Pandas plotting functions to draw the same stacked bar chart:
# a stacked bar chart for the Embarked attribute
# showing the number of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_survived.name = 'Survived'
embarked_counts_not_survived.name = 'Not Survived'
df = pd.concat([embarked_counts_survived, embarked_counts_not_survived], axis=1)
fig, axes = plt.subplots(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
df.plot.bar(stacked=True, grid=True, ax=axes, color=['#009bdb', '#00974d'])
The power of Pandas plotting functions over pyplot
and Matplotlib is evident with stacked bar charts, especially if you have more than two groups per bar. With Matplotlib and pyplot
functions, you will have to plot each group on top of the other using multiple plt.bar()
statements. With Pandas, all you need to do is get your data in a dataframe and make a single call to dataframe.plot.bar()
.
If you want to show the percentage contribution of each subgroup within the categories, you can use a stacked percentage bar chart. The bars in a stacked percentage bar chart are all the same height. The following snippet creates a stacked percentage bar chart for the Embarked
attribute showing the percentage of survivors from each embarkation point. The resulting bar chart is depicted in Figure 3.10.
# a stacked percentage bar chart for the Embarked attribute
# showing the percentage of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
values = df_titanic['Embarked'].dropna().unique()
counts = df_titanic['Embarked'].value_counts(dropna=True)
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_survived_percent = embarked_counts_survived / counts * 100
embarked_counts_not_survived_percent = embarked_counts_not_survived / counts * 100
x_positions = np.arange(len(values))
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Percentage')
plt.grid()
plt.bar(x_positions, embarked_counts_survived_percent, color='#009bdb', label='Survived')
plt.bar(x_positions, embarked_counts_not_survived_percent, color='#00974d', label='Not Survived', bottom=embarked_counts_survived_percent)
plt.xticks(x_positions, values)
plt.legend()
plt.show()
The following snippet creates an equivalent chart using Pandas plotting functions:
# a stacked percentage bar chart for the Embarked attribute
# showing the percentage of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
values = df_titanic['Embarked'].dropna().unique()
counts = df_titanic['Embarked'].value_counts(dropna=True)
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_survived_percent = embarked_counts_survived / counts * 100
embarked_counts_not_survived_percent = embarked_counts_not_survived / counts * 100
embarked_counts_survived_percent.name = 'Survived'
embarked_counts_not_survived_percent.name = 'Not Survived'
df = pd.concat([embarked_counts_survived_percent, embarked_counts_not_survived_percent], axis=1)
fig, axes = plt.subplots(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Percentage')
df.plot.bar(stacked=True, grid=True, ax=axes, color=['#009bdb', '#00974d'])
Pie charts are an alternative to bar charts and can be used to plot the proportion of unique values in a categorical attribute. The pyplot
module provides the pie()
function that can be used to create a pie chart. The pie function of the pyplot
module uses the underlying pie()
method of the axes
class. You can access the documentation for the pyplot pie()
function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pie.html
.
The following snippet uses the pyplot pie()
function on the contents of the Embarked
column of the Titanic dataset to create a pie chart that depicts the percentage of passengers boarding from each embarkation point. Figure 3.11 depicts the resulting pie chart.
# use pyplot functions to plot a pie chart of the 'Embarked' attribute
fig = plt.figure(figsize=(9,9))
counts = df_titanic['Embarked'].value_counts(dropna=True)
total_embarked = counts.values.sum()
counts_percentage = counts / total_embarked * 100
# take the wedge labels from the index of the counts so that
# each label stays aligned with its wedge
plt.pie(counts_percentage.values,
labels=counts_percentage.index,
autopct='%1.1f%%', shadow=True, startangle=90)
You can also use the Pandas dataframe plot.pie()
function to create pie charts. If your data is already in a Pandas dataframe, this approach will require significantly less code. The downside to using the Pandas plotting function is the reduced level of options to customize the pie chart. The following snippet uses Pandas plotting functions to make the same pie chart:
# use Pandas functions to plot a pie chart of the 'Embarked' attribute
fig = plt.figure(figsize=(7,7))
df_titanic['Embarked'].value_counts(dropna=True).plot.pie()
When the target attribute of a dataset is binary, a pie chart can help convey the distribution of values in each categorical attribute, grouped by the target binary attribute. The following snippet uses Matplotlib's Axes.pie()
method to create three pie charts showing the percentage of survivors from the three embarkation ports in the Titanic dataset. Figure 3.12 depicts the resulting pie charts.
# three pie charts, showing the proportion
# of survivors for each embarkation point
# S = Southampton
# C = Cherbourg
# Q = Queenstown
S_df = df_titanic[df_titanic['Embarked'] == 'S']
C_df = df_titanic[df_titanic['Embarked'] == 'C']
Q_df = df_titanic[df_titanic['Embarked'] == 'Q']
S_Total_Embarked = S_df['Embarked'].count()
S_Survived_Count = S_df[S_df['Survived']==1].Embarked.count()
S_Survived_Percentage = S_Survived_Count / S_Total_Embarked * 100
S_Not_Survived_Percentage = 100.0 - S_Survived_Percentage
C_Total_Embarked = C_df['Embarked'].count()
C_Survived_Count = C_df[C_df['Survived']==1].Embarked.count()
C_Survived_Percentage = C_Survived_Count / C_Total_Embarked * 100
C_Not_Survived_Percentage = 100.0 - C_Survived_Percentage
Q_Total_Embarked = Q_df['Embarked'].count()
Q_Survived_Count = Q_df[Q_df['Survived']==1].Embarked.count()
Q_Survived_Percentage = Q_Survived_Count / Q_Total_Embarked * 100
Q_Not_Survived_Percentage = 100.0 - Q_Survived_Percentage
fig, axes = plt.subplots(1, 3, figsize=(16,4))
Wedge_Labels = ['Survived', 'Not Survived']
S_Wedge_Sizes = [S_Survived_Percentage, S_Not_Survived_Percentage]
C_Wedge_Sizes = [C_Survived_Percentage, C_Not_Survived_Percentage]
Q_Wedge_Sizes = [Q_Survived_Percentage, Q_Not_Survived_Percentage]
axes[0].pie(S_Wedge_Sizes, labels=Wedge_Labels, autopct='%1.1f%%', shadow=True, startangle=90)
axes[0].set_title('Southampton')
axes[1].pie(C_Wedge_Sizes, labels=Wedge_Labels, autopct='%1.1f%%', shadow=True, startangle=90)
axes[1].set_title('Cherbourg')
axes[2].pie(Q_Wedge_Sizes, labels=Wedge_Labels, autopct='%1.1f%%', shadow=True, startangle=90)
axes[2].set_title('Queenstown')
The same pie charts can be created using far less code if you were to use the Pandas plotting functions, at the expense of loss in customizability. The following snippet will generate the same chart using Pandas plotting functions:
# three pie charts, showing the proportion
# of survivors for each embarkation point
# S = Southampton
# C = Cherbourg
# Q = Queenstown
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
values = df_titanic['Embarked'].dropna().unique()
counts = df_titanic['Embarked'].value_counts(dropna=True)
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_survived_percent = embarked_counts_survived / counts * 100
embarked_counts_not_survived_percent = embarked_counts_not_survived / counts * 100
embarked_counts_survived_percent.name = 'Survived'
embarked_counts_not_survived_percent.name = 'Not Survived'
df = pd.concat([embarked_counts_survived_percent, embarked_counts_not_survived_percent], axis=1)
df.T.plot.pie(sharex=True, subplots=True, figsize=(16, 4))
A box plot provides a way to view the distribution of a numerical attribute. It was created in 1969 by John Tukey. A box plot is used to find out whether the values of the attribute are symmetrically distributed, to gauge the overall spread of the values, and to identify outliers. A box plot provides information on five statistical quantities of the attribute values: the minimum, the first quartile, the median, the third quartile, and the maximum.
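The five-number summary that underlies a box plot can be computed directly with NumPy, which is a useful way to check what a plot is showing. The following is a sketch on a hypothetical array of ages rather than the Titanic data.

```python
import numpy as np

# hypothetical ages standing in for the Titanic 'Age' column
ages = np.array([2, 8, 15, 22, 22, 25, 29, 31, 35, 38, 40, 47, 54, 61, 70])

# minimum, first quartile, median, third quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(ages, [0, 25, 50, 75, 100])

# the interquartile range is the height of the box
iqr = q3 - q1

print(minimum, q1, median, q3, maximum, iqr)
```

Note that, by default, Matplotlib draws the whiskers out to the most extreme data points that lie within 1.5 × IQR of the box and plots anything beyond them as individual outlier points, so the whisker ends coincide with the minimum and maximum only when there are no such outliers.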
The pyplot
module provides the boxplot()
function that can be used to create a box plot. The boxplot()
function of the pyplot
module uses the underlying boxplot()
method of the axes
class. You can access the documentation for the pyplot boxplot()
function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html
.
The following snippet uses the pyplot boxplot()
function on the contents of the Age
column of the Titanic dataset to create a box plot that depicts the spreads of the values of the attribute. Figure 3.13 depicts the resulting box plot.
# use pyplot functions to create a box plot of the 'Age' attribute
fig , axes = plt.subplots(figsize=(9,9))
box_plot = plt.boxplot(df_titanic['Age'].dropna())
axes.set_xticklabels(['Age'])
The following snippet will generate the same box plot using Pandas plotting functions:
df_titanic.boxplot(column = 'Age', figsize=(9,9), grid=False);
The compact nature of box plots makes them extremely useful for comparing the distributions of different numerical attributes, or the distributions of subgroups within a numerical attribute. The following snippet uses pyplot functions to create two box plots of the Age attribute, one for those who survived the Titanic disaster, and the other for those who did not. Figure 3.14 depicts the resulting box plots.
# compare box plots of the Age attribute for those who survived
# against those who did not.
# drop only the rows with a missing Age; a bare dropna() would also
# discard every row that has a missing value in any other column.
survived_df = df_titanic[df_titanic['Survived']==1].dropna(subset=['Age'])
not_survived_df = df_titanic[df_titanic['Survived']==0].dropna(subset=['Age'])
fig, axes = plt.subplots(figsize=(9,9))
box_plot = plt.boxplot([survived_df['Age'], not_survived_df['Age']],
                       labels=['Survived', 'Not Survived'])
The following snippet will generate the same box plot using Pandas plotting functions:
df_titanic.boxplot(column = 'Age', by = 'Survived', figsize=(9,9), grid=False);
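The numbers behind a grouped box plot can be inspected with a Pandas groupby. The following is a minimal sketch that uses a tiny, hypothetical dataframe (df_demo and its values are made up for illustration and only mimic the Survived and Age columns of the Titanic dataset).

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data.
df_demo = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 0, 0, 0],
    'Age': [4.0, 28.0, 36.0, 21.0, 30.0, None, 47.0],
})

# describe() ignores missing values and reports, per group, the count,
# minimum, quartiles, and maximum that a box plot visualizes.
summary = df_demo.groupby('Survived')['Age'].describe()
print(summary[['count', 'min', '25%', '50%', '75%', 'max']])
```

The '50%' column is the median, which corresponds to the line drawn inside each box.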
A scatter plot is a 2D plot that plots two continuous numeric attributes against each other. One attribute is plotted along the x-axis and the other is plotted along the y-axis. Scatter plots are a collection of points and are typically used to visualize the correlation between variables, with each point of the scatter plot representing the values of the two variables for one observation. Scatter plots can also be used to visualize the grouping of data. The following snippet creates a scatter plot of the Age and Fare attributes from the Titanic dataset after imputing missing Age values with the median age and missing Fare values with the mean fare. Figure 3.15 depicts the resulting scatter plot.
# impute missing values:
# Age with the median age
# Fare with the mean fare
median_age = df_titanic['Age'].median()
# assign the result back instead of calling fillna(inplace=True) on the
# column, which is deprecated in recent versions of Pandas
df_titanic['Age'] = df_titanic['Age'].fillna(median_age)
mean_fare = df_titanic['Fare'].mean()
df_titanic['Fare'] = df_titanic['Fare'].fillna(mean_fare)
# use pyplot functions to create a scatter plot of the 'Age' and 'Fare' attributes
fig, axes = plt.subplots(figsize=(9,9))
plt.xlabel('Age')
plt.ylabel('Fare')
plt.grid()
plt.scatter(df_titanic['Age'], df_titanic['Fare'])
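The effect of the imputation step can be checked on a small example before applying it to the full dataset. The following is a minimal sketch on a hypothetical four-row dataframe (df_demo and its values are illustrative only). It fills the missing Age with the median and the missing Fare with the mean, then confirms that no missing values remain.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df_demo = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Fare': [7.25, 8.05, np.nan, 71.30],
})

# Impute Age with the median and Fare with the mean of the observed values.
df_demo['Age'] = df_demo['Age'].fillna(df_demo['Age'].median())
df_demo['Fare'] = df_demo['Fare'].fillna(df_demo['Fare'].mean())

print(df_demo)
print('Missing values remaining:', int(df_demo.isna().sum().sum()))
```

Note that the mean fare is pulled upward by the single large value, whereas the median age is not affected by extreme values in the same way; this is one reason the median is often preferred for skewed attributes.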
A Pandas dataframe provides the plot.scatter() function, which can be used to create a scatter plot. The following snippet demonstrates the use of this function to create an equivalent scatter plot:
df_titanic.plot.scatter(x='Age', y='Fare', figsize=(9,9))
You can use a scatter plot to get a visual indicator of the degree of correlation between two attributes. It is common to create a matrix of scatter plots of each attribute in the dataset against every other attribute; this, however, is only practical for a small number of attributes. Figure 3.16 shows the scatter plot of attributes that have the ideal strong positive and strong negative correlation. The ideal strong positive correlation would occur when most of the points lie along a straight line from the bottom-left corner to the top-right corner of the plot. The ideal strong negative correlation would occur when most of the points lie along a straight line from the top-left corner to the bottom-right corner of the plot.
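A visual impression of correlation can be backed up with a number. The following is a minimal sketch that uses NumPy's corrcoef() function to compute the Pearson correlation coefficient for synthetic data (the arrays are made up for illustration). Points that lie exactly on a rising straight line give a coefficient of 1, and points on a falling straight line give a coefficient of -1.

```python
import numpy as np

# Synthetic attributes: y_pos rises with x, y_neg falls with x.
x = np.arange(10, dtype=float)
y_pos = 2.0 * x + 1.0
y_neg = -3.0 * x + 5.0

# corrcoef() returns a 2 x 2 correlation matrix; the off-diagonal
# element is the correlation between the two inputs.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print('ideal positive correlation:', round(r_pos, 3))
print('ideal negative correlation:', round(r_neg, 3))
```

Real attributes rarely fall on a perfect line, so the coefficient usually lies somewhere between these two extremes.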
The following snippet presents a function that can be used to generate a scatter plot matrix out of the contents of a Pandas dataframe. The function takes three arguments. The first is the dataframe object, the second is the height of the figure (in inches), and the third is the width of the figure (in inches):
# Generates an M x M scatter plot matrix of subplots.
def generate_scatterplot_matrix(df_input, size_h, size_w):
    num_points, num_attributes = df_input.shape
    # figsize expects (width, height) in inches
    fig, axes = plt.subplots(num_attributes, num_attributes, figsize=(size_w, size_h))
    column_names = df_input.columns.values
    for row in range(num_attributes):
        for col in range(num_attributes):
            ax = axes[row, col]
            # the subplot spec reports the position of the subplot in the grid
            # (Axes.is_first_col() and similar were removed in Matplotlib 3.6)
            spec = ax.get_subplotspec()
            # x-axis: attribute of this column; y-axis: attribute of this row
            ax.scatter(df_input.iloc[:, col], df_input.iloc[:, row])
            # hide the ticks by default
            ax.xaxis.set_visible(False)
            ax.yaxis.set_visible(False)
            # Set up ticks only on one side for the "edge" subplots
            if spec.is_first_col():
                ax.yaxis.set_ticks_position('left')
                ax.yaxis.set_visible(True)
                ax.set_ylabel(column_names[row])
            if spec.is_last_col():
                ax.yaxis.set_ticks_position('right')
                ax.yaxis.set_visible(True)
            if spec.is_first_row():
                ax.xaxis.set_ticks_position('top')
                ax.xaxis.set_visible(True)
            if spec.is_last_row():
                ax.xaxis.set_ticks_position('bottom')
                ax.xaxis.set_visible(True)
                ax.set_xlabel(column_names[col])
    return fig, axes
To see a scatter plot matrix, let's use the generate_scatterplot_matrix() function on the popular Iris dataset. Recall from Chapter 2 that Scikit-learn contains a toy version of the Iris dataset. The dataset contains the lengths and widths of the sepals and petals of iris flowers. The following snippet loads the Iris dataset into a dataframe and uses the generate_scatterplot_matrix() function to create a scatter plot matrix. The resulting figure is depicted in Figure 3.17.
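# import the datasets submodule explicitly; a bare "import sklearn"
# does not guarantee that sklearn.datasets is loaded
import sklearn.datasets
iris = sklearn.datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)
generate_scatterplot_matrix(df_iris, 20, 20)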
As you can see, Matplotlib does not have a built-in function to create a scatter plot matrix. The Pandas plotting module has a function called scatter_matrix() that can be used to generate a scatter plot matrix from a dataframe. The following snippet demonstrates the use of the scatter_matrix() function:
import sklearn.datasets
import pandas.plotting
iris = sklearn.datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)
pandas.plotting.scatter_matrix(df_iris, figsize=(12, 12))
Scatter plots can also be used to visualize clusters within data. The following snippet creates a synthetic dataset of x, y values in four clusters and plots all the values in a scatter plot. The synthetic data is created using Scikit-learn's make_blobs() function. You can learn more about this function at https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html. Figure 3.18 depicts the resulting scatter plot.
# scatter plots can also be used to visualize
# groups within data. This is illustrated below
# using a synthetic dataset
from sklearn.datasets import make_blobs
coordinates, clusters = make_blobs(n_samples = 500, n_features = 2, centers=4, random_state=12)
coordinates_cluster1 = coordinates[clusters==0]
coordinates_cluster2 = coordinates[clusters==1]
coordinates_cluster3 = coordinates[clusters==2]
coordinates_cluster4 = coordinates[clusters==3]
fig , axes = plt.subplots(figsize=(9,9))
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.grid()
plt.scatter(coordinates_cluster1[:,0], coordinates_cluster1[:,1])
plt.scatter(coordinates_cluster2[:,0], coordinates_cluster2[:,1])
plt.scatter(coordinates_cluster3[:,0], coordinates_cluster3[:,1])
plt.scatter(coordinates_cluster4[:,0], coordinates_cluster4[:,1])
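The four separate calls to scatter() above can be collapsed into one by passing the cluster labels as the color argument. The following is a minimal sketch of that alternative, assuming the same make_blobs() parameters as above; the choice of the 'viridis' colormap and the legend title are illustrative assumptions, not part of the original example.

```python
import matplotlib
# Optional: select a non-interactive backend so the sketch runs without a
# display. Remove this line when working in a notebook.
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

coordinates, clusters = make_blobs(n_samples=500, n_features=2, centers=4, random_state=12)

fig, axes = plt.subplots(figsize=(9, 9))
# c=clusters colors each point by its cluster label in a single call
points = axes.scatter(coordinates[:, 0], coordinates[:, 1], c=clusters, cmap='viridis')
axes.set_xlabel('X Values')
axes.set_ylabel('Y Values')
axes.grid()

# legend_elements() builds one legend entry per distinct label
handles, labels = points.legend_elements()
axes.legend(handles, labels, title='Cluster')
```

This form scales to any number of clusters without adding lines of code, at the cost of less direct control over the color of each individual cluster.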
The pyplot module within Matplotlib provides a high-level functional plotting interface.