Chapter 3
Data Visualization with Python

WHAT'S IN THIS CHAPTER

Introduction to Matplotlib
Learn to create histograms, bar charts, and pie charts
Learn to create box plots and scatter plots
Learn to use Pandas plotting functions

In the previous chapter, you learned about techniques to explore data and perform feature engineering with NumPy, Pandas, and Scikit-learn. In this chapter you will learn to use Matplotlib to visualize data. Data visualization helps you understand the characteristics of the features, and the relationships between them, during the data exploration phase. It becomes particularly important when you are dealing with very large datasets that have several hundred features.

NOTE To follow along with this chapter ensure you have installed Anaconda Navigator and Jupyter Notebook as described in Appendix A.
You can download the code files for this chapter from www.wiley.com/go/machinelearningawscloud or from GitHub using the following URL:

https://github.com/asmtechnology/awsmlbook-chapter3.git

Introducing Matplotlib

Matplotlib is a plotting library for Python that offers functionality to generate numerous types of plots and the ability to customize these plots. It was created by John Hunter to provide Python users with a plotting library with capabilities similar to MATLAB, which is widely used for data visualization in the scientific community. The Matplotlib package has an extensive codebase and was designed to address the needs of a variety of users and provide capabilities at different levels of abstraction. Some users may want to simply present Matplotlib with some data in an array and ask for a specific type of plot (such as a scatter plot) to be created using as few commands as possible; these users may not want to control detailed attributes of the plot (such as positioning, scaling, line style, color). On the other hand, some users may want the ability to control every single attribute of a plot, down to the level of individual pixels.

For most common plotting tasks, you will use the pyplot module within Matplotlib, which provides the highest level of abstraction. The pyplot module implements a functional interface, powered by a state-machine design—you use functions to set up attributes of the plotting engine such as colors and fonts, and these apply to all subsequent plots until you issue commands to change them. Beneath the pyplot level of abstraction is the object-oriented interface, which offers more flexibility.

The conventional alias for the pyplot module is plt, the alias for the Matplotlib package is mpl, and the alias for the Seaborn package is sns. The following statements demonstrate how to import Matplotlib and Seaborn in a Python project:
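The import statements themselves do not appear in the text above. A minimal sketch using the aliases just described is shown here; it assumes that Seaborn has been installed as a separate third-party package, and the try/except guard is an illustrative addition so that the sketch still runs where Seaborn is missing:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# Seaborn is a separate package. The guard below is an assumption made for
# this sketch, for environments where Seaborn has not been installed.
try:
    import seaborn as sns
except ImportError:
    sns = None

# confirm which version of Matplotlib is in use
print(mpl.__version__)
```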

The reason to import both the pyplot submodule and the matplotlib package is to allow you to use functions from both the higher level of abstraction provided by pyplot and the lower-level object-oriented API exposed by Matplotlib. If you are using Matplotlib in a Jupyter Notebook, you must also add the %matplotlib inline statement before drawing any figures to ensure that plots render within the cells of the notebook.

Before looking at the components of a Matplotlib figure, let's examine the code to plot a simple curve using the pyplot module. The following snippet plots the function y₁=4x² + 2x + 2 and the function y₂ = 3x + 4 for values of x between 1 and 7. Figure 3.1 depicts the figure generated by these statements.

Graph depicts plotting two curves using Matplotlib, with X values on x axis and Y values on y axis. — **FIGURE 3.1** Plotting two curves using Matplotlib

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
 
import numpy as np
import pandas as pd
 
# prepare x values 
x = np.linspace(1, 7, 10)
 
# prepare y1 and y2
y1 = 4*x*x + 2*x + 2
y2 = 3*x + 4
 
# create a new figure
plt.figure(figsize=(7,7))
 
# plot y1 = 4*x*x + 2*x + 2
plt.plot(x, y1)
 
# plot y2 = 3*x + 4
plt.plot(x, y2)
 
# set up axis labels
plt.xlabel('X values')
plt.ylabel('Y values')
 
# show the plot.
plt.show()

The code snippet uses NumPy functions to generate ndarrays x, y1, and y2. A Matplotlib figure object with dimensions 7 × 7 inches is then created using the statement plt.figure(figsize=(7,7)). The first curve y₁=4x² + 2x + 2 is plotted using the plt.plot(x, y1) statement; the inputs to the plot function are two arrays that hold the x- and y-coordinates of the points to be plotted. The second curve, y₂ = 3x + 4, is added to the same figure by the plt.plot(x, y2) statement. The plt.xlabel('X values') and plt.ylabel('Y values') statements add labels along the x- and y-axes of the plot, respectively. Finally, the plt.show() statement is used to render the figure.
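For comparison, the same two curves can also be drawn with the lower-level object-oriented interface mentioned earlier. The following is a minimal sketch rather than the author's listing; the legend labels and the variable names fig, ax, line1, and line2 are illustrative choices:

```python
import matplotlib.pyplot as plt
import numpy as np

# same x and y values as in the pyplot version
x = np.linspace(1, 7, 10)
y1 = 4 * x**2 + 2 * x + 2
y2 = 3 * x + 4

# plt.subplots() returns the figure and a single axes object
fig, ax = plt.subplots(figsize=(7, 7))

# plot both curves by calling methods on the axes object
line1, = ax.plot(x, y1, label="y1 = 4x^2 + 2x + 2")
line2, = ax.plot(x, y2, label="y2 = 3x + 4")

# axis labels and a legend, set through the axes object
ax.set_xlabel("X values")
ax.set_ylabel("Y values")
ax.legend()
```

Because every call is made on an explicit axes reference, this style does not depend on which figure happens to be active.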

Components of a Plot

Regardless of the type of plot you make with Matplotlib or whether you use the high-level pyplot interface or the lower-level object-oriented interface, there are common components and terminology associated with plots. This section presents an overview of these concepts. Figure 3.2 depicts the parts of a plot.

Graph depicts the components of a Matplotlib plot. — **FIGURE 3.2** Components of a Matplotlib plot

Figure

A figure object can be thought of as the entire diagram, with all the lines and text. You can think of it as the container; everything you plot using Matplotlib must belong to a figure. Figures are generally created using a pyplot function, even if you then want to manipulate the figure with the lower-level object-oriented API. Use the following pyplot command to create an empty figure:

plt.figure()

If you want the figure to have specific dimensions, you can provide a tuple with the width and height of the figure in inches:

plt.figure(figsize=(10,5))

Besides the figsize attribute, the pyplot figure function can accept various other attributes that you can use to customize aspects of the figure at the point of creation. You can find out more about these attributes at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html.

The pyplot figure function creates a figure and makes it the active figure on which subsequent drawing operations will have effect. The figure is not displayed until there is something drawn to it. If you want to use the object-oriented API to control aspects of the figure after it has been created, you need to store a reference to the figure object in your code and then use the object-oriented API with this reference. The following code snippet creates a figure using the pyplot figure() function and then uses the object-oriented API to change the background color of the figure:
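A minimal sketch of such a snippet follows. The choice of a light grey background and the variable name fig are assumptions made for illustration:

```python
import matplotlib.pyplot as plt

# keep a reference to the Figure object returned by pyplot
fig = plt.figure(figsize=(10, 5))

# use the object-oriented API on the stored reference
# to change the background color of the figure
fig.set_facecolor("lightgrey")
```

The figure returned by plt.figure() is also the active figure, so plt.gcf() returns the same object at this point.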

Axes

An axes object is what you would normally consider as a plot. It is the actual graph with various characteristics commonly associated with plots, such as data points, markers, colors, scales, etc. A figure object can contain multiple axes, which in effect are multiple subplots within a larger diagram. The following snippet uses pyplot to create a figure with a 2 × 2 grid of axes (subplots), and stores references to both the figure and the axes objects. Each axes object has its own title, which is different from the figure title. The object-oriented API is used to set the title for the figure and the four subplots within the figure. Figure 3.3 depicts the figure generated by executing the code snippet.

# create a figure with a 2 x 2 grid of axes objects
figure, axes_list  = plt.subplots(2,2, figsize=(9,9))
 
# title for the figure object
figure.suptitle('This is the title of the figure')  
 
# title for each axes object (subplot)
axes_list[0,0].set_title('Subplot 0')
axes_list[0,1].set_title('Subplot 1')
axes_list[1,0].set_title('Subplot 2')
axes_list[1,1].set_title('Subplot 3')

The Matplotlib figure and axes classes are the primary entry points for working with the lower-level object-oriented API.

Axis

The axis object represents a dimension within a subplot. Two-dimensional plots have two axis objects: one for the horizontal direction and the other for the vertical direction. The axes class, which is part of the object-oriented interface, provides several methods to modify attributes of the underlying x- and y-axis.
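As a rough illustration of those methods, the following sketch sets the limits and tick positions of both axis objects through an axes instance. The specific limits and tick values are arbitrary choices:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 7))

# set the limits of the horizontal and vertical axis objects
ax.set_xlim(0, 10)
ax.set_ylim(0, 100)

# place ticks at explicit positions along the x-axis
ax.set_xticks([0, 2, 4, 6, 8, 10])

# the underlying axis objects are available as ax.xaxis and ax.yaxis
print(type(ax.xaxis).__name__, type(ax.yaxis).__name__)
```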

Axis Labels

The axis label is displayed beneath (or beside) each axis of the plot. Axis labels can be configured by calling the set_xlabel() and set_ylabel() methods on an axes object. The following snippet demonstrates the use of these methods:

figure, axes  = plt.subplots(figsize=(10,10))
axes.set_xlabel('Variable 1')
axes.set_ylabel('Variable 2')

If you do not want to use the object-oriented API exposed by the axes object, you can use the xlabel() and ylabel() functions from the pyplot high-level interface that set the axis labels for the active plot. The use of these functions is demonstrated in the following snippet:

plt.figure(figsize=(7,7))
plt.xlabel('X values')
plt.ylabel('Y values')

Grids

A grid is a set of horizontal and vertical lines inside the plot area that helps in reading values. The axes object provides a method called grid() that can be used to customize the appearance of the grid. You can find out more about the grid() method of the axes class at https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.grid.html. The pyplot module also provides a grid() function that is identical to the similarly named method of the axes class.

The following snippet demonstrates the use of the grid() method of the axes class:

figure, axes  = plt.subplots(figsize=(10,10))
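# the first argument switches the grid on; it is passed positionally because
# the keyword was renamed from b to visible in newer Matplotlib releases
axes.grid(True, which='both', linestyle='--', linewidth=1)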

Figure 3.4 depicts two plots side by side, one with a grid and one without.

Graph depicts the comparison of plots with and without grids. — **FIGURE 3.4** Comparison of plots with and without grids
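A comparison of this kind could be produced along the following lines. This is a sketch only; the sine curve is illustrative data and is not the data used for the figure in the book:

```python
import matplotlib.pyplot as plt
import numpy as np

# illustrative data only
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)

# one row with two axes objects, side by side
fig, (ax_plain, ax_grid) = plt.subplots(1, 2, figsize=(12, 5))

# left: default appearance, no grid
ax_plain.plot(x, y)
ax_plain.set_title("Without grid")

# right: the same data with a dashed grid
ax_grid.plot(x, y)
ax_grid.grid(True, linestyle="--", linewidth=1)
ax_grid.set_title("With grid")
```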

Title

The title is a string that is displayed on top of a plot. Both the figure object and the axes objects within the figure can have titles. Titles are displayed, by default, at the top of the figure or axes object, center aligned. The figure title can be changed using the suptitle() method of the figure class, or the suptitle() function of the pyplot module. These functions are identical to each other in syntax and function. You can find more information on the pyplot suptitle() function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.suptitle.html#matplotlib.pyplot.suptitle.

The following snippet demonstrates the use of the suptitle() method of the figure class, and the suptitle() function of the pyplot module:

# create a figure with one axes object.
# Call the suptitle() method on the figure instance.
figure, axes  = plt.subplots(figsize=(10,10))
figure.suptitle('Figure Title')
 
# create a new figure with one axes object
# use the pyplot suptitle() function
plt.figure(figsize=(7,7))
plt.suptitle('Figure Title')

The title of the axes object can be changed by calling the set_title() method on the axes object. You can find more information on using the set_title() method at https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_title.html#matplotlib.axes.Axes.set_title.

The pyplot module provides a convenience function called title(), which operates on the active axes object and has the same signature as the set_title() method of the axes class. The following snippet demonstrates the use of the set_title() method of the axes class and the title() function of the pyplot module:

# create a figure with one axes object.
# Call the set_title() method on the axes instance.
figure, axes  = plt.subplots(figsize=(10,10))
axes.set_title('Axes Title')
 
# create a new figure
# use the pyplot title() function, which acts on the active axes
plt.figure(figsize=(7,7))
plt.title('Axes Title')

Common Plots

In the previous section you learned the aspects of a typical Matplotlib plot, and that there are often different ways to configure a plot. The high-level pyplot API provides a functional interface and the lower-level Matplotlib API operates using an object-oriented interface. In this section, you will learn to create common types of plots using Matplotlib.

Histograms

A histogram is commonly used to visualize the distribution of a numeric variable. Histograms are not applicable when dealing with a categorical variable. The following snippet uses functions from the pyplot module to generate a histogram of the Age attribute of the popular Titanic dataset. The resulting figure is depicted in Figure 3.5.

Histogram of Passenger Age values, with passenger age on x axis and count on y axis. — **FIGURE 3.5** Histogram of Passenger Age values

import numpy as np
import pandas as pd
 
# load the contents of a file into a pandas Dataframe
input_file = './datasets/titanic_dataset/original/train.csv'
df_titanic = pd.read_csv(input_file)
 
# set the index
df_titanic.set_index("PassengerId", inplace=True)
 
# use pyplot functions to plot a histogram of the 'Age' attribute
fig = plt.figure(figsize=(7,7))
plt.xlabel('Passenger Age')
plt.ylabel('Count')
plt.grid()
 
n, bins, patches = plt.hist(df_titanic['Age'], histtype='bar', 
                            color='#0dc28d', align='mid', 
                            rwidth=0.90, bins=7)

The pyplot hist() function is used to create a histogram. The function takes several arguments. Some of the arguments that have been used in the preceding snippet are histtype='bar', which signifies that you want a standard histogram; rwidth=0.9, which leaves space between the bars; and bins=7, which is the number of bars. You can find a full list of parameters supported by the hist() function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist.

The appearance of the histogram is influenced by the binning strategy, which in turn is controlled by the value you pass into the bins parameter. If you pass an integer (as in the preceding snippet), Matplotlib will create the specified number of equal-width bins and use the boundaries (edges) of the bins to determine the number of values in each bin. You can optionally specify the bin edges instead of the number of bins. In most cases you simply specify the number of bins.
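To illustrate the second option, the following sketch passes explicit bin edges to hist(). The ages array and the chosen edges are made-up values for illustration and do not come from the Titanic dataset:

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up ages, used only to illustrate explicit bin edges
ages = np.array([2, 5, 9, 14, 19, 22, 25, 31, 38, 44, 47, 58, 63, 71])

# four bins of unequal width: [0, 18), [18, 40), [40, 65), [65, 80]
bin_edges = [0, 18, 40, 65, 80]

fig, ax = plt.subplots(figsize=(7, 7))
n, bins, patches = ax.hist(ages, bins=bin_edges, rwidth=0.90)

# counts per bin
print(n)
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both.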

The pyplot hist() function returns a tuple, (n, bins, patches). The first element of the tuple, n, is an array with the counts for each bin. The second element, bins, is an array of floating-point values that contains the bin edges, and the third element, patches, is a collection of rectangle objects that represent the low-level drawing primitives used by Matplotlib to make the bars. You can use the print() function to inspect the contents of the tuple:

print (n)
[ 68. 178. 233. 134.  68.  26.   7.]
 
print (bins)
[ 0.42       11.78857143 23.15714286 34.52571429 45.89428571 57.26285714
 68.63142857 80.        ]
 
print (patches[0])
Rectangle(xy=(0.988429, 0), width=10.2317, height=68, angle=0)

The following snippet uses pyplot functions to create a figure with four subplots (axes objects) and uses the object-oriented API to create histograms of the same data in each of the subplots, but with different numbers of bins. The result of this snippet is depicted in Figure 3.6.

Histograms of Passenger Age values created using different binning strategies, with passenger age on x axis and count on y axis. — **FIGURE 3.6** Histograms of Passenger Age values created using different binning strategies

 
# use pyplot functions and matplotlib object-oriented API
# to plot multiple histograms of the same data, with different
# binning strategies.
 
fig, axes_list = plt.subplots(2,2, figsize=(11,11))
 
# plot a histogram with 3 bins
axes_list[0,0].set_xlabel('Passenger Age')
axes_list[0,0].set_ylabel('Count')
axes_list[0,0].grid()
 
n1, bins1, patches1 = axes_list[0,0].hist(df_titanic['Age'], histtype='bar', 
                            color='#0dc28d', align='mid', 
                            rwidth=0.90, bins=3)
 
 
# plot a histogram with 10 bins
axes_list[0,1].set_xlabel('Passenger Age')
axes_list[0,1].set_ylabel('Count')
axes_list[0,1].grid()
n2, bins2, patches2 = axes_list[0,1].hist(df_titanic['Age'], histtype='bar', 
                            color='#0dc28d', align='mid', 
                            rwidth=0.90, bins=10)
 
# plot a histogram with 30 bins
axes_list[1,0].set_xlabel('Passenger Age')
axes_list[1,0].set_ylabel('Count')
axes_list[1,0].grid()
n3, bins3, patches3 = axes_list[1,0].hist(df_titanic['Age'], histtype='bar', 
                            color='#0dc28d', align='mid', 
                            rwidth=0.90, bins=30)
 
# plot a histogram with 100 bins
axes_list[1,1].set_xlabel('Passenger Age')
axes_list[1,1].set_ylabel('Count')
axes_list[1,1].grid()
n4, bins4, patches4 = axes_list[1,1].hist(df_titanic['Age'], histtype='bar', 
                            color='#0dc28d', align='mid', 
                            rwidth=0.90, bins=100)

As you can infer, the binning strategy significantly affects the appearance of the histogram and the inferences that you can make from the histogram. There is no set rule for the number of bins that must be used; data scientists often try several binning strategies to reveal characteristics of the data that were not previously visible. A common rule of thumb is to start with a number of bins equal to the square root of the number of values, and then adjust as necessary. A commonly used approach in statistics to select the bin width for histograms was proposed in 1981 by David Freedman and Persi Diaconis and is known as the Freedman-Diaconis rule. The general idea is to set the bin width to 2 × IQR / n^(1/3), where IQR is the interquartile range of the data and n is the number of observations. Once you have computed the bin width, you can divide the range of values by the bin width to work out the number of bins. You can get more information on this rule from the original paper published in 1981 titled “On the histogram as a density estimator.” You can access a copy of the paper at https://statistics.stanford.edu/sites/g/files/sbiybj6031/f/EFS%20NSF%20159.pdf.
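Both rules of thumb can be worked through in a few lines of NumPy. The sketch below uses a synthetic sample of 714 normally distributed values as a stand-in for real data; the sample, the seed, and the variable names are assumptions made for illustration:

```python
import numpy as np

# synthetic stand-in data, not the Titanic ages
rng = np.random.default_rng(0)
values = rng.normal(loc=30, scale=14, size=714)
n_obs = values.size

# square-root rule: number of bins = sqrt(number of values), rounded up
sqrt_bins = int(np.ceil(np.sqrt(n_obs)))

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(values, [75, 25])
iqr = q75 - q25
fd_width = 2 * iqr / n_obs ** (1 / 3)

# number of bins = range of the values / bin width, rounded up
fd_bins = int(np.ceil((values.max() - values.min()) / fd_width))

# NumPy can also apply the rule directly and return the bin edges
fd_edges = np.histogram_bin_edges(values, bins="fd")

print(sqrt_bins, fd_bins, len(fd_edges) - 1)
```

The array of edges returned by np.histogram_bin_edges() can be passed to the bins parameter of hist().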

The Pandas dataframe object also provides limited plotting capabilities. These capabilities are built on top of Matplotlib, but in some situations, you may find the Pandas plotting functions simpler to use. The following snippet shows how you could create a simple histogram using the Pandas dataframe plot function:

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
 
# load the contents of a file into a pandas Dataframe
input_file = './datasets/titanic_dataset/original/train.csv'
df_titanic = pd.read_csv(input_file)
 
# set the index
df_titanic.set_index("PassengerId", inplace=True)
 
fig = plt.figure(figsize=(7,7))
plt.xlabel('Passenger Age')
plt.ylabel('Count')
 
df_titanic['Age'].plot.hist(color='#0dc28d', align='mid', 
                            rwidth=0.90, bins=7, grid=True)

Bar Chart

Bar charts are commonly used when you are dealing with a categorical variable. Each bar in a bar chart represents some information about a categorical attribute, such as a count, mean, or other measure. Bar charts can be used with both nominal and ordinal categorical data. When plotting a bar chart for nominal categorical data it is common practice to order the bars so that the height (or length) of the bars increases in an orderly fashion.

This is possible because a category that contains nominal data does not have any inherent order between the values, and hence you can freely order the placement of the bars to create a visually pleasing figure. Continuing with the use of the Titanic dataset from the previous section, the following snippet creates a bar plot of the Embarked attribute. The resulting bar chart is depicted in Figure 3.7.

Bar chart depicts Embarked attribute, with Embarkation point on x axis and Count on y axis. — **FIGURE 3.7** Bar chart of the Embarked attribute

# use pyplot functions to plot a bar chart of the 'Embarked' attribute
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
 
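counts = df_titanic['Embarked'].value_counts(dropna=False)
# take the tick labels from the index of counts so that each label
# is guaranteed to match its bar (missing values are labeled 'nan')
values = [str(v) for v in counts.index]
x_positions = np.arange(len(values))
 
plt.bar(x_positions, counts, align='center')
plt.xticks(x_positions, values)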

Plotting bar charts is significantly simpler with the Pandas dataframe plot function. The following snippet demonstrates how you can create the same bar chart using Pandas functions:

# use Pandas dataframe functions to plot a bar chart of the 'Embarked' attribute
fig = plt.figure(figsize=(7,7))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
 
df_titanic['Embarked'].value_counts(dropna=False).plot.bar(grid=True)
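The practice of ordering the bars of a nominal variable by height, mentioned at the start of this section, can be sketched with the Pandas sort_values() method. The ports Series below is made-up data for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up nominal data, used only for illustration
ports = pd.Series(["S", "C", "S", "Q", "S", "C", "S", "Q", "S", "C", "S"])

# count each category, then sort the counts in ascending order
counts = ports.value_counts().sort_values()

fig, ax = plt.subplots(figsize=(7, 7))
counts.plot.bar(ax=ax, grid=True)

# set the labels after plotting so that Pandas does not overwrite them
ax.set_xlabel("Embarkation Point")
ax.set_ylabel("Count")
```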

Grouped Bar Chart

If you want to show information about different subgroups within each category, a grouped bar chart can be used. A grouped bar chart, as its name suggests, is a chart with groups of bars clustered together. Each bar group provides information on one category, and the length of bars within the group provides information on the individual subgroups within the category.

For example, a grouped bar chart could be used to visualize the distribution of the number of individuals who survived and the number who did not, for each embarkation point. The following snippet creates a grouped bar chart for the Embarked attribute with two bars in each group. The resulting bar chart is depicted in Figure 3.8.

# a grouped bar chart for the Embarked attribute with
# two bars per group.
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
 
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
 
x_positions = np.arange(len(values))
 
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
 
bar_width = 0.35
 
plt.bar(x_positions, embarked_counts_survived, bar_width, color='#009bdb', label='Survived')
plt.bar(x_positions + bar_width, embarked_counts_not_survived, bar_width, color='#00974d', label='Not Survived')
 
# center each tick label between the two bars of its group
plt.xticks(x_positions + bar_width / 2, values)
plt.legend()
 
plt.show()

The following snippet shows how to use the Pandas plotting functions to draw the same grouped bar chart:

# a grouped bar chart for the Embarked attribute with
# two bars per group.
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
 
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived.name = 'Survived'
embarked_counts_not_survived.name = 'Not Survived'
df = pd.concat([embarked_counts_survived, embarked_counts_not_survived], axis=1)
 
fig, axes = plt.subplots(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
 
df.plot.bar(grid=True, ax=axes, color=['#009bdb', '#00974d'])

Stacked Bar Chart

A stacked bar chart provides another way to visualize the same information. Instead of having multiple bars in clustered groups, a stacked bar chart uses one bar per categorical value and splits the bar to depict the distribution of subgroups within the category.

The following snippet creates a stacked bar chart for the Embarked attribute showing the distribution of survivors from each embarkation point. The resulting bar chart is depicted in Figure 3.9.

Bar chart depicts stacked bar chart of embarked attribute, with Embarkation point on x axis and Count on y axis. — **FIGURE 3.9** Stacked bar chart of the Embarked attribute

# a stacked bar chart for the Embarked attribute
# showing the number of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
 
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
 
x_positions = np.arange(len(values))
 
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
plt.grid()
 
plt.bar(x_positions, embarked_counts_survived, color='#009bdb', label='Survived')
plt.bar(x_positions, embarked_counts_not_survived, color='#00974d', label='Not Survived', bottom=embarked_counts_survived)
 
plt.xticks(x_positions, values)
plt.legend()
 
plt.show()

Creating a stacked bar chart with Pandas plotting functions is simply a matter of adding the stacked=True argument to the call to the dataframe.plot.bar() function. The following snippet uses Pandas plotting functions to draw the same stacked bar chart:

# a stacked bar chart for the Embarked attribute
# showing the number of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
 
values = df_titanic['Embarked'].dropna().unique()
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived.name = 'Survived'
embarked_counts_not_survived.name = 'Not Survived'
df = pd.concat([embarked_counts_survived, embarked_counts_not_survived], axis=1)
 
fig, axes = plt.subplots(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Count')
 
df.plot.bar(stacked=True, grid=True, ax=axes, color=['#009bdb', '#00974d'])

The power of Pandas plotting functions over pyplot and Matplotlib is evident with stacked bar charts, especially if you have more than two groups per bar. With Matplotlib and pyplot functions, you will have to plot each group on top of the other using multiple plt.bar() statements. With Pandas, all you need to do is get your data in a dataframe and make a single call to dataframe.plot.bar().
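To make the comparison concrete, the following sketch stacks three groups per bar with Matplotlib by keeping a running total of the heights already drawn. The category names, group names, and values are hypothetical:

```python
import matplotlib.pyplot as plt
import numpy as np

# hypothetical categories and three groups of values per category
categories = ["A", "B", "C"]
groups = {
    "Group 1": np.array([5, 3, 4]),
    "Group 2": np.array([2, 6, 1]),
    "Group 3": np.array([3, 1, 5]),
}

x_positions = np.arange(len(categories))

# running total of the bar heights drawn so far, one entry per category
bottom = np.zeros(len(categories))

fig, ax = plt.subplots(figsize=(9, 9))
for label, heights in groups.items():
    # draw this group on top of everything drawn before it
    ax.bar(x_positions, heights, bottom=bottom, label=label)
    bottom = bottom + heights

ax.set_xticks(x_positions)
ax.set_xticklabels(categories)
ax.legend()
```

With Pandas, the same chart would need only the data in a dataframe, for example pd.DataFrame(groups, index=categories), followed by a single plot.bar(stacked=True) call.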

Stacked Percentage Bar Chart

If you want to show the percentage contribution of each subgroup within the categories, you can use a stacked percentage bar chart. The bars in a stacked percentage bar chart are all the same height. The following snippet creates a stacked percentage bar chart for the Embarked attribute showing the percentage of survivors from each embarkation point. The resulting bar chart is depicted in Figure 3.10.

# a stacked percentage bar chart for the Embarked attribute
# showing the percentage of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
 
values = df_titanic['Embarked'].dropna().unique()
counts = df_titanic['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived_percent = embarked_counts_survived / counts * 100
embarked_counts_not_survived_percent = embarked_counts_not_survived / counts * 100
 
x_positions = np.arange(len(values))
 
fig = plt.figure(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Percentage')
plt.grid()
 
plt.bar(x_positions, embarked_counts_survived_percent, color='#009bdb', label='Survived')
plt.bar(x_positions, embarked_counts_not_survived_percent, color='#00974d', label='Not Survived', bottom=embarked_counts_survived_percent)
 
plt.xticks(x_positions, values)
plt.legend()
 
plt.show()

The following snippet creates an equivalent chart using Pandas plotting functions:

# a stacked percentage bar chart for the Embarked attribute
# showing the percentage of survivors in each category
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
 
values = df_titanic['Embarked'].dropna().unique()
counts = df_titanic['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived_percent = embarked_counts_survived / counts * 100
embarked_counts_not_survived_percent = embarked_counts_not_survived / counts * 100
 
embarked_counts_survived_percent.name = 'Survived'
embarked_counts_not_survived_percent.name = 'Not Survived'
df = pd.concat([embarked_counts_survived_percent, embarked_counts_not_survived_percent], axis=1)
 
fig, axes = plt.subplots(figsize=(9,9))
plt.xlabel('Embarkation Point')
plt.ylabel('Percentage')
 
df.plot.bar(stacked=True, grid=True, ax=axes, color=['#009bdb', '#00974d'])
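A more compact way to arrive at the same percentages is the Pandas crosstab() function with normalize='index', which divides each row by its row total. This is an alternative technique and not the approach used in the listings above; the df_demo dataframe is a small made-up sample:

```python
import matplotlib.pyplot as plt
import pandas as pd

# small made-up sample with the same column names as the Titanic dataset
df_demo = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q", "Q", "Q"],
    "Survived": [1, 0, 0, 0, 1, 0, 1, 1, 1, 0],
})

# rows: embarkation point, columns: survival flag,
# each row normalized to sum to 1 and then scaled to a percentage
percent = pd.crosstab(df_demo["Embarked"], df_demo["Survived"],
                      normalize="index") * 100
percent = percent.rename(columns={0: "Not Survived", 1: "Survived"})

fig, ax = plt.subplots(figsize=(9, 9))
percent.plot.bar(stacked=True, grid=True, ax=ax)
ax.set_xlabel("Embarkation Point")
ax.set_ylabel("Percentage")
```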

Pie Charts

Pie charts are an alternative to bar charts and can be used to plot the proportion of unique values in a categorical attribute. The pyplot module provides the pie() function that can be used to create a pie chart. The pie function of the pyplot module uses the underlying pie() method of the axes class. You can access the documentation for the pyplot pie() function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pie.html.

The following snippet uses the pyplot pie() function on the contents of the Embarked column of the Titanic dataset to create a pie chart that depicts the percentage of passengers boarding from each embarkation point. Figure 3.11 depicts the resulting pie chart.

Pie chart depicts the proportion of passengers embarking from different ports. — **FIGURE 3.11** Pie chart of proportion of passengers embarking from different ports

# use pyplot functions to plot a pie chart of the 'Embarked' attribute
fig = plt.figure(figsize=(9,9))
 
counts = df_titanic['Embarked'].value_counts(dropna=True)
 
total_embarked = counts.values.sum()
counts_percentage = counts / total_embarked * 100
 
# use the index of counts_percentage as labels so that each label
# is guaranteed to match its wedge
plt.pie(counts_percentage.values, 
        labels=counts_percentage.index, 
        autopct='%1.1f%%', shadow=True, startangle=90)

You can also use the Pandas dataframe plot.pie() function to create pie charts. If your data is already in a Pandas dataframe, this approach will require significantly less code. The downside to using the Pandas plotting function is that it gives you fewer options to customize the pie chart. The following snippet uses Pandas plotting functions to make the same pie chart:

# use Pandas functions to plot a pie chart of the 'Embarked' attribute
fig = plt.figure(figsize=(7,7))
df_titanic['Embarked'].value_counts(dropna=True).plot.pie()
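Some customization is still possible with the Pandas function, because additional keyword arguments appear to be passed through to the underlying Matplotlib pie function. The following sketch relies on that behavior and uses made-up data; treat it as an illustration and not as a complete list of supported options:

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up data: 5 x 'S', 3 x 'C', 2 x 'Q'
ports = pd.Series(["S", "S", "S", "S", "S", "C", "C", "C", "Q", "Q"])
counts = ports.value_counts()

fig, ax = plt.subplots(figsize=(7, 7))

# autopct and startangle are Matplotlib pie options supplied via Pandas
counts.plot.pie(ax=ax, autopct="%1.1f%%", startangle=90)

# remove the axis label that Pandas adds beside the pie
ax.set_ylabel("")
```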

When the target attribute of a dataset is binary, a pie chart can help convey the distribution of values in each categorical attribute, grouped by the target binary attribute. The following snippet uses Matplotlib's Axes.pie() method to create three pie charts showing the percentage of survivors from the three embarkation ports in the Titanic dataset. Figure 3.12 depicts the resulting pie charts.

Pie charts depict the proportion of survivors from each embarkation point — **FIGURE 3.12** Pie charts showing the proportion of survivors from each embarkation point

# three pie charts, showing the proportion
# of survivors for each embarkation point
# S = Southampton
# C = Cherbourg
# Q = Queenstown
S_df = df_titanic[df_titanic['Embarked'] == 'S']
C_df = df_titanic[df_titanic['Embarked'] == 'C']
Q_df = df_titanic[df_titanic['Embarked'] == 'Q']
 
S_Total_Embarked = S_df['Embarked'].count()
S_Survived_Count = S_df[S_df['Survived']==1].Embarked.count()
S_Survived_Percentage = S_Survived_Count / S_Total_Embarked * 100
S_Not_Survived_Percentage = 100.0 - S_Survived_Percentage
 
C_Total_Embarked = C_df['Embarked'].count()
C_Survived_Count = C_df[C_df['Survived']==1].Embarked.count()
C_Survived_Percentage = C_Survived_Count / C_Total_Embarked * 100
C_Not_Survived_Percentage = 100.0 - C_Survived_Percentage
 
Q_Total_Embarked = Q_df['Embarked'].count()
Q_Survived_Count = Q_df[Q_df['Survived']==1].Embarked.count()
Q_Survived_Percentage = Q_Survived_Count / Q_Total_Embarked * 100
Q_Not_Survived_Percentage = 100.0 - Q_Survived_Percentage
 
 
fig, axes = plt.subplots(1, 3, figsize=(16,4))
 
Wedge_Labels = ['Survived', 'Not Survived']
S_Wedge_Sizes = [S_Survived_Percentage, S_Not_Survived_Percentage]
C_Wedge_Sizes = [C_Survived_Percentage, C_Not_Survived_Percentage]
Q_Wedge_Sizes = [Q_Survived_Percentage, Q_Not_Survived_Percentage]
 
axes[0].pie(S_Wedge_Sizes, labels=Wedge_Labels, autopct='%1.1f%%', shadow=True, startangle=90)
axes[0].set_title('Southampton')
 
axes[1].pie(C_Wedge_Sizes, labels=Wedge_Labels, autopct='%1.1f%%', shadow=True, startangle=90)
axes[1].set_title('Cherbourg')
 
axes[2].pie(Q_Wedge_Sizes, labels=Wedge_Labels, autopct='%1.1f%%', shadow=True, startangle=90)
axes[2].set_title('Queenstown')

The same pie charts can be created with far less code using the Pandas plotting functions, at the cost of some customizability. The following snippet will generate the same chart using Pandas plotting functions:

# three pie charts, showing the proportion
# of survivors for each embarkation point
# S = Southampton
# C = Cherbourg
# Q = Queenstown
 
survived_df = df_titanic[df_titanic['Survived']==1]
not_survived_df = df_titanic[df_titanic['Survived']==0]
 
values = df_titanic['Embarked'].dropna().unique()
counts = df_titanic['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived = survived_df['Embarked'].value_counts(dropna=True)
embarked_counts_not_survived = not_survived_df['Embarked'].value_counts(dropna=True)
 
embarked_counts_survived_percent = embarked_counts_survived / counts * 100
embarked_counts_not_survived_percent = embarked_counts_not_survived / counts * 100
 
embarked_counts_survived_percent.name = 'Survived'
embarked_counts_not_survived_percent.name = 'Not Survived'
df = pd.concat([embarked_counts_survived_percent, embarked_counts_not_survived_percent], axis=1)
 
df.T.plot.pie(sharex=True, subplots=True, figsize=(16, 4))
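
The per-port survival percentages can also be obtained in a single step with the Pandas crosstab() function. The following is a minimal sketch, assuming a small made-up dataframe (df_sample is hypothetical and stands in for df_titanic; its values are not taken from the Titanic dataset). Passing normalize='index' makes each row sum to 1, so multiplying by 100 gives percentages per embarkation port:

```python
import pandas as pd

# hypothetical stand-in for df_titanic with only the two columns needed here
df_sample = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [1, 0, 0, 0, 1, 0, 1, 1],
})

# rows are embarkation ports, columns are the values of 'Survived' (0 and 1);
# normalize='index' converts the counts in each row to fractions of that row
survival_pct = pd.crosstab(df_sample['Embarked'],
                           df_sample['Survived'],
                           normalize='index') * 100

# the columns are ordered 0, 1, so rename them accordingly
survival_pct.columns = ['Not Survived', 'Survived']
print(survival_pct)
```

Transposing the result (survival_pct.T) gives a dataframe with one column per port, which is the layout that plot.pie(subplots=True) expects.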

Box Plot

A box plot provides a way to view the distribution of a numerical attribute. It was created in 1969 by John Tukey. A box plot shows whether the values of the attribute are symmetrically distributed, how widely they are spread, and which values are outliers. A box plot provides information on five statistical quantities of the attribute values:

First quartile (Q1): 25% of the values of the attribute are less than this number. This is also known as the 25th percentile.
Second quartile (Q2): 50% of the values of the attribute are less than this number. This is also known as the 50th percentile, or the median value.
Third quartile (Q3): 75% of the values of the attribute are less than this number. This is also known as the 75th percentile.
Minimum: This value is computed using the formula Min = Q1 – 1.5 * IQR, where IQR is the inter-quartile range, defined as Q3 – Q1. Any attribute values that are lower than this minimum will be treated as outliers in the plot.
Maximum: This value is computed using the formula Max = Q3 + 1.5 * IQR, where IQR is the inter-quartile range, defined as Q3 – Q1. Any attribute values that are greater than this maximum will be treated as outliers in the plot.
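
The five quantities listed above can be computed directly, which is a useful check on what a box plot is showing. The following is a minimal sketch using NumPy on a small made-up array (the values and variable names are illustrative and are not taken from the Titanic dataset). Note that Matplotlib draws each whisker at the most extreme data point that lies within these bounds, rather than at the bound itself:

```python
import numpy as np

# hypothetical sample values, chosen only for illustration
ages = np.array([2, 18, 21, 24, 27, 30, 33, 36, 40, 80], dtype=float)

# first quartile, median, and third quartile
q1, q2, q3 = np.percentile(ages, [25, 50, 75])

# inter-quartile range: the width of the box
iqr = q3 - q1

# values outside these bounds are treated as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = ages[(ages < lower_bound) | (ages > upper_bound)]

print('Q1:', q1, 'median:', q2, 'Q3:', q3, 'IQR:', iqr)
print('bounds:', lower_bound, upper_bound)
print('outliers:', outliers)
```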

The pyplot module provides the boxplot() function that can be used to create a box plot. The boxplot() function of the pyplot module uses the underlying boxplot() method of the axes class. You can access the documentation for the pyplot boxplot() function at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html.

The following snippet uses the pyplot boxplot() function on the contents of the Age column of the Titanic dataset to create a box plot that depicts the spread of the values of the attribute. Figure 3.13 depicts the resulting box plot.

**FIGURE 3.13** Box plot showing the distribution of the Age attribute

# use pyplot functions to create a box plot of the 'Age' attribute
fig , axes = plt.subplots(figsize=(9,9))
box_plot = plt.boxplot(df_titanic['Age'].dropna())
axes.set_xticklabels(['Age'])

The following snippet will generate the same box plot using Pandas plotting functions:

df_titanic.boxplot(column = 'Age', figsize=(9,9), grid=False);

The compact nature of box plots makes them extremely useful to compare the distributions of different numerical attributes, or the distributions of subgroups within a numerical attribute. The following snippet uses pyplot functions to create two box plots of the Age attribute, one for those that survived the Titanic disaster, and the other for those that did not. Figure 3.14 depicts the resulting box plots.

**FIGURE 3.14** Box plots of the Age attribute comparing the distribution of survivors with those who did not survive the Titanic disaster

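# compare box plots of the Age attribute for those who survived
# against those that did not.
# drop only the rows with a missing Age; a plain dropna() would also
# discard every row that has a missing value in any other column.
survived_df = df_titanic[df_titanic['Survived']==1].dropna(subset=['Age'])
not_survived_df = df_titanic[df_titanic['Survived']==0].dropna(subset=['Age'])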
 
fig , axes = plt.subplots(figsize=(9,9))
box_plot = plt.boxplot([survived_df['Age'], not_survived_df['Age']], 
                       labels=['Survived', 'Not Survived'])

The following snippet will generate the same box plot using Pandas plotting functions:

df_titanic.boxplot(column = 'Age', by = 'Survived', figsize=(9,9), grid=False);

Scatter Plots

A scatter plot is a 2D plot that plots two continuous numeric attributes against each other. One attribute is plotted along the x-axis and the other is plotted along the y-axis. Each point of a scatter plot represents the values of two variables for one observation, which makes scatter plots well suited to visualizing the correlation between variables. Scatter plots can also be used to visualize the grouping of data. The following snippet creates a scatter plot of the Age and Fare attributes from the Titanic dataset after imputing missing Age values with the median age and missing Fare values with the mean fare. Figure 3.15 depicts the resulting scatter plot.

**FIGURE 3.15** Scatter plot of the Age attribute (x-axis) against the Fare attribute (y-axis)

# impute missing values:
# Age with the median age
# Fare with the mean fare
median_age = df_titanic['Age'].median()
df_titanic['Age'] = df_titanic['Age'].fillna(median_age)
 
mean_fare = df_titanic['Fare'].mean()
df_titanic['Fare'] = df_titanic['Fare'].fillna(mean_fare)
 
# use pyplot functions to create a scatter plot of the 'Age' and 'Fare' attribute
fig , axes = plt.subplots(figsize=(9,9))
plt.xlabel('Age')
plt.ylabel('Fare')
plt.grid()
 
plt.scatter(df_titanic['Age'], df_titanic['Fare'])

The Pandas dataframe contains the plot.scatter() function that can be used to create a scatter plot. The following snippet demonstrates the use of this function to create an equivalent scatter plot:

df_titanic.plot.scatter(x='Age', y='Fare', figsize=(9,9))

You can use a scatter plot to get a visual indicator of the degree of correlation between two attributes. It is common to create a matrix of scatter plots of each attribute in the dataset against every other attribute; this, however, is only practical for a small number of attributes. Figure 3.16 shows the scatter plot of attributes that have the ideal strong positive and strong negative correlation. The ideal strong positive correlation would occur when most of the points lie along a straight line from the bottom-left corner to the top-right corner of the plot. The ideal strong negative correlation would occur when most of the points lie along a straight line from the top-left corner to the bottom-right corner of the plot.

**FIGURE 3.16** Scatter plots depicting the ideal strong positive and strong negative correlation
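
The visual impression of correlation can be backed up with a number. The following is a minimal sketch, assuming synthetic data generated with NumPy (the variable names, slope, and noise level are illustrative choices, not values from the chapter). It computes the Pearson correlation coefficient, a measure that is not otherwise used in this section, for a strongly positive and a strongly negative relationship; values close to +1 or -1 correspond to points that lie close to a rising or falling straight line:

```python
import numpy as np

# synthetic data: x is uniform, and y follows x along a straight line
# with a small amount of random noise added
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
noise = rng.normal(0, 0.5, size=200)

y_positive = 2.0 * x + noise    # rises with x
y_negative = -2.0 * x + noise   # falls as x rises

# np.corrcoef returns a 2 x 2 correlation matrix; the off-diagonal
# element is the correlation between the two inputs
r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]

print('positive correlation:', r_positive)
print('negative correlation:', r_negative)
```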

The following snippet presents a function that can be used to generate a scatter plot matrix out of the contents of a Pandas dataframe. The function takes three arguments. The first is the dataframe object, the second is the height of the figure (in inches), and the third is the width of the figure (in inches):

# Generates an M x M scatter plot matrix of subplots,
# where M is the number of columns in df_input.
def generate_scatterplot_matrix(df_input, size_h, size_w):
    num_points, num_attributes = df_input.shape
    # figsize expects (width, height) in inches
    fig, axes = plt.subplots(num_attributes, num_attributes, figsize=(size_w, size_h))
 
    column_names = df_input.columns.values
    
    for x in range(0, num_attributes):
        for y in range(0, num_attributes):
            # row x supplies the y-axis data and column y the x-axis data,
            # so that the axis labels set below match what is plotted
            axes[x , y].scatter(df_input.iloc[:,y], df_input.iloc[:,x])
            
            # configure the ticks
            axes[x , y].xaxis.set_visible(False)
            axes[x , y].yaxis.set_visible(False)
 
            # Set up ticks only on one side for the "edge" subplots.
            # Recent Matplotlib releases expose is_first_col() and related
            # methods on the SubplotSpec rather than on the Axes object.
            subplot_spec = axes[x , y].get_subplotspec()
            if subplot_spec.is_first_col():
                axes[x , y].yaxis.set_ticks_position('left')
                axes[x , y].yaxis.set_visible(True)
                axes[x , y].set_ylabel(column_names[x])
                
            if subplot_spec.is_last_col():
                axes[x , y].yaxis.set_ticks_position('right')
                axes[x , y].yaxis.set_visible(True)
            
            if subplot_spec.is_first_row():
                axes[x , y].xaxis.set_ticks_position('top')
                axes[x , y].xaxis.set_visible(True)
                
            if subplot_spec.is_last_row():
                axes[x , y].xaxis.set_ticks_position('bottom')
                axes[x , y].xaxis.set_visible(True)
                axes[x , y].set_xlabel(column_names[y])
                
 
    return fig, axes

To see a scatter plot matrix, let's use the generate_scatterplot_matrix() function on the popular Iris dataset. Recall from Chapter 2 that Scikit-learn contains a toy version of the Iris dataset. The dataset contains the lengths and widths of the sepals and petals of iris flowers. The following snippet loads the Iris dataset into a dataframe and uses the generate_scatterplot_matrix() function to create a scatter plot matrix. The resulting figure is depicted in Figure 3.17.

**FIGURE 3.17** Scatter plot matrix of the features of the Iris dataset

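# importing the datasets submodule explicitly; a bare "import sklearn"
# does not guarantee that sklearn.datasets is loaded
import sklearn.datasets
iris = sklearn.datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)
generate_scatterplot_matrix(df_iris, 20, 20)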

As you can see, Matplotlib does not have a built-in function to create a scatter plot matrix. The Pandas plotting module has a function called scatter_matrix() that can be used to generate a scatter plot matrix from a dataframe. The following snippet demonstrates the use of the scatter_matrix() function:

import sklearn.datasets
import pandas.plotting
 
iris = sklearn.datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)
 
pandas.plotting.scatter_matrix(df_iris, figsize=(12, 12))

Scatter plots can also be used to visualize clusters within data. The following snippet creates a synthetic dataset of x, y values in four clusters and plots all the values in a scatter plot. The synthetic data is created using Scikit-learn's make_blobs() function. You can learn more about this function at https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html. Figure 3.18 depicts the resulting scatter plot.

**FIGURE 3.18** Scatter plot of four clusters of data

# scatter plots can also be used to visualize 
# groups within data. This is illustrated below
# using a synthetic dataset
 
from sklearn.datasets import make_blobs
coordinates, clusters = make_blobs(n_samples = 500, n_features = 2, centers=4, random_state=12)
 
coordinates_cluster1 = coordinates[clusters==0]
coordinates_cluster2 = coordinates[clusters==1]
coordinates_cluster3 = coordinates[clusters==2]
coordinates_cluster4 = coordinates[clusters==3]
 
fig , axes = plt.subplots(figsize=(9,9))
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.grid()
 
plt.scatter(coordinates_cluster1[:,0], coordinates_cluster1[:,1])
plt.scatter(coordinates_cluster2[:,0], coordinates_cluster2[:,1])
plt.scatter(coordinates_cluster3[:,0], coordinates_cluster3[:,1])
plt.scatter(coordinates_cluster4[:,0], coordinates_cluster4[:,1])
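
When the number of clusters is not fixed, the same plot can be produced with a loop instead of one variable per cluster. The following is a minimal sketch of that variation, using the same make_blobs() call; the loop variable names and the legend labels are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

coordinates, clusters = make_blobs(n_samples=500, n_features=2, centers=4, random_state=12)

fig, axes = plt.subplots(figsize=(9, 9))
axes.set_xlabel('X Values')
axes.set_ylabel('Y Values')
axes.grid()

# one scatter() call per cluster label; each call gets its own color
# from the default color cycle, and a label for the legend
for cluster_id in np.unique(clusters):
    members = coordinates[clusters == cluster_id]
    axes.scatter(members[:, 0], members[:, 1], label='Cluster ' + str(cluster_id + 1))

axes.legend()
```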

Summary

Matplotlib is a plotting library for Python that offers functionality to generate numerous types of plots and the ability to customize these plots.
The pyplot module within Matplotlib provides a high-level functional plotting interface.
Matplotlib also provides a lower-level object-oriented API that can be used on its own, or in conjunction with the pyplot module.
Seaborn is another Python plotting package that builds on top of Matplotlib.
A histogram is commonly used to visualize the distribution of a numeric variable. Histograms are not applicable when dealing with categorical variables.
The binning strategy significantly affects the appearance of a histogram.
Bar charts are commonly used when you are dealing with categorical variables. Bar charts can be used with both nominal and ordinal categorical data.
A stacked bar chart uses one bar per categorical value and splits the bar to depict the distribution of subgroups within the category.
A box plot provides a way to view the distribution of a numerical attribute.
