3. Introduction to Plotting

3.1 Introduction

Data visualization is as much a part of the data processing step as the data presentation step. It is much easier to compare values when they are plotted than when they are listed as numbers. By visualizing data, we get a better intuitive sense of the data than would be possible by looking at tables of values alone. Additionally, visualizations can bring to light hidden patterns in data that you, the analyst, can exploit for model selection.

Concept Map

1. Prior knowledge

a. containers

b. using functions

c. subsetting and indexing

d. classes

2. matplotlib

3. seaborn

4. Pandas

Objectives

This chapter will cover:

1. matplotlib

2. seaborn

3. Plotting in Pandas

The quintessential example for creating visualizations of data is Anscombe’s quartet. This data set was created by English statistician Frank Anscombe to show the importance of statistical graphs.

The Anscombe data set contains four sets of data, each of which contains two continuous variables. Each set has the same mean, variance, correlation, and regression line. However, only when the data are visualized does it become obvious that each set does not follow the same pattern. This goes to show the benefits of visualizations and the pitfalls of looking at only summary statistics.

# the anscombe data set can be found in the seaborn library
import seaborn as sns
anscombe = sns.load_dataset("anscombe")
print(anscombe)

   dataset     x      y
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68
11      II  10.0   9.14
12      II   8.0   8.14
13      II  13.0   8.74
14      II   9.0   8.77
15      II  11.0   9.26
16      II  14.0   8.10
17      II   6.0   6.13
18      II   4.0   3.10
19      II  12.0   9.13
20      II   7.0   7.26
21      II   5.0   4.74
22     III  10.0   7.46
23     III   8.0   6.77
24     III  13.0  12.74
25     III   9.0   7.11
26     III  11.0   7.81
27     III  14.0   8.84
28     III   6.0   6.08
29     III   4.0   5.39
30     III  12.0   8.15
31     III   7.0   6.42
32     III   5.0   5.73
33      IV   8.0   6.58
34      IV   8.0   5.76
35      IV   8.0   7.71
36      IV   8.0   8.84
37      IV   8.0   8.47
38      IV   8.0   7.04
39      IV   8.0   5.25
40      IV  19.0  12.50
41      IV   8.0   5.56
42      IV   8.0   7.91
43      IV   8.0   6.89
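The claim that the four sets share their summary statistics can be checked directly. The following is a minimal sketch that copies the values from the table above into plain Python lists instead of loading them with seaborn; the helper pearson_r and the summary dictionary are names introduced here for illustration only.

```python
import statistics

# Anscombe's quartet, copied from the printed table
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x for sets I, II, and III
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                      6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                       6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


summary = {}
for name, (x, y) in quartet.items():
    summary[name] = {
        "mean_x": statistics.mean(x),
        "var_x": statistics.variance(x),  # sample variance
        "mean_y": statistics.mean(y),
        "var_y": statistics.variance(y),
        "corr": pearson_r(x, y),
    }
    print(name, {key: round(value, 2) for key, value in summary[name].items()})
```

The printed rows agree to two decimal places, which is exactly why the numbers alone cannot tell the four sets apart.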

3.2 Matplotlib

matplotlib is Python’s fundamental plotting library. It is extremely flexible and gives the user full control over all elements of the plot.

Importing matplotlib’s plotting features is a little different from our previous package imports. You can think of it as importing the package matplotlib, with all of the plotting utilities being found under a subfolder (or subpackage) called pyplot. Just as we imported a package and gave it an abbreviated name, we can do the same with matplotlib.pyplot.

import matplotlib.pyplot as plt

Most of the basic plots are created with functions found under plt, the most basic of which is plt.plot. In our example, the plotting feature takes one vector for the x-values, and a corresponding vector for the y-values (Figure 3.1).

The plot of Anscombe data set I using Matplotlib is shown.

Figure 3.1 Anscombe data set I

# create a subset of the data
# contains only data set 1 from anscombe
dataset_1 = anscombe[anscombe['dataset'] == 'I']

plt.plot(dataset_1['x'], dataset_1['y'])

By default, plt.plot will draw lines. If we want it to draw circles (points) instead, we can pass an 'o' parameter to tell plt.plot to use points (Figure 3.2).

Graph shows an Anscombe data set I.

Figure 3.2 Anscombe data set I using points

plt.plot(dataset_1['x'], dataset_1['y'], 'o')
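To see what the format string changes, we can inspect the objects that plt.plot returns. This is a small sketch with made-up values (not the anscombe data); the "Agg" backend line is only there so the code runs without a display and is an assumption about your environment.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting on screen
import matplotlib.pyplot as plt

# a few illustrative values (not taken from the anscombe data)
x = [1, 2, 3, 4]
y = [2.0, 3.5, 3.0, 5.0]

# plt.plot returns a list with one Line2D object per plotted series
(line_default,) = plt.plot(x, y)           # default: a line, no markers
(points_only,) = plt.plot(x, y, "o")       # circles, no connecting line
(line_and_points,) = plt.plot(x, y, "o-")  # circles joined by a solid line

for artist in (line_default, points_only, line_and_points):
    print(repr(artist.get_marker()), repr(artist.get_linestyle()))
```

The same string can combine a marker and a line style, so 'o-' draws both the points and the line between them.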

We can repeat this process for the rest of the datasets in our anscombe data.

# create subsets of the anscombe data
dataset_2 = anscombe[anscombe['dataset'] == 'II']
dataset_3 = anscombe[anscombe['dataset'] == 'III']
dataset_4 = anscombe[anscombe['dataset'] == 'IV']

At this point, we could make these plots individually, one at a time, but matplotlib offers a much handier way to create subplots. That is, you can specify the dimensions of your final figure, and put in smaller plots to fit the specified dimensions. In this way, you can present your results in a single figure, instead of completely separate ones.

The subplot syntax takes three parameters:

1. Number of rows in the figure for subplots

2. Number of columns in the figure for subplots

3. Subplot location

The subplot location is sequentially numbered, and plots are placed first in a left-to-right direction, then from top to bottom. If we try to plot this now (by just running the following code), we will get an empty figure (Figure 3.3). All we have done so far is create a figure and split the figure into a 2 × 2 grid where plots can be placed. Since no plots were created and inserted, nothing will show up.

Matplotlib figure shows four empty graphs. Each of the four graphs shows the vertical axis ranges from 0.00 to 1.00, in increments of 0.25. The horizontal axis ranges from 0.0 to 1.0, in increments of 0.2.

Figure 3.3 Matplotlib figure with four empty axes

# create the entire figure where our subplots will go
fig = plt.figure()

# tell the figure how the subplots should be laid out
# in the example, we will have
# 2 rows of plots, and each row will have 2 plots

# subplot has 2 rows and 2 columns, plot location 1
axes1 = fig.add_subplot(2, 2, 1)

# subplot has 2 rows and 2 columns, plot location 2
axes2 = fig.add_subplot(2, 2, 2)

# subplot has 2 rows and 2 columns, plot location 3
axes3 = fig.add_subplot(2, 2, 3)

# subplot has 2 rows and 2 columns, plot location 4
axes4 = fig.add_subplot(2, 2, 4)
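The left-to-right, then top-to-bottom numbering can be confirmed by asking each subplot where it sits in the grid. This is a sketch only: the positions dictionary is a name introduced here, and the "Agg" backend is assumed so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting on screen
import matplotlib.pyplot as plt

fig = plt.figure()
positions = {}
for location in range(1, 5):
    ax = fig.add_subplot(2, 2, location)
    # get_geometry() reports (rows, columns, first cell, last cell),
    # with the cells counted from 0 rather than from 1
    n_rows, n_cols, first_cell, _ = ax.get_subplotspec().get_geometry()
    # convert the cell number into a 0-based (row, column) pair
    positions[location] = divmod(first_cell, n_cols)
    ax.set_title(f"location {location}")

print(positions)
```

Locations 1 and 2 fill the top row, and locations 3 and 4 fill the bottom row.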

We can use the plot method on each axes to create our plot (Figure 3.4).

Graph shows a Matplotlib figure with four scatter plots of Anscombe datasets.

Figure 3.4 Matplotlib figure with four scatterplots

# add a plot to each of the axes created above
axes1.plot(dataset_1['x'], dataset_1['y'], 'o')
axes2.plot(dataset_2['x'], dataset_2['y'], 'o')
axes3.plot(dataset_3['x'], dataset_3['y'], 'o')
axes4.plot(dataset_4['x'], dataset_4['y'], 'o')

[<matplotlib.lines.Line2D at 0x7f8f96598b70>]

Finally, we can add a title to each of our subplots, and use tight_layout to make sure the axes are spread apart from one another (Figure 3.5).

Graph shows a Matplotlib figure with four scatter plots for Anscombe data visualization.

Figure 3.5 Anscombe data visualization

# add a small title to each subplot
axes1.set_title("dataset_1")
axes2.set_title("dataset_2")
axes3.set_title("dataset_3")
axes4.set_title("dataset_4")

# add a title for the entire figure
fig.suptitle("Anscombe Data")

# use a tight layout
fig.tight_layout()

The Anscombe data visualizations illustrate why just looking at summary statistic values can be misleading. The moment the points are visualized, it becomes clear that even though each data set has the same summary statistic values, the relationships between points vastly differ across the data sets.

To finish off the Anscombe example, we can add set_xlabel() and set_ylabel() to each of the subplots to add x- and y-axis labels, just as we added a title to the figure.
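One way the axis labels could be added is sketched below. To keep the example self-contained it uses only sets I and IV, with values copied from the table in Section 3.1, and a loop in place of one line per subplot; the subsets dictionary and the "Agg" backend are assumptions made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting on screen
import matplotlib.pyplot as plt

# x and y values for Anscombe sets I and IV, copied from the printed table
subsets = {
    "I": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fig = plt.figure()
axes = []
for location, (name, (x, y)) in enumerate(subsets.items(), start=1):
    ax = fig.add_subplot(1, 2, location)  # 1 row, 2 columns
    ax.plot(x, y, "o")
    ax.set_title(f"dataset {name}")
    ax.set_xlabel("x")  # label for the x-axis of this subplot
    ax.set_ylabel("y")  # label for the y-axis of this subplot
    axes.append(ax)

fig.suptitle("Anscombe Data")
fig.tight_layout()
```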

Before we move on and learn how to create more statistical plots, you should become familiar with the matplotlib documentation on “Parts of a Figure.”1 I have reproduced its older figure in Figure 3.6, and the newer figure in Figure 3.7.

1. Parts of a matplotlib figure: http://matplotlib.org/faq/usage_faq.html#parts-of-a-figure

The older version of the parts of a matplotlib graph is shown.

Figure 3.6 One of the most confusing parts of plotting in Python is the use of the terms “axis” and “axes,” since they are pronounced the same way but refer to different parts of a figure. This was the older version of the “Parts of a Figure” figure from the matplotlib documentation.

The newer version of the parts of a matplotlib graph is shown.

Figure 3.7 A newer version of the “Parts of a Figure” depiction, with more details about the other aspects of a figure. Unlike the older figure, the newer one is completely created using matplotlib.

One of the most confusing parts of plotting in Python is the use of the terms “axis” and “axes,” especially when trying to verbally describe the different parts (since they are pronounced the same way). In the Anscombe example, each individual subplot has axes. The axes contain both an x-axis and a y-axis. All four subplots together make the figure.
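The same hierarchy is visible in the objects themselves. The sketch below assumes the "Agg" backend so that it runs without a display; it creates a figure with two axes and looks at what each one contains.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting on screen
import matplotlib.pyplot as plt

fig = plt.figure()                # the figure: the whole canvas
axes1 = fig.add_subplot(1, 2, 1)  # an axes: one subplot within the figure
axes2 = fig.add_subplot(1, 2, 2)

print(type(fig).__name__)          # class of the figure object
print(len(fig.axes))               # the figure keeps a list of its axes
print(type(axes1.xaxis).__name__)  # each axes owns an x-axis object
print(type(axes1.yaxis).__name__)  # and a y-axis object
```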

The remainder of the chapter shows you how to create statistical plots, first with matplotlib and later using a higher-level plotting library that is based on matplotlib and specifically made for statistical graphics, seaborn.

3.3 Statistical Graphics Using matplotlib

The tips data we will be using for the next series of visualizations come from the seaborn library. This data set records the tips that people left, along with several related variables: for example, the total cost of the bill, the size of the party, the day of the week, and the time of day.

We can load this data set just as we did the Anscombe data set.

tips = sns.load_dataset("tips")
print(tips.head())

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

3.3.1 Univariate

In statistics jargon, the term “univariate” refers to a single variable.

3.3.1.1 Histograms

Histograms are the most common means of looking at a single variable. The values are “binned,” meaning they are grouped together and plotted to show the distribution of the variable (Figure 3.8).

Graph titled “Histogram of Total Bill” is shown.

Figure 3.8 Histogram using matplotlib

fig = plt.figure()
axes1 = fig.add_subplot(1, 1, 1)
axes1.hist(tips['total_bill'], bins=10)
axes1.set_title('Histogram of Total Bill')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Frequency')
fig.show()
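To make the idea of binning concrete, the counts behind a histogram can be computed without drawing anything. This sketch uses NumPy's histogram function on a handful of made-up bill amounts (they are not the tips data), and five bins rather than ten to keep the output short.

```python
import numpy as np

# hypothetical bill amounts, used only for illustration
bills = np.array([3.07, 8.5, 10.3, 12.4, 15.0, 16.99, 18.2, 21.0, 24.6, 48.3])

# split the range of the data into 5 equal-width bins
# and count how many values fall into each one
counts, edges = np.histogram(bills, bins=5)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")
```

Each printed row corresponds to one bar: the bin edges give its position and width, and the count gives its height.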

3.3.2 Bivariate

In statistics jargon, the term “bivariate” refers to two variables.

3.3.2.1 Scatterplot

Scatterplots are used when a continuous variable is plotted against another continuous variable (Figure 3.9).

Graph titled “Scatterplot of Total Bill vs Tip” is shown.

Figure 3.9 Scatterplot using matplotlib

scatter_plot = plt.figure()
axes1 = scatter_plot.add_subplot(1, 1, 1)
axes1.scatter(tips['total_bill'], tips['tip'])
axes1.set_title('Scatterplot of Total Bill vs Tip')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Tip')
scatter_plot.show()

3.3.2.2 Boxplot

Boxplots are used when a discrete variable is plotted against a continuous variable (Figure 3.10).

Graph titled “Boxplot of Tips by Sex” is shown.

Figure 3.10 Boxplot using matplotlib

boxplot = plt.figure()
axes1 = boxplot.add_subplot(1, 1, 1)
axes1.boxplot(
    # first argument of boxplot is the data
    # since we are plotting multiple pieces of data
    # we have to put each piece of data into a list
    [tips[tips['sex'] == 'Female']['tip'],
     tips[tips['sex'] == 'Male']['tip']],
    # we can then pass in an optional labels parameter
    # to label the data we passed
    labels=['Female', 'Male'])
axes1.set_xlabel('Sex')
axes1.set_ylabel('Tip')
axes1.set_title('Boxplot of Tips by Sex')
boxplot.show()

3.3.3 Multivariate Data

Plotting multivariate data is tricky, because there isn’t a panacea or template that can be used for every case. To illustrate the process of plotting multivariate data, let’s build on our earlier scatterplot. If we wanted to add another variable, say sex, one option would be to color the points based on the value of the third variable.

If we wanted to add a fourth variable, we could add size to the dots. The only caveat with using size as a variable is that humans are not very good at differentiating areas. Sure, if there’s an enormous dot next to a tiny one, your point will be conveyed, but smaller differences are difficult to distinguish, and may add clutter to your visualization. One way to reduce clutter is to add some value of transparency to the individual points, such that many overlapping points will show a darker region of a plot than less crowded areas.

The general rule of thumb is that different colors are much easier to distinguish than changes in size. If you have to use areas to convey differences in values, be sure that you are actually plotting relative areas. A common pitfall is to map a value to the radius of a circle, but since the formula for the area of a circle is πr², your areas are then based on a squared scale. That is not only misleading, but also wrong.
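The size of the distortion is easy to work out. In this sketch the three values are hypothetical; it compares the pitfall (value mapped to radius) with the fix (value mapped to area).

```python
import math

values = [1, 2, 4]  # hypothetical data values to draw as circles

# pitfall: value -> radius, so area = pi * value ** 2
areas_from_radius = [math.pi * v ** 2 for v in values]

# fix: value -> area, so radius = sqrt(value / pi)
radii_for_area = [math.sqrt(v / math.pi) for v in values]
areas_from_area = [math.pi * r ** 2 for r in radii_for_area]

# how many times larger the biggest circle is than the smallest
ratio_radius_mapping = areas_from_radius[-1] / areas_from_radius[0]
ratio_area_mapping = areas_from_area[-1] / areas_from_area[0]
print(ratio_radius_mapping, ratio_area_mapping)
```

A value that is 4 times larger is drawn 16 times larger under the radius mapping. Note that the s parameter of matplotlib's scatter is already an area (it is given in points squared), so passing values to s scales the area directly.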

Colors are also difficult to pick. Humans do not perceive hues on a linear scale, so you need to think carefully when picking color palettes. Luckily matplotlib2 and seaborn3 come with their own set of color palettes, and tools like colorbrewer4 can help you pick good color palettes.

2. matplotlib colormaps: http://matplotlib.org/users/colormaps.html

3. seaborn color palettes: http://stanford.edu/~mwaskom/software/seaborn-dev/tutorial/color_palettes.html

4. colorbrewer color palettes http://colorbrewer2.org/

Figure 3.11 uses color to add a third variable, sex, to our scatterplot, and point size to add a fourth, the size of the party.

Graph titled “Total Bill vs Tip colored by Sex and sized by Size” is shown.

Figure 3.11 Scatterplot using matplotlib with color

# create a color variable based on sex
def recode_sex(sex):
    if sex == 'Female':
        return 0
    else:
        return 1

tips['sex_color'] = tips['sex'].apply(recode_sex)

scatter_plot = plt.figure()
axes1 = scatter_plot.add_subplot(1, 1, 1)
axes1.scatter(
    x=tips['total_bill'],
    y=tips['tip'],

    # set the size of the dots based on party size
    # we multiply the values by 10 to make the points bigger
    # and to emphasize the differences
    s=tips['size'] * 10,

    # set the color for the sex
    c=tips['sex_color'],

    # set the alpha value so points are more transparent
    # this helps with overlapping points
    alpha=0.5)

axes1.set_title('Total Bill vs Tip Colored by Sex and Sized by Size')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Tip')
scatter_plot.show()
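The recoding step does not have to be a separate function. As an alternative sketch, a dictionary passed to the map method gives the same 0/1 codes; the small data frame below is built by hand from the first four rows printed earlier, and the names tips_small and color_codes are introduced here for illustration.

```python
import pandas as pd

# first four rows of the tips data, typed in by hand
tips_small = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "sex": ["Female", "Male", "Male", "Male"],
})

# look up each value of sex in the dictionary
# astype(str) is a precaution in case the column is stored as a category
color_codes = {"Female": 0, "Male": 1}
tips_small["sex_color"] = tips_small["sex"].astype(str).map(color_codes)

print(tips_small)
```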

3.4 Seaborn

The matplotlib library can be thought of as the core foundational plotting tool in Python. seaborn builds on matplotlib by providing a higher-level interface for statistical graphics. It provides an interface to produce prettier and more complex visualizations with fewer lines of code.

The seaborn library is tightly integrated with Pandas and the rest of the PyData stack (numpy, scipy, statsmodels), making visualizations from any part of the data analysis process a breeze. Since seaborn is built on top of matplotlib, the user still has the ability to fine-tune the visualizations.

We’ve already loaded the seaborn library so that we could access its data sets.

# load seaborn if you have not done so already
import seaborn as sns

tips = sns.load_dataset("tips")

3.4.1 Univariate

Just like we did with the matplotlib examples, we will make a series of univariate plots.

3.4.1.1 Histograms

Histograms are created using sns.distplot5 (Figure 3.12).

5. seaborn distplot documentation: https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html#seaborn.distplot

Graph titled “Total Bill Histogram with Density Plot” is shown.

Figure 3.12 Seaborn distplot

# this subplots function is a shortcut for
# creating separate figure objects and
# adding individual subplots (axes) to the figure
hist, ax = plt.subplots()

# use the distplot function from seaborn to create our plot
ax = sns.distplot(tips['total_bill'])
ax.set_title('Total Bill Histogram with Density Plot')

plt.show() # we still need matplotlib.pyplot to show the figure

The default distplot will plot both a histogram and a density plot (using a kernel density estimation). If we just want the histogram, we can set the kde parameter to False. The results are shown in Figure 3.13.

Graph titled “Total Bill Histogram” is shown.

Figure 3.13 Seaborn distplot

hist, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], kde=False)
ax.set_title('Total Bill Histogram')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Frequency')
plt.show()

3.4.1.2 Density Plot (Kernel Density Estimation)

Density plots are another way to visualize a univariate distribution (Figure 3.14). In essence, they are created by drawing a normal distribution centered at each data point, and then smoothing out the overlapping plots so that the area under the curve is 1.

Graph titled “Total Bill Density” is shown.

Figure 3.14 Seaborn density plot using distplot

den, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], hist=False)
ax.set_title('Total Bill Density')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Unit Probability')
plt.show()

You can also use sns.kdeplot if you just want a density plot.
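The description of kernel density estimation given in this section can be written out in a few lines of NumPy. This is a sketch of the idea, not seaborn's implementation: the function name gaussian_kde_sketch and the fixed bandwidth of 3.0 are assumptions, whereas real tools estimate the bandwidth from the data.

```python
import numpy as np


def gaussian_kde_sketch(data, grid, bandwidth):
    """Average of normal curves, one centered at each data point."""
    data = np.asarray(data, dtype=float)
    # distance of every grid point from every data point, in bandwidth units
    z = (grid[:, None] - data[None, :]) / bandwidth
    # one normal density per data point, evaluated along the grid
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # averaging the curves keeps the total area at 1
    return kernels.mean(axis=1)


bills = [16.99, 10.34, 21.01, 23.68, 24.59]  # first five total_bill values
bandwidth = 3.0  # hypothetical smoothing width
grid = np.linspace(min(bills) - 5 * bandwidth, max(bills) + 5 * bandwidth, 500)
density = gaussian_kde_sketch(bills, grid, bandwidth)

# trapezoid rule: area under the estimated curve
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual points more closely.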

3.4.1.3 Rug Plot

Rug plots are a one-dimensional representation of a variable’s distribution. They are typically used with other plots to enhance a visualization. Figure 3.15 shows a histogram overlaid with a density plot and a rug plot on the bottom.

Graph titled “Total Bill Histogram with Density and Rug Plot” is shown.

Figure 3.15 Seaborn distplot with rugs

hist_den_rug, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], rug=True)
ax.set_title('Total Bill Histogram with Density and Rug Plot')
ax.set_xlabel('Total Bill')
plt.show()

3.4.1.4 Count Plot (Bar Plot)

Bar plots are very similar to histograms, but instead of binning values to produce a distribution, bar plots can be used to count discrete variables. A count plot (Figure 3.16) is used for this purpose.

Seaborn plot titled “Count of days” is shown.

Figure 3.16 Seaborn count plot

count, ax = plt.subplots()
ax = sns.countplot(x='day', data=tips)
ax.set_title('Count of days')
ax.set_xlabel('Day of the Week')
ax.set_ylabel('Frequency')
plt.show()
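The heights of the bars in a count plot are simply the number of times each value occurs. A sketch of that calculation with Pandas, using a short made-up list of days rather than the tips data:

```python
import pandas as pd

# hypothetical values for the day column
days = pd.Series(["Sun", "Sun", "Sat", "Thur", "Sat", "Sat", "Fri"])

# count how many times each distinct value appears
day_counts = days.value_counts()
print(day_counts)
```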

3.4.2 Bivariate Data

We will now use the seaborn library to plot two variables.

3.4.2.1 Scatterplot

There are a few ways to create a scatterplot in seaborn. There is no explicit function named scatter. Instead, we use regplot. It will plot a scatterplot and also fit a regression line. If we set fit_reg=False, the visualization will show only the scatterplot (Figure 3.17).

Seaborn scatterplot using regplot is shown.

Figure 3.17 Seaborn scatterplot using regplot

scatter, ax = plt.subplots()
ax = sns.regplot(x='total_bill', y='tip', data=tips)
ax.set_title('Scatterplot of Total Bill and Tip')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')
plt.show()

A similar function, lmplot, can also create scatterplots. Internally, lmplot calls regplot, so regplot is a more general plotting function. The main difference is that regplot creates axes (Figure 3.6) whereas lmplot creates a figure (Figure 3.18).

Seaborn scatterplot using lmplot is shown.

Figure 3.18 Seaborn scatterplot using lmplot

fig = sns.lmplot(x='total_bill', y='tip', data=tips)
plt.show()

We can also create a scatterplot that includes a univariate plot on each axis using jointplot (Figure 3.19). One major difference is that jointplot does not return axes, so we do not need to create a figure with axes on which to place our plot. Instead, this function creates a JointGrid object.

Seaborn scatterplot using jointplot is shown.

Figure 3.19 Seaborn scatterplot using jointplot

joint = sns.jointplot(x='total_bill', y='tip', data=tips)
joint.set_axis_labels(xlabel='Total Bill', ylabel='Tip')

# add a title, set font size,
# and move the text above the total bill axes
joint.fig.suptitle('Joint Plot of Total Bill and Tip',
                fontsize=10, y=1.03)

3.4.2.2 Hexbin Plot

Scatterplots are great for comparing two variables. However, sometimes there are too many points for a scatterplot to be meaningful. One way to get around this issue is to bin points on the plot together. Just as histograms can bin a variable to create a bar, so hexbin can bin two variables (Figure 3.20). A hexagon is used for this purpose because it is the most efficient shape to cover an arbitrary 2D surface. This is an example of seaborn building on top of matplotlib, as hexbin is a matplotlib function.

Seaborn hexbin plot using jointplot is shown.

Figure 3.20 Seaborn hexbin plot using jointplot

hexbin = sns.jointplot(x="total_bill", y="tip", data=tips, kind="hex")
hexbin.set_axis_labels(xlabel='Total Bill', ylabel='Tip')
hexbin.fig.suptitle('Hexbin Joint Plot of Total Bill and Tip',
                 fontsize=10, y=1.03)

3.4.2.3 2D Density Plot

You can also create a 2D kernel density plot. This kind of process is similar to how sns.kdeplot works, except it creates a density plot across two variables. The bivariate plot can be shown on its own (Figure 3.21), or you can place the two univariate plots next to each other using jointplot (Figure 3.22).

Seaborn KDE plot titled “Kernel Density Plot of Total Bill and Tip” is shown.

Figure 3.21 Seaborn KDE plot

Seaborn KDE plot using jointplot is shown.

Figure 3.22 Seaborn KDE plot using jointplot

kde, ax = plt.subplots()
ax = sns.kdeplot(data=tips['total_bill'],
                 data2=tips['tip'],
                 shade=True) # shade will fill in the contours
ax.set_title('Kernel Density Plot of Total Bill and Tip')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')
plt.show()

kde_joint = sns.jointplot(x='total_bill', y='tip',
                          data=tips, kind='kde')

3.4.2.4 Bar Plot

Bar plots can also be used to show multiple variables. By default, barplot will calculate a mean (Figure 3.23), but you can pass any function into the estimator parameter. For example, you could pass in the numpy.std function to calculate the standard deviation.

Seaborn bar plot titled “Barplot of average total bill for time of day” is shown.

Figure 3.23 Seaborn bar plot using the default mean calculation

bar, ax = plt.subplots()
ax = sns.barplot(x='time', y='total_bill', data=tips)
ax.set_title('Bar plot of average total bill for time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Average total bill')
plt.show()
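The bar heights can be reproduced with a groupby, which also shows what changing the estimator does. This is a sketch on a small hypothetical table, not the tips data, and it illustrates the calculation rather than seaborn's internals.

```python
import numpy as np
import pandas as pd

# hypothetical bills, used only for illustration
bills = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 14.0, 18.0, 16.0, 20.0, 30.0],
})

grouped = bills.groupby("time")["total_bill"]

# default estimator: one mean per category
bar_heights_mean = grouped.mean()

# a different estimator: numpy.std applied to the values in each category
bar_heights_std = grouped.agg(lambda s: np.std(s.to_numpy()))

print(bar_heights_mean)
print(bar_heights_std)
```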

3.4.2.5 Boxplot

Unlike the previously mentioned plots, a boxplot (Figure 3.24) shows multiple statistics: the minimum, first quartile, median, third quartile, maximum, and, if applicable, outliers based on the interquartile range.

Seaborn boxplot titled “Box plot of total bill by time of day” is shown.

Figure 3.24 Seaborn boxplot of total bill by time of day

The y parameter is optional. If it is omitted, the plotting function will create a single box in the plot.

box, ax = plt.subplots()
ax = sns.boxplot(x='time', y='total_bill', data=tips)
ax.set_title('Boxplot of total bill by time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Total Bill')
plt.show()
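The statistics drawn in a boxplot can be calculated by hand. The sketch below uses made-up values and assumes the common convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the first or third quartile.

```python
import numpy as np

# hypothetical values with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# assumed convention: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

Points outside the fences are drawn individually, and the whiskers stop at the most extreme values that are still inside them.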

3.4.2.6 Violin Plot

Boxplots are a classical statistical visualization, but they can obscure the underlying distribution of the data. Violin plots (Figure 3.25) are able to show the same values as a boxplot, but plot the “boxes” as a kernel density estimation. This can help retain more visual information about your data since only plotting summary statistics can be misleading, as seen by the Anscombe quartet.

Seaborn violin plot titled “Violin plot of total bill by time of day” is shown.

Figure 3.25 Seaborn violin plot of total bill by time of day

violin, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill', data=tips)
ax.set_title('Violin plot of total bill by time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Total Bill')
plt.show()

3.4.2.7 Pairwise Relationships

When you have mostly numeric data, visualizing all of the pairwise relationships can be easily performed using pairplot. This function will plot a scatterplot between each pair of variables, and a histogram for the univariate data (Figure 3.26).

fig = sns.pairplot(tips)

Seaborn pair plot is displayed.

Figure 3.26 Seaborn pair plot

One drawback when using pairplot is that there is redundant information; that is, the top half of the visualization is the same as the bottom half. We can use PairGrid to manually assign the plots for the top half and bottom half. This plot is shown in Figure 3.27.

Seaborn pair plot is displayed.

Figure 3.27 Seaborn pair plot with different plots on the upper and lower halves

pair_grid = sns.PairGrid(tips)
# we can use plt.scatter instead of sns.regplot
pair_grid = pair_grid.map_upper(sns.regplot)
pair_grid = pair_grid.map_lower(sns.kdeplot)
pair_grid = pair_grid.map_diag(sns.distplot, rug=True)
plt.show()

3.4.3 Multivariate Data

As mentioned in Section 3.3.3, there is no de facto template for plotting multivariate data. Possible ways to include more information are to use color, size, and shape to distinguish data within the plot.

3.4.3.1 Colors

When we are using violinplot, we can pass the hue parameter to color the plot by sex. We can reduce the redundant information by having each half of the violins represent a different sex, as shown in Figure 3.28. Try the following code with and without the split parameter.

Seaborn violin plot is shown.

Figure 3.28 Seaborn violin plot with hue parameter

violin, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill',
                    hue='sex', data=tips,
                    split=True)
plt.show()

The hue parameter can be passed into various other plotting functions as well. Figure 3.29 shows its use in an lmplot:

Seaborn lmplot plot with hue parameter is shown.

Figure 3.29 Seaborn lmplot plot with hue parameter

# note the use of lmplot instead of regplot here
scatter = sns.lmplot(x='total_bill', y='tip', data=tips,
                     hue='sex', fit_reg=False)
plt.show()

We can make our pairwise plots a little more meaningful by passing one of the categorical variables as the hue parameter. Figure 3.30 shows this approach in our pairplot.

Seaborn pair plot is displayed.

Figure 3.30 Seaborn pair plot with hue parameter

fig = sns.pairplot(tips, hue='sex')

3.4.3.2 Size and Shape

Working with point sizes can be another means to add more information to a plot. However, this option should be used sparingly, since the human eye is not very good at comparing areas.

Let’s consider an example of how seaborn works with matplotlib function calls. If you look in the documentation for lmplot,6 you’ll see that lmplot takes a parameter called {scatter, line}_kws. In other words, lmplot has two parameters, scatter_kws and line_kws. Both of these parameters take a set of key–value pairs (a Python dict, to be more exact; see Appendix K). Key–value pairs passed into scatter_kws are then passed on to the matplotlib function plt.scatter. This is how we would access the s parameter to change the size of the points, as we did in Section 3.3.3. This is shown in Figure 3.31.

6. https://web.stanford.edu/~mwaskom/software/seaborn/generated/seaborn.lmplot.html

Seaborn scatterplot passing scatter_kws is shown.

Figure 3.31 Seaborn scatterplot passing scatter_kws

scatter = sns.lmplot(x='total_bill', y='tip', data=tips,
                     fit_reg=False,
                     hue='sex',
                     scatter_kws={'s': tips['size']*10})
plt.show()
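The mechanism behind scatter_kws is ordinary Python keyword-argument unpacking. The function below is a hypothetical wrapper written for this sketch (it is not seaborn code); it shows how a dict handed to one function can be forwarded to matplotlib's scatter. The "Agg" backend is assumed so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting on screen
import matplotlib.pyplot as plt


def scatter_with_kws(ax, x, y, scatter_kws=None):
    """Hypothetical wrapper that forwards a dict of options to ax.scatter."""
    scatter_kws = {} if scatter_kws is None else dict(scatter_kws)
    # ** unpacks the dict into keyword arguments, e.g. s=80, alpha=0.5
    return ax.scatter(x, y, **scatter_kws)


fig, ax = plt.subplots()
points = scatter_with_kws(ax, [1, 2, 3], [2, 4, 3],
                          scatter_kws={"s": 80, "alpha": 0.5})

print(points.get_sizes(), points.get_alpha())
```

Any keyword that scatter understands can travel through the dictionary this way, which is why the wrapper does not need a separate parameter for each option.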

Also, when working with multiple variables, sometimes having two plot elements that show the same information is helpful. Figure 3.32 shows the use of color and shape to distinguish different values of the variable sex.

Seaborn scatterplot with markers passing scatter_kws is shown.

Figure 3.32 Seaborn scatterplot with markers passing scatter_kws

scatter = sns.lmplot(x='total_bill', y='tip', data=tips,
                     fit_reg=False, hue='sex', markers=['o', 'x'],
                     scatter_kws={'s': tips['size']*10})
plt.show()

3.4.3.3 Facets

What if we want to show more variables? Or if we know which plot we want for our visualization, but we want to make multiple plots over a categorical variable? Facets are designed to meet these needs. Instead of you needing to individually subset data and lay out the axes in a figure (as we did in Figure 3.5), facets in seaborn can handle this work for you.

To use facets, your data needs to be what Hadley Wickham7 calls “Tidy Data,”8 where each row represents an observation in the data, and each column is a variable (also known as “long data”).

7. http://hadley.nz/

8. http://vita.had.co.nz/papers/tidy-data.pdf
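If your data are not yet in this shape, Pandas can usually reshape them. As a sketch, the hypothetical wide table below stores one column per time of day; melt turns it into one row per observation, with a column for each variable.

```python
import pandas as pd

# hypothetical wide table: one column per time of day
wide = pd.DataFrame({
    "party": ["A", "B"],
    "Lunch": [12.5, 20.0],
    "Dinner": [30.0, 41.5],
})

# long (tidy) form: each row is one observation
long = wide.melt(id_vars="party", var_name="time", value_name="total_bill")
print(long)
```

The long table can be faceted on time, because time is now a variable in its own column rather than part of the column names.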

Figure 3.33 shows a re-creation of the Anscombe quartet data from Figure 3.5 in seaborn.

Seaborn Anscombe plot with facets is shown.

Figure 3.33 Seaborn Anscombe plot with facets

anscombe_plot = sns.lmplot(x='x', y='y', data=anscombe,
                           fit_reg=False,
                           col='dataset', col_wrap=2)

All we needed to do to create this visualization was to pass two more parameters into the scatterplot function in seaborn. The col parameter is the variable that the plot will facet by, and the col_wrap parameter creates a figure that has two columns. If we do not use the col_wrap parameter, all four plots will be plotted in the same row.

Section 3.4.2.1 discussed the differences between lmplot and regplot. lmplot is a figure-level function. In contrast, many of the plots we created in seaborn are axes-level functions. What this means is that not every plotting function will have col and col_wrap parameters for faceting. Instead, we must create a FacetGrid that knows which variable to facet on, and then supply the individual plot code for each facet. Figure 3.34 shows our manually created facet plot.

Seaborn plot shows manually created facets.

Figure 3.34 Seaborn plot with manually created facets

# create the FacetGrid
facet = sns.FacetGrid(tips, col='time')
# for each value in time, plot a histogram of total bill
facet.map(sns.distplot, 'total_bill', rug=True)
plt.show()
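To see what the facet grid is saving us, here is a sketch of the same idea done by hand with matplotlib and Pandas: subset the data for each value of the faceting variable, and draw each subset on its own axes. The data frame is hypothetical, and the "Agg" backend is assumed so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting on screen
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical data, used only for illustration
tips_small = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 14.0, 18.0, 16.0, 20.0, 30.0],
})

# one (value, subset) pair for each value of the faceting variable
groups = list(tips_small.groupby("time"))

# one column of axes per group; squeeze=False always returns a 2D array
fig, axes = plt.subplots(nrows=1, ncols=len(groups),
                         sharey=True, squeeze=False)

for ax, (time_value, subset) in zip(axes[0], groups):
    ax.hist(subset["total_bill"], bins=3)
    ax.set_title(f"time = {time_value}")
    ax.set_xlabel("total_bill")

fig.tight_layout()
```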

The individual facets need not be univariate plots, as seen in Figure 3.35.

Seaborn plot shows manually created facets containing multiple variables.

Figure 3.35 Seaborn plot with manually created facets that contain multiple variables

facet = sns.FacetGrid(tips, col='day', hue='sex')
facet = facet.map(plt.scatter, 'total_bill', 'tip')
facet = facet.add_legend()
plt.show()

If you wanted to continue working in seaborn, you could create the same plot using lmplot, as shown in Figure 3.36.

Seaborn plot with facets that contain multiple variables is shown.

Figure 3.36 Seaborn plot with manually created facets that contain multiple variables

fig = sns.lmplot(x='total_bill', y='tip', data=tips, fit_reg=False,
                 hue='sex', col='day')
plt.show()

Another thing you can do with facets is to have one variable be faceted on the x-axis, and another variable faceted on the y-axis. We accomplish this by passing a row parameter. The result is shown in Figure 3.37.

Seaborn plot shows manually created facets with two variables.

Figure 3.37 Seaborn plot with manually created facets with two variables

facet = sns.FacetGrid(tips, col='time', row='smoker', hue='sex')
facet.map(plt.scatter, 'total_bill', 'tip')
plt.show()

If you do not want all of the hue elements to overlap (i.e., you want this behavior in scatterplots, but not violin plots), you can use the sns.factorplot function. The result is shown in Figure 3.38.

Seaborn plot shows manually created facets with two variables.

Figure 3.38 Seaborn plot with manually created facets with two variables

# note: seaborn 0.9 renamed factorplot to catplot
facet = sns.factorplot(x='day', y='total_bill', hue='sex', data=tips,
                       row='smoker', col='time', kind='violin')

3.5 Pandas Objects

Pandas objects also come equipped with their own plotting functions. Just as in seaborn, the plotting functions built into Pandas are just wrappers around matplotlib with preset values.

In general, plotting using Pandas follows the pattern DataFrame.plot.PLOT_TYPE or Series.plot.PLOT_TYPE.
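As a sketch of this pattern, the code below draws the same histogram twice: once through the plot.hist accessor, and once through the equivalent kind argument. The values are the first five total_bill values printed earlier, and the "Agg" backend is assumed so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting on screen
import matplotlib.pyplot as plt
import pandas as pd

bills = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59], name="total_bill")

fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2)

# Series.plot.PLOT_TYPE form; the matplotlib axes is handed back to us
returned = bills.plot.hist(ax=ax_left, bins=3)

# the same plot requested through the kind argument
bills.plot(kind="hist", ax=ax_right, bins=3)

print(type(returned).__name__)
```

Because the result is a matplotlib axes, everything from Section 3.2 (titles, axis labels, and so on) still applies.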

3.5.1 Histograms

Histograms can be created using the Series.plot.hist (Figure 3.39) or DataFrame.plot.hist (Figure 3.40) function.

Graph titled “Histogram of Total Bill” is shown.

Figure 3.39 Histogram of a Pandas Series

Histogram of a Pandas DataFrame is displayed.

Figure 3.40 Histogram of a Pandas DataFrame

# on a series
fig, ax = plt.subplots()
ax = tips['total_bill'].plot.hist(ax=ax)
plt.show()

# on a dataframe
# set an alpha channel transparency
# so we can see through the overlapping bars
fig, ax = plt.subplots()
ax = tips[['total_bill', 'tip']].plot.hist(alpha=0.5, bins=20, ax=ax)
plt.show()

3.5.2 Density Plot

The kernel density estimation (density) plot can be created with the Series.plot.kde or DataFrame.plot.kde function (Figure 3.41).

Pandas KDE plot is displayed.

Figure 3.41 Pandas KDE plot

fig, ax = plt.subplots()
ax = tips['tip'].plot.kde(ax=ax)
plt.show()

3.5.3 Scatterplot

Scatterplots are created by using the DataFrame.plot.scatter function (Figure 3.42).

Pandas scatterplot is displayed.

Figure 3.42 Pandas scatterplot

fig, ax = plt.subplots()
ax = tips.plot.scatter(x='total_bill', y='tip', ax=ax)
plt.show()

3.5.4 Hexbin Plot

Hexbin plots are created using the DataFrame.plot.hexbin function (Figure 3.43).

Pandas hexbin plot is displayed.

Figure 3.43 Pandas hexbin plot

fig, ax = plt.subplots()
ax = tips.plot.hexbin(x='total_bill', y='tip', ax=ax)
plt.show()

Grid size can be adjusted with the gridsize parameter (Figure 3.44).

Pandas hexbin plot with modified grid size is displayed.

Figure 3.44 Pandas hexbin plot with modified grid size

fig, ax = plt.subplots()
ax = tips.plot.hexbin(x='total_bill', y='tip', gridsize=10, ax=ax)
plt.show()
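
To see what gridsize changes, the sketch below counts the hexagons that are drawn. It runs on made-up random data (demo is hypothetical). The gridsize value is the number of hexagons in the x direction, so a smaller value gives fewer, larger hexagons.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# made-up random data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({'total_bill': rng.uniform(5, 50, 500),
                     'tip': rng.uniform(1, 10, 500)})

fig, (ax1, ax2) = plt.subplots(1, 2)
# default gridsize (100)
demo.plot.hexbin(x='total_bill', y='tip', ax=ax1)
# coarser grid
demo.plot.hexbin(x='total_bill', y='tip', gridsize=10, ax=ax2)

# the hexagons are stored in the first collection on each Axes;
# its array holds one count per hexagon drawn
n_default = len(ax1.collections[0].get_array())
n_coarse = len(ax2.collections[0].get_array())
print(n_default, n_coarse)
# call plt.show() to display the figure
```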

3.5.5 Boxplot

Boxplots are created with the DataFrame.plot.box function (Figure 3.45).

Pandas boxplot is displayed.

Figure 3.45 Pandas boxplot

fig, ax = plt.subplots()
ax = tips.plot.box(ax=ax)
plt.show()
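
DataFrame.plot.box draws one box per numeric column and leaves non-numeric columns out. The sketch below, on made-up data (demo is hypothetical), shows this behavior and the alternative of selecting the columns explicitly before plotting.

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up values for illustration only
demo = pd.DataFrame({
    'total_bill': [17.0, 10.5, 21.0, 23.5, 24.5],
    'tip': [1.0, 1.5, 3.5, 3.0, 3.5],
    'day': ['Sun', 'Sun', 'Sat', 'Sat', 'Thur'],  # non-numeric column
})

fig, (ax1, ax2) = plt.subplots(1, 2)
# 'day' is skipped because it is not numeric
ax1 = demo.plot.box(ax=ax1)
# selecting the columns first makes the choice explicit
ax2 = demo[['total_bill', 'tip']].plot.box(ax=ax2)
# call plt.show() to display the figure
```

Selecting columns first is a small design choice that keeps the plot from changing silently if a new numeric column is added to the data later.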

3.6 Seaborn Themes and Styles

The seaborn plots shown in this chapter have all used the default plot styles. We can change the plot style with the sns.set_style function. Typically, this function is run just once at the top of your code; all subsequent plots will use the same style set.

The styles that come with seaborn are darkgrid, whitegrid, dark, white, and ticks. Figure 3.46 shows a base plot, and Figure 3.47 shows a plot with the whitegrid style.

Seaborn style baseline without grids is displayed.

Figure 3.46 Seaborn style baseline

Seaborn whitegrid style is displayed.

Figure 3.47 Seaborn whitegrid style

# initial plot for comparison
fig, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill',
                    hue='sex', data=tips,
                    split=True)
plt.show()

# set style and plot
sns.set_style('whitegrid')
fig, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill',
                    hue='sex', data=tips,
                    split=True)
plt.show()
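
To see what a style actually changes, note that sns.axes_style returns the matplotlib rc parameters a named style would apply, while sns.set_style writes them into the global matplotlib settings. A brief sketch that inspects only the grid setting:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# look up a style's settings without applying them
white = sns.axes_style('white')
whitegrid = sns.axes_style('whitegrid')
print(white['axes.grid'], whitegrid['axes.grid'])

# set_style applies the settings globally, so every
# figure created afterwards picks them up
sns.set_style('whitegrid')
print(plt.rcParams['axes.grid'])
```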

The following code shows what all the styles look like (Figure 3.48).

Five plots depict different seaborn styles.

Figure 3.48 All seaborn styles

fig = plt.figure()
seaborn_styles = ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']
for idx, style in enumerate(seaborn_styles):
    plot_position = idx + 1
    with sns.axes_style(style):
        ax = fig.add_subplot(2, 3, plot_position)
        violin = sns.violinplot(x='time', y='total_bill',
                                data=tips, ax=ax)
        violin.set_title(style)
fig.tight_layout()
plt.show()
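
Unlike sns.set_style, the with sns.axes_style(...) block used above changes the style only temporarily. The sketch below checks that the matplotlib grid setting is switched on inside the block and put back afterwards.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('white')            # global style: no grid
before = plt.rcParams['axes.grid']

with sns.axes_style('darkgrid'):  # temporary style: grid on
    inside = plt.rcParams['axes.grid']
    fig, ax = plt.subplots()      # this Axes is created with a grid

# the global style is restored once the block ends
after = plt.rcParams['axes.grid']
print(before, inside, after)
```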

3.7 Conclusion

Data visualization is an integral part of exploratory data analysis and data presentation. This chapter provided an introduction to the various ways to explore and present your data. As we continue through the book, we will learn about more complex visualizations.

A myriad of plotting and visualization resources are available on the Internet. The seaborn documentation [9], Pandas visualization documentation [10], and matplotlib documentation [11] all provide ways to further tweak your plots (e.g., colors, line thickness, legend placement, figure annotations). Other resources include colorbrewer [12], which helps you pick good color schemes. The plotting libraries mentioned in this chapter also have various color schemes that can be used to highlight the content of your visualizations.

9. seaborn documentation: https://stanford.edu/~mwaskom/software/seaborn/api.html

10. Pandas plotting documentation: http://pandas.pydata.org/pandas-docs/stable/visualization.html

11. matplotlib documentation: http://matplotlib.org/api/index.html

12. colorbrewer: http://colorbrewer2.org
