Data visualization is as much a part of the data processing step as the data presentation step. It is much easier to compare values when they are plotted than when they are listed as numbers. By visualizing data, we get a better intuitive sense of the data than would be possible by looking at tables of values alone. Additionally, visualizations can bring to light hidden patterns in data that you, the analyst, can exploit for model selection.
1. Prior knowledge
a. containers
b. using functions
c. subsetting and indexing
d. classes
2. matplotlib
3. seaborn
4. Pandas
This chapter will cover:
1. matplotlib
2. seaborn
3. Plotting in Pandas
The quintessential example for creating visualizations of data is Anscombe’s quartet. This data set was created by English statistician Frank Anscombe to show the importance of statistical graphs.
The Anscombe data set contains four sets of data, each of which contains two continuous variables. Each set has the same mean, variance, correlation, and regression line. However, only when the data are visualized does it become obvious that each set does not follow the same pattern. This goes to show the benefits of visualizations and the pitfalls of looking at only summary statistics.
# the anscombe data set can be found in the seaborn library
import seaborn as sns
anscombe = sns.load_dataset("anscombe")
print(anscombe)
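To make the claim about identical summary statistics concrete, here is a minimal sketch that computes them with only the standard library. It assumes the quartet's published values typed in by hand (rather than loaded from seaborn), and pearson_r is a helper written just for this illustration.

```python
from statistics import mean, variance

# Anscombe's published values, copied by hand; sets I-III share the same x values
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Sample Pearson correlation, written out so no extra library is needed."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = {
    name: {
        "mean_x": mean(xs), "mean_y": mean(ys),
        "var_x": variance(xs), "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

# print the statistics rounded to two decimals, one line per data set
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed lines should agree to about two decimal places, which is exactly why the numbers alone cannot tell the four sets apart.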
matplotlib
is Python’s fundamental plotting library. It is extremely flexible and gives the user full control over all elements of the plot.
Importing matplotlib
’s plotting features is a little different from our previous package imports. You can think of it as importing the package matplotlib
, with all of the plotting utilities being found under a subfolder (or subpackage) called pyplot
. Just as we imported a package and gave it an abbreviated name, we can do the same with matplotlib.pyplot
.
import matplotlib.pyplot as plt
The names of most of the basic plots will start with plt.plot
. In our example, the plotting feature takes one vector for the x-values, and a corresponding vector for the y-values (Figure 3.1).
# create a subset of the data
# contains only data set 1 from anscombe
dataset_1 = anscombe[anscombe['dataset'] == 'I']
plt.plot(dataset_1['x'], dataset_1['y'])
By default, plt.plot
will draw lines. If we want it to draw circles (points) instead, we can pass an 'o'
parameter to tell plt.plot
to use points (Figure 3.2).
plt.plot(dataset_1['x'], dataset_1['y'], 'o')
We can repeat this process for the rest of the dataset
s in our anscombe data.
# create subsets of the anscombe data
dataset_2 = anscombe[anscombe['dataset'] == 'II']
dataset_3 = anscombe[anscombe['dataset'] == 'III']
dataset_4 = anscombe[anscombe['dataset'] == 'IV']
At this point, we could make these plots individually, one at a time, but matplotlib
offers a much handier way to create subplots. That is, you can specify the dimensions of your final figure, and put in smaller plots to fit the specified dimensions. In this way, you can present your results in a single figure, instead of completely separate ones.
The subplot
syntax takes three parameters:
1. Number of rows in the figure for subplots
2. Number of columns in the figure for subplots
3. Subplot location
The subplot location is sequentially numbered, and plots are placed first in a left-to-right direction, then from top to bottom. If we try to plot this now (by just running the following code), we will get an empty figure (Figure 3.3). All we have done so far is create a figure and split the figure into a 2 × 2 grid where plots can be placed. Since no plots were created and inserted, nothing will show up.
# create the entire figure where our subplots will go
fig = plt.figure()
# tell the figure how the subplots should be laid out
# in the example, we will have
# 2 rows of plots, and each row will have 2 plots
# subplot has 2 rows and 2 columns, plot location 1
axes1 = fig.add_subplot(2, 2, 1)
# subplot has 2 rows and 2 columns, plot location 2
axes2 = fig.add_subplot(2, 2, 2)
# subplot has 2 rows and 2 columns, plot location 3
axes3 = fig.add_subplot(2, 2, 3)
# subplot has 2 rows and 2 columns, plot location 4
axes4 = fig.add_subplot(2, 2, 4)
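The left-to-right, then top-to-bottom numbering of subplot locations can be sketched in plain Python. The helper subplot_position below is hypothetical (it is not part of matplotlib); it only illustrates how a location number maps onto a row and column of the grid.

```python
def subplot_position(location, n_cols):
    """Return the zero-based (row, column) of a 1-based subplot location."""
    return divmod(location - 1, n_cols)


# the four locations of a 2 x 2 grid
positions = {loc: subplot_position(loc, n_cols=2) for loc in range(1, 5)}
print(positions)
```

Locations 1 and 2 land in the top row, and locations 3 and 4 land in the bottom row.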
We can use the plot
method on each axis to create our plot (Figure 3.4).
# add a plot to each of the axes created above
axes1.plot(dataset_1['x'], dataset_1['y'], 'o')
axes2.plot(dataset_2['x'], dataset_2['y'], 'o')
axes3.plot(dataset_3['x'], dataset_3['y'], 'o')
axes4.plot(dataset_4['x'], dataset_4['y'], 'o')
[<matplotlib.lines.Line2D at 0x7f8f96598b70>]
Finally, we can add a label to our subplots, and use the tight_layout
method to make sure the axes are spread apart from one another (Figure 3.5).
# add a small title to each subplot
axes1.set_title("dataset_1")
axes2.set_title("dataset_2")
axes3.set_title("dataset_3")
axes4.set_title("dataset_4")
# add a title for the entire figure
fig.suptitle("Anscombe Data")
# use a tight layout
fig.tight_layout()
The Anscombe data visualizations illustrate why just looking at summary statistic values can be misleading. The moment the points are visualized, it becomes clear that even though each data set has the same summary statistic values, the relationships between points vastly differ across the data sets.
To finish off the Anscombe example, we can add set_xlabel()
and set_ylabel()
to each of the subplots to add x- and y-axis labels, just as we added a title to the figure.
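One possible way to add those labels is to loop over the axes instead of writing eight separate calls. This is only a sketch: it rebuilds an empty 2 × 2 figure so that it can run on its own, leaves out the data points, and assumes the non-interactive Agg backend so that no display window is needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# build the same 2 x 2 layout as in the Anscombe example
fig = plt.figure()
axes = [fig.add_subplot(2, 2, location) for location in range(1, 5)]

for number, ax in enumerate(axes, start=1):
    ax.set_title(f"dataset_{number}")
    ax.set_xlabel("x")  # label for the horizontal axis
    ax.set_ylabel("y")  # label for the vertical axis

fig.suptitle("Anscombe Data")
fig.tight_layout()

# collect the labels back from the axes to confirm they were set
axis_labels = [(ax.get_xlabel(), ax.get_ylabel()) for ax in axes]
print(axis_labels)
```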
Before we move on and learn how to create more statistical plots, you should become familiar with the matplotlib
documentation on “Parts of a Figure.”1 I have reproduced its older figure in Figure 3.6, and the newer figure in Figure 3.7.
1. Parts of a matplotlib
figure: http://matplotlib.org/faq/usage_faq.html#parts-of-a-figure
One of the most confusing parts of plotting in Python is the use of the terms “axis” and “axes,” especially when trying to verbally describe the different parts (since they are pronounced the same way). In the Anscombe example, each individual subplot has axes. The axes contain both an x-axis and a y-axis. All four subplots together make the figure.
The remainder of the chapter shows you how to create statistical plots, first with matplotlib
and later using a higher-level plotting library that is based on matplotlib
and specifically made for statistical graphics, seaborn
.
The tips data we will be using for the next series of visualizations come from the seaborn
library. This data set contains the amount of the tips that people leave, along with several other variables: for example, the total cost of the bill, the size of the party, the day of the week, and the time of day.
We can load this data set just as we did the Anscombe data set.
tips = sns.load_dataset("tips")
print(tips.head())
In statistics jargon, the term “univariate” refers to a single variable.
Histograms are the most common means of looking at a single variable. The values are “binned,” meaning they are grouped together and plotted to show the distribution of the variable (Figure 3.8).
fig = plt.figure()
axes1 = fig.add_subplot(1, 1, 1)
axes1.hist(tips['total_bill'], bins=10)
axes1.set_title('Histogram of Total Bill')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Frequency')
fig.show()
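To see what “binning” means without any plotting library, the following sketch counts values into equal-width intervals. The function bin_counts and the bill amounts are made up for illustration; treating the largest value as part of the last bin is an assumption that mirrors a common histogram convention.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals.

    Assumes the values are not all identical (the bin width would be zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # the maximum value belongs to the last bin, not to a bin past the end
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# made-up bill amounts, for illustration only
bills = [3.0, 7.5, 10.0, 12.5, 13.0, 18.0, 21.0, 24.5, 30.0, 43.0]
counts = bin_counts(bills, bins=4)
print(counts)
```

Each count is the height of one bar in the histogram; changing the number of bins changes how coarse or fine the picture of the distribution is.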
In statistics jargon, the term “bivariate” refers to two variables.
Scatterplots are used when a continuous variable is plotted against another continuous variable (Figure 3.9).
scatter_plot = plt.figure()
axes1 = scatter_plot.add_subplot(1, 1, 1)
axes1.scatter(tips['total_bill'], tips['tip'])
axes1.set_title('Scatterplot of Total Bill vs Tip')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Tip')
scatter_plot.show()
Boxplots are used when a discrete variable is plotted against a continuous variable (Figure 3.10).
boxplot = plt.figure()
axes1 = boxplot.add_subplot(1, 1, 1)
axes1.boxplot(
# first argument of boxplot is the data
# since we are plotting multiple pieces of data
# we have to put each piece of data into a list
[tips[tips['sex'] == 'Female']['tip'],
tips[tips['sex'] == 'Male']['tip']],
# we can then pass in an optional labels parameter
# to label the data we passed
labels=['Female', 'Male'])
axes1.set_xlabel('Sex')
axes1.set_ylabel('Tip')
axes1.set_title('Boxplot of Tips by Sex')
boxplot.show()
Plotting multivariate data is tricky, because there isn’t a panacea or template that can be used for every case. To illustrate the process of plotting multivariate data, let’s build on our earlier scatterplot. If we wanted to add another variable, say sex
, one option would be to color the points based on the value of the third variable.
If we wanted to add a fourth variable, we could add size to the dots. The only caveat with using size as a variable is that humans are not very good at differentiating areas. Sure, if there’s an enormous dot next to a tiny one, your point will be conveyed, but smaller differences are difficult to distinguish, and may add clutter to your visualization. One way to reduce clutter is to add some value of transparency to the individual points, such that many overlapping points will show a darker region of a plot than less crowded areas.
The general rule of thumb is that different colors are much easier to distinguish than changes in size. If you have to use areas to convey differences in values, be sure that you are actually plotting relative areas. A common pitfall is to map a value to the radius of a circle, but since the area of a circle is πr², your areas are actually based on a squared scale. That is not only misleading, but also wrong.
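The radius pitfall is easy to check with a little arithmetic. In this sketch, radius_for_area and circle_area are helpers written for this illustration, and the two party sizes are made-up values.

```python
import math


def radius_for_area(value):
    """Radius of a circle whose area equals `value`."""
    return math.sqrt(value / math.pi)


def circle_area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2


small, large = 2, 4  # hypothetical party sizes

# mapping the value to the radius: doubling the value quadruples the area
misleading_ratio = circle_area(large) / circle_area(small)

# mapping the value to the area: doubling the value doubles the area
honest_ratio = circle_area(radius_for_area(large)) / circle_area(radius_for_area(small))

print(misleading_ratio, honest_ratio)
```

As far as I know, the s parameter of matplotlib's scatter is already expressed as an area (points squared), so passing a value to s scales the marker area rather than its radius.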
Colors are also difficult to pick. Humans do not perceive hues on a linear scale, so you need to think carefully when picking color palettes. Luckily matplotlib
2 and seaborn
3 come with their own set of color palettes, and tools like colorbrewer
4 can help you pick good color palettes.
2. matplotlib colormaps
: http://matplotlib.org/users/colormaps.html
3. seaborn
color palettes: http://stanford.edu/~mwaskom/software/seaborn-dev/tutorial/color_palettes.html
4. colorbrewer
color palettes http://colorbrewer2.org/
Figure 3.11 uses color to add a third variable, sex
, to our scatter plot.
# create a color variable based on sex
def recode_sex(sex):
    if sex == 'Female':
        return 0
    else:
        return 1
tips['sex_color'] = tips['sex'].apply(recode_sex)
scatter_plot = plt.figure()
axes1 = scatter_plot.add_subplot(1, 1, 1)
axes1.scatter(
x=tips['total_bill'],
y=tips['tip'],
# set the size of the dots based on party size
# we multiply the values by 10 to make the points bigger
# and to emphasize the differences
s=tips['size'] * 10,
# set the color for the sex
c=tips['sex_color'],
# set the alpha value so points are more transparent
# this helps with overlapping points
alpha=0.5)
axes1.set_title('Total Bill vs Tip Colored by Sex and Sized by Size')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Tip')
scatter_plot.show()
The matplotlib
library can be thought of as the core foundational plotting tool in Python. seaborn
builds on matplotlib
by providing a higher-level interface for statistical graphics. It provides an interface to produce prettier and more complex visualizations with fewer lines of code.
The seaborn
library is tightly integrated with Pandas
and the rest of the PyData stack (numpy
, scipy
, statsmodels
), making visualizations from any part of the data analysis process a breeze. Since seaborn
is built on top of matplotlib
, the user still has the ability to fine-tune the visualizations.
We’ve already loaded the seaborn
library so that we could access its data sets.
# load seaborn if you have not done so already
import seaborn as sns
tips = sns.load_dataset("tips")
Just like we did with the matplotlib
examples, we will make a series of univariate plots.
Histograms are created using sns.distplot
5 (Figure 3.12).
5. seaborn distplot
documentation: https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html#seaborn.distplot
# this subplots function is a shortcut for
# creating separate figure objects and
# adding individual subplots (axes) to the figure
hist, ax = plt.subplots()
# use the distplot function from seaborn to create our plot
ax = sns.distplot(tips['total_bill'])
ax.set_title('Total Bill Histogram with Density Plot')
plt.show() # we still need matplotlib.pyplot to show the figure
The default distplot
will plot both a histogram and a density plot (using a kernel density estimation). If we just want the histogram, we can set the kde
parameter to False
. The results are shown in Figure 3.13.
hist, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], kde=False)
ax.set_title('Total Bill Histogram')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Frequency')
plt.show()
Density plots are another way to visualize a univariate distribution (Figure 3.14). In essence, they are created by drawing a normal distribution centered at each data point, and then smoothing out the overlapping plots so that the area under the curve is 1.
den, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], hist=False)
ax.set_title('Total Bill Density')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Unit Probability')
plt.show()
You can also use sns.kdeplot
if you just want a density plot.
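The idea behind a density plot, one normal curve per data point, averaged so the total area is 1, can be sketched without seaborn. This is a simplified illustration, not how seaborn computes its estimate: gaussian_kde is a helper written for this example, the bill values are made up, and the bandwidth is picked by hand rather than estimated from the data.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a density function built from one normal curve per data point."""
    def normal_pdf(x, center):
        z = (x - center) / bandwidth
        return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # average the individual curves so the total area stays at 1
        return sum(normal_pdf(x, p) for p in points) / len(points)

    return density


bills = [10.0, 12.0, 13.5, 20.0, 21.0, 35.0]  # made-up values
density = gaussian_kde(bills, bandwidth=3.0)  # bandwidth chosen by hand

# approximate the area under the curve with a simple Riemann sum
step = 0.05
grid = [i * step for i in range(-400, 1400)]  # covers -20 to 70
area = sum(density(x) * step for x in grid)
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that can hide detail.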
Rug plots are a one-dimensional representation of a variable’s distribution. They are typically used with other plots to enhance a visualization. Figure 3.15 shows a histogram overlaid with a density plot and a rug plot on the bottom.
hist_den_rug, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], rug=True)
ax.set_title('Total Bill Histogram with Density and Rug Plot')
ax.set_xlabel('Total Bill')
plt.show()
Bar plots are very similar to histograms, but instead of binning values to produce a distribution, bar plots can be used to count discrete variables. A count plot (Figure 3.16) is used for this purpose.
count, ax = plt.subplots()
ax = sns.countplot(x='day', data=tips)
ax.set_title('Count of days')
ax.set_xlabel('Day of the Week')
ax.set_ylabel('Frequency')
plt.show()
We will now use the seaborn
library to plot two variables.
There are a few ways to create a scatterplot in seaborn
. There is no explicit function named scatter
. Instead, we use regplot
. It will plot a scatterplot and also fit a regression line. If we set fit_reg=False
, the visualization will show only the scatterplot (Figure 3.17).
scatter, ax = plt.subplots()
ax = sns.regplot(x='total_bill', y='tip', data=tips)
ax.set_title('Scatterplot of Total Bill and Tip')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')
plt.show()
A similar function, lmplot
, can also create scatterplots. Internally, lmplot
calls regplot
, so regplot
is a more general plotting function. The main difference is that regplot
creates axes (Figure 3.6) whereas lmplot
creates a figure (Figure 3.18).
fig = sns.lmplot(x='total_bill', y='tip', data=tips)
plt.show()
We can also create a scatterplot that includes a univariate plot on each axis using jointplot
(Figure 3.19). One major difference is that jointplot
does not return axes, so we do not need to create a figure with axes on which to place our plot. Instead, this function creates a JointGrid
object.
joint = sns.jointplot(x='total_bill', y='tip', data=tips)
joint.set_axis_labels(xlabel='Total Bill', ylabel='Tip')
# add a title, set font size,
# and move the text above the total bill axes
joint.fig.suptitle('Joint Plot of Total Bill and Tip',
fontsize=10, y=1.03)
Scatterplots are great for comparing two variables. However, sometimes there are too many points for a scatterplot to be meaningful. One way to get around this issue is to bin points on the plot together. Just as histograms can bin a variable to create a bar, so hexbin
can bin two variables (Figure 3.20). A hexagon is used for this purpose because it is the most efficient shape to cover an arbitrary 2D surface. This is an example of seaborn
building on top of matplotlib
, as hexbin
is a matplotlib
function.
hexbin = sns.jointplot(x="total_bill", y="tip", data=tips, kind="hex")
hexbin.set_axis_labels(xlabel='Total Bill', ylabel='Tip')
hexbin.fig.suptitle('Hexbin Joint Plot of Total Bill and Tip',
fontsize=10, y=1.03)
You can also create a 2D kernel density plot. This kind of process is similar to how sns.kdeplot
works, except it creates a density plot across two variables. The bivariate plot can be shown on its own (Figure 3.21), or you can place the two univariate plots next to each other using jointplot
(Figure 3.22).
kde, ax = plt.subplots()
ax = sns.kdeplot(data=tips['total_bill'],
data2=tips['tip'],
shade=True) # shade will fill in the contours
ax.set_title('Kernel Density Plot of Total Bill and Tip')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')
plt.show()
kde_joint = sns.jointplot(x='total_bill', y='tip',
data=tips, kind='kde')
Bar plots can also be used to show multiple variables. By default, barplot
will calculate a mean (Figure 3.23), but you can pass any function into the estimator
parameter. For example, you could pass in the numpy.std
function to calculate the standard deviation.
bar, ax = plt.subplots()
ax = sns.barplot(x='time', y='total_bill', data=tips)
ax.set_title('Bar plot of average total bill for time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Average total bill')
plt.show()
Unlike the previously mentioned plots, a boxplot (Figure 3.24) shows multiple statistics: the minimum, first quartile, median, third quartile, maximum, and, if applicable, outliers based on the interquartile range.
The y
parameter is optional. If it is omitted, the plotting function will create a single box in the plot.
box, ax = plt.subplots()
ax = sns.boxplot(x='time', y='total_bill', data=tips)
ax.set_title('Boxplot of total bill by time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Total Bill')
plt.show()
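The statistics a boxplot draws can be computed by hand. The sketch below assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles count as outliers, and it uses made-up values; box_stats is a helper written for this illustration, and plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # whiskers end at the most extreme values that are still inside the fences
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up values
print(stats)
```

Here the value 100 falls outside the upper fence, so it would be drawn as an individual outlier point rather than stretching the whisker.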
Boxplots are a classical statistical visualization, but they can obscure the underlying distribution of the data. Violin plots (Figure 3.25) are able to show the same values as a boxplot, but plot the “boxes” as a kernel density estimation. This can help retain more visual information about your data since only plotting summary statistics can be misleading, as seen by the Anscombe quartet.
violin, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill', data=tips)
ax.set_title('Violin plot of total bill by time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Total Bill')
plt.show()
When you have mostly numeric data, visualizing all of the pairwise relationships can be easily performed using pairplot
. This function will plot a scatterplot between each pair of variables, and a histogram for the univariate data (Figure 3.26).
fig = sns.pairplot(tips)
One drawback when using pairplot
is that there is redundant information; that is, the top half of the visualization is the same as the bottom half. We can use PairGrid
to manually assign the plots for the top half and bottom half. This plot is shown in Figure 3.27.
pair_grid = sns.PairGrid(tips)
# we can use plt.scatter instead of sns.regplot
pair_grid = pair_grid.map_upper(sns.regplot)
pair_grid = pair_grid.map_lower(sns.kdeplot)
pair_grid = pair_grid.map_diag(sns.distplot, rug=True)
plt.show()
As mentioned in Section 3.3.3, there is no de facto template for plotting multivariate data. Possible ways to include more information are to use color, size, and shape to distinguish data within the plot.
When we are using violinplot
, we can pass the hue
parameter to color the plot by sex
. We can reduce the redundant information by having each half of the violins represent a different sex
, as shown in Figure 3.28. Try the following code with and without the split
parameter.
violin, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill',
hue='sex', data=tips,
split=True)
plt.show()
The hue
parameter can be passed into various other plotting functions as well. Figure 3.29 shows its use in a lmplot
:
# note the use of lmplot instead of regplot here
scatter = sns.lmplot(x='total_bill', y='tip', data=tips,
hue='sex', fit_reg=False)
plt.show()
We can make our pairwise plots a little more meaningful by passing one of the categorical variables as the hue
parameter. Figure 3.30 shows this approach in our pairplot
.
fig = sns.pairplot(tips, hue='sex')
Working with point sizes can be another means to add more information to a plot. However, this option should be used sparingly, since the human eye is not very good at comparing areas.
Let’s consider an example of how seaborn
works with matplotlib
function calls. If you look in the documentation for lmplot
,6 you’ll see that lmplot
takes a parameter called {scatter,line}_kws
. In other words, there are two parameters in lmplot
called scatter_kws
and line_kws
. Both of these parameters take a key–value pair—a Python dict
(dictionary) to be more exact (Appendix K). Key–value pairs passed into scatter_kws
are then passed on to the matplotlib
function plt.scatter
. This is how we would access the s
parameter to change the size of the points, as we did in Section 3.3.3. This is shown in Figure 3.31.
6. https://web.stanford.edu/~mwaskom/software/seaborn/generated/seaborn.lmplot.html
scatter = sns.lmplot(x='total_bill', y='tip', data=tips,
fit_reg=False,
hue='sex',
scatter_kws={'s': tips['size']*10})
plt.show()
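The way a dictionary of key–value pairs gets handed on to another function can be sketched in plain Python. Both functions below are hypothetical stand-ins (draw_points for a low-level call such as plt.scatter, high_level_plot for a wrapper such as lmplot); they only illustrate the forwarding pattern and are not how seaborn is implemented.

```python
def draw_points(x, y, s=20, alpha=1.0):
    """Hypothetical low-level function, standing in for a call like plt.scatter."""
    return {"n_points": len(x), "s": s, "alpha": alpha}


def high_level_plot(x, y, scatter_kws=None):
    """Hypothetical wrapper that forwards a dictionary of options."""
    scatter_kws = scatter_kws or {}
    # ** unpacks the dictionary into keyword arguments for the inner call
    return draw_points(x, y, **scatter_kws)


result = high_level_plot([1, 2, 3], [4, 5, 6], scatter_kws={"s": 50})
print(result)
```

The wrapper never needs to know which options exist; any key in the dictionary that the inner function accepts is passed straight through, and keys that are left out keep their defaults.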
Also, when working with multiple variables, sometimes having two plot elements that show the same information is helpful. Figure 3.32 shows the use of color and shape to distinguish different values of the variable sex
.
scatter = sns.lmplot(x='total_bill', y='tip', data=tips,
fit_reg=False, hue='sex', markers=['o', 'x'],
scatter_kws={'s': tips['size']*10})
plt.show()
What if we want to show more variables? Or if we know which plot we want for our visualization, but we want to make multiple plots over a categorical variable? Facets are designed to meet these needs. Instead of you needing to individually subset data and lay out the axes in a figure (as we did in Figure 3.5), facets in seaborn
can handle this work for you.
To use facets, your data needs to be what Hadley Wickham7 calls “Tidy Data,”8 where each row represents an observation in the data, and each column is a variable (also known as “long data”).
8. http://vita.had.co.nz/papers/tidy-data.pdf
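As a rough illustration of what “long” data looks like, the sketch below reshapes a small made-up “wide” table, where each day is its own column, into one row per observation. The helper to_long and the sample values are hypothetical; in Pandas, the melt method performs this kind of reshaping.

```python
# made-up wide table: one row per server, one column of tips per day
wide = [
    {"server": "A", "Thur": 2.0, "Fri": 3.5},
    {"server": "B", "Thur": 1.5, "Fri": 4.0},
]


def to_long(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({id_column: row[id_column],
                              variable_name: column,
                              value_name: value})
    return long_rows


long = to_long(wide, id_column="server", variable_name="day", value_name="tip")
for row in long:
    print(row)
```

In the long form, day is a single column, so a faceting function can split the data on it directly.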
Figure 3.33 shows a re-creation of the Anscombe quartet data from Figure 3.5 in seaborn
.
anscombe_plot = sns.lmplot(x='x', y='y', data=anscombe,
fit_reg=False,
col='dataset', col_wrap=2)
All we needed to do to create this visualization was to pass two more parameters into the scatterplot function in seaborn
. The col
parameter is the variable that the plot will facet by, and the col_wrap
parameter creates a figure that has two columns. If we do not use the col_wrap
parameter, all four plots will be plotted in the same row.
Section 3.4.2.1 discussed the differences between lmplot
and regplot
. lmplot
is a figure-level function. In contrast, many of the plots we created in seaborn
are axes-level functions. What this means is that not every plotting function will have col
and col_wrap
parameters for faceting. Instead, we must create a FacetGrid
that knows which variable to facet on, and then supply the individual plot code for each facet. Figure 3.34 shows our manually created facet plot.
# create the FacetGrid
facet = sns.FacetGrid(tips, col='time')
# for each value in time, plot a histogram of total bill
facet.map(sns.distplot, 'total_bill', rug=True)
plt.show()
The individual facets need not be univariate plots, as seen in Figure 3.35.
facet = sns.FacetGrid(tips, col='day', hue='sex')
facet = facet.map(plt.scatter, 'total_bill', 'tip')
facet = facet.add_legend()
plt.show()
If you wanted to continue working in seaborn
, you could create the same plot using lmplot
, as shown in Figure 3.36.
fig = sns.lmplot(x='total_bill', y='tip', data=tips, fit_reg=False,
hue='sex', col='day')
plt.show()
Another thing you can do with facets is to have one variable be faceted on the x-axis, and another variable faceted on the y-axis. We accomplish this by passing a row
parameter. The result is shown in Figure 3.37.
facet = sns.FacetGrid(tips, col='time', row='smoker', hue='sex')
facet.map(plt.scatter, 'total_bill', 'tip')
plt.show()
If you do not want all of the hue
elements to overlap (i.e., you want this behavior in scatterplots, but not violin plots), you can use the sns.factorplot
function. The result is shown in Figure 3.38.
facet = sns.factorplot(x='day', y='total_bill', hue='sex', data=tips,
row='smoker', col='time', kind='violin')
Pandas objects also come equipped with their own plotting functions. Just as in seaborn
, the plotting functions built into Pandas are just wrappers around matplotlib
with preset values.
In general, plotting using Pandas follows the DataFrame.plot.PLOT_TYPE
or Series.plot.PLOT_TYPE
functions.
Histograms can be created using the Series.plot.hist
(Figure 3.39) or DataFrame.plot.hist
(Figure 3.40) function.
# on a series
fig, ax = plt.subplots()
ax = tips['total_bill'].plot.hist()
plt.show()
# on a dataframe
# set an alpha channel transparency
# so we can see through the overlapping bars
fig, ax = plt.subplots()
ax = tips[['total_bill', 'tip']].plot.hist(alpha=0.5, bins=20, ax=ax)
plt.show()
The kernel density estimation (density) plot can be created with the DataFrame.plot.kde
function (Figure 3.41).
fig, ax = plt.subplots()
ax = tips['tip'].plot.kde()
plt.show()
Scatterplots are created by using the DataFrame.plot.scatter
function (Figure 3.42).
fig, ax = plt.subplots()
ax = tips.plot.scatter(x='total_bill', y='tip', ax=ax)
plt.show()
Hexbin plots are created using the DataFrame.plot.hexbin
function (Figure 3.43).
fig, ax = plt.subplots()
ax = tips.plot.hexbin(x='total_bill', y='tip', ax=ax)
plt.show()
Grid size can be adjusted with the gridsize
parameter (Figure 3.44).
fig, ax = plt.subplots()
ax = tips.plot.hexbin(x='total_bill', y='tip', gridsize=10, ax=ax)
plt.show()
Boxplots are created with the DataFrame.plot.box
function (Figure 3.45).
fig, ax = plt.subplots()
ax = tips.plot.box(ax=ax)
plt.show()
The seaborn
plots shown in this chapter have all used the default plot styles. We can change the plot style with the sns.set_style
function. Typically, this function is run just once at the top of your code; all subsequent plots will use the same style set.
The styles that come with seaborn
are darkgrid
, whitegrid
, dark
, white
, and ticks
. Figure 3.46 shows a base plot, and Figure 3.47 shows a plot with the whitegrid
style.
# initial plot for comparison
fig, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill',
hue='sex', data=tips,
split=True)
plt.show()
# set style and plot
sns.set_style('whitegrid')
fig, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill',
hue='sex', data=tips,
split=True)
plt.show()
The following code shows what all the styles look like (Figure 3.48).
fig = plt.figure()
seaborn_styles = ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']
for idx, style in enumerate(seaborn_styles):
    plot_position = idx + 1
    with sns.axes_style(style):
        ax = fig.add_subplot(2, 3, plot_position)
        violin = sns.violinplot(x='time', y='total_bill',
                                data=tips, ax=ax)
        violin.set_title(style)
fig.tight_layout()
plt.show()
Data visualization is an integral part of exploratory data analysis and data presentation. This chapter provided an introduction to the various ways to explore and present your data. As we continue through the book, we will learn about more complex visualizations.
There are a myriad of plotting and visualization resources available on the Internet. The seaborn
documentation,9 Pandas visualization documentation,10 and matplotlib
documentation11 all provide ways to further tweak your plots (e.g., colors, line thickness, legend placement, figure annotations). Other resources include colorbrewer
12 to help pick good color schemes. The plotting libraries mentioned in this chapter also have various color schemes that can be used to highlight the content of your visualizations.
9. seaborn
documentation: https://stanford.edu/~mwaskom/software/seaborn/api.html
10. Pandas plotting documentation: http://pandas.pydata.org/pandas-docs/stable/visualization.html
11. matplotlib
documentation: http://matplotlib.org/api/index.html
12. colorbrewer: http://colorbrewer2.org