By the end of this chapter, you will be able to:
This chapter will cover various concepts that fall under data visualization.
Data visualization is a powerful tool that allows users to digest large amounts of data very quickly. Different types of plots serve different purposes. In business, line plots and bar graphs are commonly used to display trends over time and to compare metrics across groups, respectively. Statisticians, on the other hand, may be more interested in checking correlations between variables using a scatterplot or correlation matrix. They may also use histograms to check the distribution of a variable or boxplots to check for outliers. In politics, pie charts are widely used to compare each category's share of a total. Data visualizations can be very intricate and creative, limited only by one's imagination.
The Python library Matplotlib is a well-documented, two-dimensional plotting library that can be used to create a variety of powerful data visualizations and aims to "...make easy things easy and hard things possible" (https://matplotlib.org/index.html).
There are two approaches to creating plots using Matplotlib: the functional approach and the object-oriented approach.
In the functional approach, one figure is created with a single plot. Plots are created and customized by a collection of sequential functions. However, the functional approach does not allow us to save the plot to our environment as an object; this is possible using the object-oriented approach. In the object-oriented approach, we create a figure object and assign an axis or numerous axes for one plot or multiple subplots, respectively. We can then customize the axis or axes and call that single plot or set of multiple plots by calling the figure object.
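The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration rather than a definitive recipe: the titles are placeholders, and the `matplotlib.use("Agg")` line is an assumption that lets the script run without a display (omit it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
y = x**3

# Functional approach: pyplot functions act on an implicit "current" figure.
plt.figure()
plt.plot(x, y)
plt.title('Functional approach')

# Object-oriented approach: keep explicit references to the figure and its axes.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Object-oriented approach')
```

In the second half, `fig` and `ax` remain available as variables, so the plot can be customized or displayed again later.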
In this chapter, we will use the functional approach to create and customize line plots, bar plots, histograms, scatterplots, and box-and-whisker plots. We will then learn how to create and customize single-axis and multiple-axes plots using the object-oriented approach.
The functional approach to plotting in Matplotlib is a way of quickly generating a single-axis plot. Often, this is the approach taught to beginners. The functional approach allows the user to customize and save plots as image files in a chosen directory. In the following exercises and activities, you will learn how to build line plots, bar plots, histograms, box-and-whisker plots, and scatterplots using the functional approach.
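Saving a plot as an image file can be sketched as follows. The output directory and the file name `line_plot.png` are hypothetical, chosen only for illustration; `plt.savefig()` is the pyplot function that writes the current figure to disk.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
y = x**3

plt.figure()
plt.plot(x, y)
plt.title('x by x Cubed')

# Hypothetical output location; replace with a directory of your choice.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'line_plot.png')

# Write the current figure to disk; bbox_inches='tight' trims extra whitespace.
plt.savefig(out_path, dpi=150, bbox_inches='tight')
```

Calling `plt.savefig()` before `plt.show()` is the safer order, because some backends clear the figure once it has been shown.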
To get started with Matplotlib, we will begin by creating a line plot and go on to customize it:
import numpy as np
x = np.linspace(0, 10, 20)
y = x**3
import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()
See the resultant output here:
plt.xlabel('Linearly Spaced Numbers')
plt.ylabel('y Value')
plt.title('x by x Cubed')
plt.plot(x, y, 'k')
Print the plot to the console using plt.show().
Check out the following screenshot for the resultant output:
plt.plot(x, y, 'Dk')
See the figure below for the resultant output:
plt.plot(x, y, 'D-k')
Refer to the following figure to see the output:
plt.title('x by x Cubed', fontsize=22)
plt.show()
Here, we used the functional approach to create a single-line line plot and styled it to make it more aesthetically pleasing. However, it is not uncommon to compare multiple trends in a single plot. Thus, the next exercise will detail plotting multiple lines on a line plot and creating a legend to discern the lines.
Matplotlib makes adding another line to a line plot very easy by simply specifying another plt.plot() instance. In this exercise, we will plot the lines for x-cubed and x-squared using separate lines:
y2 = x**2
Refer to the output here:
plt.plot(x, y2, '--r')
The output is shown in the following figure:
plt.plot(x, y, 'D-k', label='x cubed')
plt.plot(x, y2, '--r', label='x squared')
Check out the following screenshot for the resultant output:
plt.title('As x increases, x Cubed (black) increases at a Greater Rate than x Squared (red)', fontsize=22)
Check the output in the following screenshot:
To see the output, refer to the following figure:
In this exercise, we learned how to create and style a single- and multi-line plot in Matplotlib using the functional approach. To help solidify our learning, we will plot another single-line plot with slightly different styling.
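As a recap, the multi-line plot from this exercise can be assembled in one place. This is a sketch under the assumption that the legend is drawn with `plt.legend()`, which picks up the `label` arguments passed to each `plt.plot()` call; the axis labels are reused from the earlier exercise.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
y = x**3
y2 = x**2

plt.figure()
plt.plot(x, y, 'D-k', label='x cubed')     # black solid line, diamond markers
plt.plot(x, y2, '--r', label='x squared')  # red dashed line
plt.xlabel('Linearly Spaced Numbers')
plt.ylabel('y Value')

# Build the legend from the label= arguments given above.
legend = plt.legend()
```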
In this activity, we will create a line plot to analyze month-to-month trends for items sold in the months January through June. The trend will be positive and linear, and will be represented using a dotted, blue line with star markers. The x-axis will be labeled 'Month' and the y-axis will be labeled 'Items Sold'. The title will say 'Items Sold has been Increasing Linearly':
Check out the following screenshot for the resultant output:
We can refer to the solution for this activity on page 333.
So far, we have gained a lot of practice creating and customizing line plots. Line plots are commonly used for displaying trends. However, when comparing values between and/or among groups, bar plots are traditionally the visualization of choice. In the following exercise, we will explore how to create a bar plot.
In this exercise, we will be displaying sales revenue by item type:
x = ['Shirts', 'Pants','Shorts','Shoes']
y = [1000, 1200, 800, 1800]
import matplotlib.pyplot as plt
plt.bar(x, y)
plt.show()
The following screenshot shows the resultant output:
plt.title('Sales Revenue by Item Type')
plt.xlabel('Item Type')
plt.ylabel('Sales Revenue ($)')
The following screenshot shows the output:
index_of_max_y = y.index(max(y))
most_sold_item = x[index_of_max_y]
plt.title('{} Produce the Most Sales Revenue'.format(most_sold_item))
Check the following output:
The output is shown in the following screenshot:
Remember, when a bar plot is changed from vertical to horizontal, the x- and y-axis labels need to be switched.
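A horizontal version of the bar plot might look like the following sketch. It assumes `plt.barh()` is used for the horizontal bars and reuses the data and programmatic title from this exercise; note that the revenue label moves to the x-axis and the item label to the y-axis.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

x = ['Shirts', 'Pants', 'Shorts', 'Shoes']
y = [1000, 1200, 800, 1800]

plt.figure()
bars = plt.barh(x, y)  # categories on the y-axis, values along the x-axis

# The axis labels are swapped relative to the vertical bar plot.
plt.xlabel('Sales Revenue ($)')
plt.ylabel('Item Type')

# Programmatic title: look up the item with the largest revenue.
most_sold_item = x[y.index(max(y))]
plt.title('{} Produce the Most Sales Revenue'.format(most_sold_item))
```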
Check out the following output for the final bar plot:
In the previous exercise, we learned how to create a bar plot. Building bar plots using Matplotlib is straightforward. In the following activity, we will continue to practice building bar plots.
In this activity, we will be creating a bar plot comparing the number of NBA championships among the five franchises with the most titles. The plot will be sorted so that the franchise with the greatest number of titles is on the left and the franchise with the least is on the right. The bars will be red, the x-axis will be titled 'NBA Franchises', the y-axis will be titled 'Number of Championships', and the title will be programmatic, explaining which franchise has the most titles and how many they have. Before working on this activity, make sure to research the required NBA franchise data online. Additionally, we will rotate the x tick labels 45 degrees using plt.xticks(rotation=45) so that they do not overlap, and we will save our plot to the current directory:
We can refer to the solution for this activity on page 334.
Line plots and bar plots are two very common and effective types of visualizations for reporting trends and comparing groups, respectively. However, for deeper statistical analyses, it is important to generate graphs that uncover characteristics of features not apparent with line plots and bar plots. Thus, in the following exercises, we will run through creating common statistical plots.
In statistics, it is essential to be aware of the distribution of continuous variables prior to running any type of analysis. To display the distribution, we will use a histogram. Histograms display the frequency by the bin for a given array:
import numpy as np
y = np.random.normal(loc=0, scale=0.1, size=100)
plt.hist(y, bins=20)
plt.xlabel('y Value')
plt.ylabel('Frequency')
When we look at a histogram, we often want to determine whether the distribution is normal. Visual inspection can mislead: a distribution may appear normal when it is not, and it may appear non-normal when it is in fact normal. A formal test for normality is the Shapiro-Wilk test. The null hypothesis for the Shapiro-Wilk test is that the data is normally distributed. Thus, a p-value < 0.05 leads us to reject the null hypothesis and conclude that the distribution is not normal, while a p-value > 0.05 means we fail to reject the null hypothesis and treat the data as consistent with a normal distribution. We will use the results from the Shapiro-Wilk test to create a programmatic title communicating to the reader whether the distribution is normal or not.
from scipy.stats import shapiro
shap_w, shap_p = shapiro(y)
if shap_p > 0.05:
    normal_YN = 'Fail to reject the null hypothesis. Data is normally distributed.'
else:
    normal_YN = 'Null hypothesis is rejected. Data is not normally distributed.'
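Putting the histogram and the Shapiro-Wilk title together gives something like the sketch below. The fixed seed and the use of `np.random.default_rng` are assumptions added for reproducibility; the exercise itself draws from `np.random.normal` without a seed, so its result will differ from run to run.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import shapiro

# Assumption: a seeded generator so the example is repeatable.
rng = np.random.default_rng(0)
y = rng.normal(loc=0, scale=0.1, size=100)

# Shapiro-Wilk returns the test statistic and the p-value.
shap_w, shap_p = shapiro(y)
if shap_p > 0.05:
    normal_YN = 'Fail to reject the null hypothesis. Data is normally distributed.'
else:
    normal_YN = 'Null hypothesis is rejected. Data is not normally distributed.'

plt.figure()
plt.hist(y, bins=20)
plt.xlabel('y Value')
plt.ylabel('Frequency')
plt.title(normal_YN)  # the title reports the outcome of the test
```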
Check out the final output in this screenshot:
As mentioned previously, histograms are used for displaying the distribution of an array. Another common statistical plot for exploring a numerical feature is a box-and-whisker plot, also referred to as a boxplot.
Box-and-whisker plots display the distribution of an array based on the minimum, first quartile, median, third quartile, and maximum, but they are primarily used to indicate the skew of a distribution and to identify outliers.
In this exercise, we will learn how to create a box-and-whisker plot and portray information regarding the shape of the distribution and the number of outliers in our title:
import numpy as np
y = np.random.normal(loc=0, scale=0.1, size=100)
import matplotlib.pyplot as plt
plt.boxplot(y)
plt.show()
For the output, refer to the following figure:
The plot displays a box that represents the interquartile range (IQR). The bottom of the box is the 25th percentile (that is, Q1) while the top of the box is the 75th percentile (that is, Q3). The orange line going through the box is the median. The two lines extending above and below the box are the whiskers. The top of the upper whisker is the "maximum" value, which is the largest data point no greater than Q3 + 1.5*IQR. The bottom of the lower whisker is the "minimum" value, which is the smallest data point no less than Q1 - 1.5*IQR. Outliers (or fringe outliers) are displayed as dots above the "maximum" whisker or below the "minimum" whisker.
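The arithmetic behind a boxplot can be checked by hand with NumPy. The sketch below uses a small made-up array (an assumption for illustration, not data from this chapter) and the conventional 1.5*IQR rule: the lower fence is Q1 - 1.5*IQR, the upper fence is Q3 + 1.5*IQR, and each whisker ends at the most extreme data point inside its fence.

```python
import numpy as np

# Made-up illustrative data with one obviously extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # points below this are outliers
upper_fence = q3 + 1.5 * iqr  # points above this are outliers

# Whiskers stop at the most extreme points that are still inside the fences.
lower_whisker = data[data >= lower_fence].min()
upper_whisker = data[data <= upper_fence].max()

outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q3, iqr)                   # 3.25 7.75 4.5
print(lower_fence, upper_fence)      # -3.5 14.5
print(lower_whisker, upper_whisker)  # 1 9
print(outliers)                      # [30]
```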
from scipy.stats import shapiro
shap_w, shap_p = shapiro(y)
from scipy.stats import zscore
y_z_scores = zscore(y)
This is a measure of the data which shows how many standard deviations each datapoint is from the mean.
total_outliers = 0
for i in range(len(y_z_scores)):
    if abs(y_z_scores[i]) >= 3:
        total_outliers += 1
Because the array, y, was generated to be normally distributed and contains only 100 values, we can expect there to be few, if any, outliers in the data.
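The outlier count can also be computed without an explicit loop. The following is an alternative sketch, not the exercise's own method: it uses NumPy boolean masking, and the seeded generator is an assumption added so the example is repeatable.

```python
import numpy as np
from scipy.stats import zscore

# Assumption: a seeded generator so the example is repeatable.
rng = np.random.default_rng(42)
y = rng.normal(loc=0, scale=0.1, size=100)

# z-score: how many standard deviations each point lies from the mean.
y_z_scores = zscore(y)

# np.abs(...) >= 3 gives a boolean array; summing it counts the True values.
total_outliers = int(np.sum(np.abs(y_z_scores) >= 3))
print(total_outliers)
```

The result matches the loop-based count for the same array, since both apply the same |z| >= 3 threshold.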
if shap_p > 0.05:
    title = 'Normally distributed with {} outlier(s).'.format(total_outliers)
else:
    title = 'Not normally distributed with {} outlier(s).'.format(total_outliers)
plt.show()
Histograms and box-and-whisker plots are effective in exploring the characteristics of numerical arrays. However, they do not provide information on the relationships between arrays. In the next exercise, we will learn how to create a scatterplot – a common visualization to display the relationship between two continuous arrays.
In this exercise, we will be creating a scatterplot of weight versus height. We will, again, create a title explaining the message of the plot being portrayed:
y = [5, 5.5, 5, 5.5, 6, 6.5, 6, 6.5, 7, 5.5, 5.25, 6, 5.25]
x = [100, 150, 110, 140, 140, 170, 168, 165, 180, 125, 115, 155, 135]
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.xlabel('Weight')
plt.ylabel('Height')
Our output should be similar to the following:
from scipy.stats import pearsonr
correlation_coeff, p_value = pearsonr(x, y)
if correlation_coeff == 1.00:
    title = 'There is a perfect positive linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff >= 0.8:
    title = 'There is a very strong, positive linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff >= 0.6:
    title = 'There is a strong, positive linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff >= 0.4:
    title = 'There is a moderate, positive linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff >= 0.2:
    title = 'There is a weak, positive linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff > 0:
    title = 'There is a very weak, positive linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff == 0:
    title = 'There is no linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff <= -0.8:
    title = 'There is a very strong, negative linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff <= -0.6:
    title = 'There is a strong, negative linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff <= -0.4:
    title = 'There is a moderate, negative linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
elif correlation_coeff <= -0.2:
    title = 'There is a weak, negative linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
else:
    title = 'There is a very weak, negative linear relationship (r = {0:0.2f}).'.format(correlation_coeff)
print(title)
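The same thresholds can be wrapped in a small reusable function. The helper name `describe_correlation` is hypothetical and not part of Matplotlib or SciPy; it is only a compact sketch of the rules in this exercise. One deliberate difference: it labels r = -1 as a perfect negative relationship, a case the exercise's chain reports as very strong.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
from scipy.stats import pearsonr


def describe_correlation(r):
    """Hypothetical helper: turn Pearson's r into a descriptive title."""
    if r == 0:
        return 'There is no linear relationship (r = {0:0.2f}).'.format(r)
    direction = 'positive' if r > 0 else 'negative'
    size = abs(r)
    if size == 1:
        return 'There is a perfect {} linear relationship (r = {:0.2f}).'.format(direction, r)
    # Walk the cutoffs from strongest to weakest; fall through to 'very weak'.
    strength = 'very weak'
    for cutoff, label in [(0.8, 'very strong'), (0.6, 'strong'),
                          (0.4, 'moderate'), (0.2, 'weak')]:
        if size >= cutoff:
            strength = label
            break
    return 'There is a {}, {} linear relationship (r = {:0.2f}).'.format(strength, direction, r)


# Height and weight data from this exercise.
y = [5, 5.5, 5, 5.5, 6, 6.5, 6, 6.5, 7, 5.5, 5.25, 6, 5.25]
x = [100, 150, 110, 140, 140, 170, 168, 165, 180, 125, 115, 155, 135]

correlation_coeff, p_value = pearsonr(x, y)
title = describe_correlation(correlation_coeff)

plt.figure()
plt.scatter(x, y)
plt.xlabel('Weight')
plt.ylabel('Height')
plt.title(title)
```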
Refer to the following figure for the resultant output:
Up to this point, we have learned how to create and style an assortment of plots for several different purposes using the functional approach. While this approach of plotting is effective for generating quick visualizations, it does not allow us to create multiple subplots or store the plot as an object in our environment. To save the plot as an object in our environment, we must use the object-oriented approach, which will be covered in the following exercises and activities.
Using the functional approach of plotting in Matplotlib does not allow the user to save the plot as an object in our environment. In the object-oriented approach, we create a figure object that acts as an empty canvas and then we add a set of axes, or subplots, to it. The figure object is callable and, if called, will return the figure to the console. We will demonstrate how this works by plotting the same x and y objects as we did in Exercise 13.
When we learned about the functional approach of plotting in Matplotlib, we began by creating and customizing a line plot. In this exercise, we will create and style a line plot using the object-oriented plotting approach:
import numpy as np
x = np.linspace(0, 10, 20)
Save y as x cubed using the following:
y = x**3
import matplotlib.pyplot as plt
fig, axes = plt.subplots()
plt.show()
Check out the following screenshot to view the output:
The fig object is now callable (that is, evaluating fig displays the figure again), and axes is the axis on which we can plot.
axes.plot(x, y)
The following figure displays the output:
axes.plot(x, y, 'D-k')
axes.set_xlabel('Linearly Spaced Numbers')
axes.set_ylabel('y Value')
axes.set_title('As x increases, y increases by x cubed')
The following figure displays the output:
In this exercise, we created a plot very similar to the first plot in Exercise 13, but now it is a callable object. Another advantage of using the object-oriented plotting approach is the ability to create multiple subplots on a single figure object.
In some situations, we want to compare different views of data side by side. We can accomplish this in Matplotlib using subplots.
Thus, in this exercise, we will plot the same lines as in Exercise 14, but we will plot them on two subplots in the same, callable figure object. Subplots are laid out using a grid format and are accessible using [row, column] indexing. For example, if our figure object contains four subplots organized in two rows and two columns, we would index reference the top-left plot using axes[0,0] and the bottom-right plot using axes[1,1], as shown in the following figure.
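The indexing rule can be verified with a short sketch. The titles are placeholders added for illustration. One detail worth noting: when `plt.subplots()` is asked for a single row or a single column, it returns a one-dimensional array of axes, which is why a 1 x 2 layout is indexed as `axes[0]` and `axes[1]` rather than with [row, column] pairs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

# Two rows and two columns: axes is a 2-D array indexed as [row, column].
fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0, 0].set_title('axes[0, 0]: top-left')
axes[0, 1].set_title('axes[0, 1]: top-right')
axes[1, 0].set_title('axes[1, 0]: bottom-left')
axes[1, 1].set_title('axes[1, 1]: bottom-right')
fig.tight_layout()  # keep titles and tick labels from overlapping

# One row and two columns: axes is a 1-D array indexed with a single number.
fig_row, axes_row = plt.subplots(nrows=1, ncols=2)
axes_row[0].set_title('axes_row[0]: left')
axes_row[1].set_title('axes_row[1]: right')
```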
In the remaining exercises and activities, we will get a lot of practice with generating subplots and accessing the various axes. In this exercise, we will be making multiple line plots using subplots:
import numpy as np
x = np.linspace(0, 10, 20)
y = x**3
y2 = x**2
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=2)
The resultant output is displayed here:
axes[0].plot(x, y)
axes[0].set_title('x by x Cubed')
axes[0].set_xlabel('Linearly Spaced Numbers')
axes[0].set_ylabel('y Value')
The resultant output is displayed here:
axes[1].plot(x, y2)
axes[1].set_title('x by x Squared')
axes[1].set_xlabel('Linearly Spaced Numbers')
axes[1].set_ylabel('y Value')
The following screenshot displays the output:
The figure here displays the output:
Using the object-oriented approach, we can display both subplots just by calling the fig object. We will practice object-oriented plotting further in Activity 4.
So far, we have learned how to build, customize, and program line plots, bar plots, histograms, scatterplots, and box-and-whisker plots using the functional approach. In Exercise 19, we were introduced to the object-oriented approach, and in Exercise 20, we learned how to create a figure with multiple plots using subplots. Thus, in this activity, we will be leveraging subplots to create a figure with multiple plots and plot types. We will be creating a figure with six subplots. The subplots will be displayed in three rows and two columns (see Figure 2.31):
Once we have generated our figure of six subplots, we access each subplot using 'row, column' indexing (see Figure 2.32):
Thus, to access the line plot (that is, top-left), use axes[0, 0]. To access the histogram (that is, middle-right), use axes[1, 1]. We will be practicing this in the following activity:
The solution for this activity can be found on page 338.
In this chapter, we used the Python plotting library Matplotlib to create, customize, and save plots using the functional approach. We then covered the importance of a descriptive title and created our own descriptive, programmatic titles. However, the functional approach does not create a callable figure object and it does not return subplots. Thus, to create a callable figure object with the potential of numerous subplots, we created, customized, and saved our plots using the object-oriented approach. Plotting needs can vary from analysis to analysis, so covering every possible plot in this chapter is not practical. To create powerful plots that meet the needs of each individual analysis, it is imperative to become familiar with the documentation and examples found on the Matplotlib documentation page.
In the subsequent chapter, we will apply some of these plotting techniques as we dive into machine learning using scikit-learn.