Do you want to visualize a series of data measurements (or observations) and show several properties of the series, such as the median value, the spread of the data, and the distribution of the data, in one plot? And would you like to do that in a way that lets you visually compare several similar data series? How would you visualize them? Welcome to the box-and-whisker plot! It is probably the best plot type for comparing distributions, provided you are talking to people who are used to information-dense graphics.
Uses of the box-and-whisker plot range from comparing test scores between schools to comparing process parameters before and after a change (optimization).
What are the elements of box-and-whisker plots? As we see in the following diagram, several important elements carry information in the box-and-whisker plot. The first component is the box, which represents the interquartile range and spans from the lower quartile to the upper quartile. The median value of the data is represented by a line across the box.
The whiskers extend from both sides of the box: one below the first quartile (25th percentile) and one above the third quartile (75th percentile) of the data. By default, each whisker reaches the most extreme data point that lies within 1.5 times the interquartile range of the edge of the box. In the case of a normal distribution, the whiskers will cover about 99.3% of the data.
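The 99.3% figure can be checked numerically. The following is a minimal sketch that uses only Python's standard library; it assumes a standard normal distribution and the 1.5 multiplier mentioned above, and the variable names are illustrative.

```python
from statistics import NormalDist

# Assumption: a standard normal distribution (mean 0, standard deviation 1).
dist = NormalDist()

# Quartiles of the distribution: these are the edges of the box.
q1 = dist.inv_cdf(0.25)
q3 = dist.inv_cdf(0.75)
iqr = q3 - q1

# Farthest points the whiskers can reach: 1.5 * IQR beyond each box edge.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Share of the distribution that falls between the two limits.
coverage = dist.cdf(upper_limit) - dist.cdf(lower_limit)
print(f"whisker limits: {lower_limit:.3f} to {upper_limit:.3f}")
print(f"coverage: {coverage:.1%}")
```

The limits land at roughly 2.7 standard deviations on either side of the mean, which is where the figure of about 99.3% comes from.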
If there are values outside the range of the whiskers, they will be displayed individually as fliers (outliers). Otherwise, the whiskers will cover the total range of the data.
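To make the whisker and flier rule concrete, here is a small sketch that works them out by hand for one series. It is not matplotlib's own code: the helper name `whiskers_and_fliers` and the sample values are illustrative, and it assumes quartiles are computed with linear interpolation between data points (the "inclusive" method in the standard library, which matches NumPy's default percentile behavior).

```python
from statistics import quantiles


def whiskers_and_fliers(values, whis=1.5):
    """Return (lower whisker, upper whisker, fliers) for one data series."""
    data = sorted(values)
    # Assumption: quartiles use linear interpolation between data points.
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - whis * iqr
    high_limit = q3 + whis * iqr
    inside = [v for v in data if low_limit <= v <= high_limit]
    fliers = [v for v in data if v < low_limit or v > high_limit]
    # Each whisker ends at the most extreme value still inside the limits.
    return inside[0], inside[-1], fliers


# Illustrative error counts for one process.
series = [2, 3, 6, 8, 13, 14, 19, 23, 60, 69]
low, high, fliers = whiskers_and_fliers(series)
print(low, high, fliers)
```

For this series the quartiles are 6.5 and 22, so the upper limit is 45.25: the upper whisker stops at 23, and the values 60 and 69 are reported as fliers.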
Optionally, the box can also carry information about the confidence interval around the median. This is represented by a notch in the box. This information can be used to indicate whether two data series have similar distributions: if the notches of two boxes do not overlap, their medians likely differ. However, this is not rigorous and is just an indication that can be visually inspected.
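The notch comparison can be sketched numerically as well. The example below assumes a common Gaussian-based approximation for the interval, median ± 1.57 × IQR / √n; treat the constant and the helper names `notch_interval` and `notches_overlap` as assumptions of this sketch rather than a statement of how any particular library draws its notches.

```python
from math import sqrt
from statistics import median, quantiles


def notch_interval(values):
    """Approximate confidence interval around the median of one series."""
    data = sorted(values)
    # Assumption: quartiles use linear interpolation between data points.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    # Assumption: the 1.57 * IQR / sqrt(n) rule of thumb for the half width.
    half_width = 1.57 * (q3 - q1) / sqrt(len(data))
    med = median(data)
    return med - half_width, med + half_width


def notches_overlap(a, b):
    """True if the two notch intervals share at least one point."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a


# Illustrative measurements before and after a process change.
before = [12, 15, 23, 24, 30, 31, 33, 36, 50, 73]
after = [6, 22, 26, 33, 35, 47, 54, 55, 62, 63]
print(notch_interval(before))
print(notches_overlap(before, after))
```

Overlapping notches, as in this pair of series, mean the plot gives no visual evidence that the medians differ; as noted above, this is an informal check and not a substitute for a statistical test.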
In the following recipe, you will learn how to create a box-and-whisker plot using matplotlib.
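We will perform the following steps:
1. Read the values of the PROCESSES dictionary into DATA.
2. Read the keys of the PROCESSES dictionary into LABELS.
3. Render the box-and-whisker plot using matplotlib.pyplot.boxplot.
4. Remove some chartjunk from the figure.
5. Add axes labels.
The following code implements these steps: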
import matplotlib.pyplot as plt

# define data
PROCESSES = {
    "A": [12, 15, 23, 24, 30, 31, 33, 36, 50, 73],
    "B": [6, 22, 26, 33, 35, 47, 54, 55, 62, 63],
    "C": [2, 3, 6, 8, 13, 14, 19, 23, 60, 69],
    "D": [1, 22, 36, 37, 45, 47, 48, 51, 52, 69],
}

# convert the dictionary views to lists so that they can be indexed
DATA = list(PROCESSES.values())
LABELS = list(PROCESSES.keys())

plt.boxplot(DATA, notch=False, widths=0.3)

# set ticklabel to process name
plt.gca().xaxis.set_ticklabels(LABELS)

# some clean up (removing chartjunk)
# turn the spine off
for spine in plt.gca().spines.values():
    spine.set_visible(False)

# turn all ticks for x-axis off
plt.gca().xaxis.set_ticks_position('none')
# leave left ticks for y-axis on
plt.gca().yaxis.set_ticks_position('left')

# set axes labels
plt.ylabel("Errors observed over defined period.")
plt.xlabel("Process observed over defined period.")

plt.show()
The box-and-whisker plot is rendered by first computing quartiles for the given data in DATA. These quartile values are used to compute the lines that draw the boxes and whiskers.
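The computed values can also be inspected without drawing anything. The sketch below assumes the helper `matplotlib.cbook.boxplot_stats` is available in your matplotlib version and that it returns one dictionary per series with keys such as `q1`, `med`, `q3`, `whislo`, `whishi`, and `fliers`; the printing layout is illustrative.

```python
from matplotlib import cbook

PROCESSES = {
    "A": [12, 15, 23, 24, 30, 31, 33, 36, 50, 73],
    "B": [6, 22, 26, 33, 35, 47, 54, 55, 62, 63],
    "C": [2, 3, 6, 8, 13, 14, 19, 23, 60, 69],
    "D": [1, 22, 36, 37, 45, 47, 48, 51, 52, 69],
}

# One dictionary of statistics per data series, in the order given.
stats = cbook.boxplot_stats(list(PROCESSES.values()))

for name, s in zip(PROCESSES, stats):
    print(
        name,
        "q1:", s["q1"],
        "median:", s["med"],
        "q3:", s["q3"],
        "whiskers:", (s["whislo"], s["whishi"]),
        "fliers:", list(s["fliers"]),
    )
```

Looking at these numbers next to the rendered figure is a quick way to confirm which points end up as fliers and where each whisker stops.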
We adjusted the plot by removing all the unnecessary lines (superfluous marks known as chartjunk, as described in the famous book The Visual Display of Quantitative Information by Edward R. Tufte). Those lines do not carry information; they only force the viewer to decode them before discovering the really valuable information.