Making error bars

Error bars are useful for displaying the dispersion of data on a plot. They are a relatively simple form of visualization; however, they are also somewhat problematic because what is shown as the error varies across sciences and publications. This does not lessen the usefulness of error bars; it just means we must always be careful to state explicitly what kind of error each error bar represents.

Getting ready

To plot error bars for raw observed data, we need to compute the mean and the error we want to display.

The error we compute here is the 95% confidence interval of the mean: a range around the sample mean that indicates how precisely our observations estimate the mean of the whole population.

Matplotlib supports this type of plot via the matplotlib.pyplot.errorbar function.

It offers several kinds of error bars. They can be vertical (yerr) or horizontal (xerr) and symmetrical or asymmetrical.
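As a minimal sketch of these options (the sample values and variable names are made up for illustration and are not part of the recipe), yerr accepts a 2 x N array whose first row holds the lower errors and whose second row holds the upper errors, while a scalar xerr draws the same horizontal error on every point:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# hypothetical sample points, for illustration only
x = np.arange(4, dtype=float)
y = np.array([2.0, 6.0, 5.0, 4.0])

# asymmetrical vertical errors: row 0 = distance below y, row 1 = distance above y
asym_yerr = np.array([[0.5, 1.0, 0.8, 0.3],
                      [1.0, 0.4, 1.5, 0.6]])

fig, ax = plt.subplots()
# symmetrical horizontal error of 0.2 on every point, asymmetrical vertical error
container = ax.errorbar(x, y, xerr=0.2, yerr=asym_yerr, fmt='o', capsize=3)
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to see the result.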

How to do it...

In the following code we will:

  1. Use some sample data that consists of four sets of observations.
  2. For each set of observations, compute the mean value.
  3. For each set of observations, compute the 95% confidence interval.
  4. Render bars with vertical symmetrical error bars.

Here is the code for this:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sc

TEST_DATA = np.array([[1,2,3,2,1,2,3,4,2,3,2,1,2,3,4,4,3,2,3,2,3,2,1],
                      [5,6,5,4,5,6,7,7,6,7,7,2,8,7,6,5,5,6,7,7,7,6,5],
                      [9,8,7,8,8,7,4,6,6,5,4,3,2,2,2,3,3,4,5,5,5,6,1],
                      [3,2,3,2,2,2,2,3,3,3,3,4,4,4,4,5,6,6,7,8,9,8,5],
                      ])

# find mean for each of our observations
y = np.mean(TEST_DATA, axis=1, dtype=np.float64)
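# and the half-width of the 95% confidence interval (1.96 * standard error);
# errorbar() expects yerr as a distance from y, not as an absolute position
ci95 = 1.96 * sc.sem(TEST_DATA, axis=1)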

# each set is one try
tries = np.arange(0, len(y), 1.0)

# tweak grid and setup labels, limits
plt.grid(True, alpha=0.5)
plt.gca().set_xlabel('Observation #')
plt.gca().set_ylabel('Mean (+- 95% CI)')
plt.title("Observations with corresponding 95% CI as error bar.")
plt.bar(tries, y, align='center', alpha=0.2)
# fmt='none' draws only the error bars, without markers or a connecting line
plt.errorbar(tries, y, yerr=ci95, fmt='none')

plt.show()

The preceding code renders a bar chart with error bars that display the 95% confidence intervals as whiskers extending along the y axis. Remember, the wider the whiskers, the less precisely the observed mean estimates the true mean. The following graph is the output for the preceding code:

[Figure: bar chart of the four observation means with 95% confidence interval whiskers]

How it works...

In order to avoid iterating over each set of observations, we use NumPy's vectorized methods to compute means and standard errors, which we use for plotting and computing error values.

Using NumPy's vectorized implementations, which are written in C (and called from Python), can speed up computations by orders of magnitude.

This is not very important for a few data points but, for millions of data points, it can either make or break our efforts to create responsive applications.
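To make the difference concrete, here is a small sketch that computes the same per-set means with an explicit Python loop and with a single vectorized call. The three short data rows are hypothetical and only keep the example compact:

```python
import numpy as np

# hypothetical data: 3 sets of 5 observations each
data = np.array([[1, 2, 3, 2, 1],
                 [5, 6, 5, 4, 5],
                 [9, 8, 7, 8, 8]])

# explicit Python loop: one mean per row
loop_means = np.array([sum(row) / len(row) for row in data])

# vectorized: a single call reduces along axis 1 (across each row)
vector_means = np.mean(data, axis=1)

# both approaches give the same values; only the speed differs on large inputs
print(np.allclose(loop_means, vector_means))
```

Both results agree; on arrays with millions of elements the vectorized call avoids the per-element overhead of the Python interpreter.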

Also, you may note that we explicitly specified dtype=np.float64 in the np.mean function call. According to the official NumPy documentation reference (http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html), np.mean can be inaccurate when computed in single precision (np.float32); specifying a higher-precision accumulator such as np.float64 through the dtype argument alleviates this. For integer input such as ours, np.mean already uses np.float64 by default, so the explicit dtype mainly documents our intent.
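The effect of the accumulator precision can be sketched as follows. The float32 array is modelled on the example in the NumPy documentation for np.mean; the variable names are ours, and the exact digits printed may vary between NumPy versions:

```python
import numpy as np

# integer input, like TEST_DATA in this recipe: the mean is computed in float64
int_mean = np.mean(np.array([1, 2, 3, 4]))

# float32 input: the mean is accumulated in float32 unless dtype says otherwise
a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
mean32 = np.mean(a)                     # single-precision accumulator
mean64 = np.mean(a, dtype=np.float64)   # double-precision accumulator

print(int_mean.dtype, mean32.dtype, mean64.dtype)
# distance of each result from the ideal value 0.55
print(abs(float(mean32) - 0.55), abs(float(mean64) - 0.55))
```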

There's more...

There is an ongoing debate about what to show on error bars. Different sources advise using SD, 2SD, SE, or the 95% CI. We must understand the difference between these values and what each is used for in order to justify what to use and when.

Standard Deviation (SD) informs us about the spread of individual data points around the mean value. If we assume a normal distribution, then we know that 68.2% (~2/3) of data values fall within ±SD of the mean, and 95.4% of values fall within ±2*SD.
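These percentages can be checked empirically. The following sketch uses simulated, normally distributed data (the mean of 10, SD of 2, sample size, and seed are arbitrary choices for illustration) and counts how many points fall within one and two standard deviations of the mean:

```python
import numpy as np

# simulated normal sample; the seed is fixed so reruns give the same numbers
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=100_000)

mean = sample.mean()
sd = sample.std(ddof=1)  # sample standard deviation

# fraction of points within 1 and 2 standard deviations of the mean
within_1sd = np.mean(np.abs(sample - mean) <= sd)
within_2sd = np.mean(np.abs(sample - mean) <= 2 * sd)

print(f"within 1 SD: {within_1sd:.3f}, within 2 SD: {within_2sd:.3f}")
```

The printed fractions should land close to 0.68 and 0.95; real data that is not normally distributed can deviate from these values.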

Standard Error (SE) is calculated as SD divided by the square root of N (SD/√N), where N is the number of data points. It informs us about the variability of the mean value if we were able to repeat the same sampling many times (like performing the same study hundreds of times).
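As a sketch of this formula, the following computes SD and SE by hand for the first set of observations from this recipe. It assumes the sample standard deviation (ddof=1), which is also the default used by scipy.stats.sem:

```python
import numpy as np

# first set of observations from TEST_DATA in this recipe
data = np.array([1, 2, 3, 2, 1, 2, 3, 4, 2, 3, 2, 1,
                 2, 3, 4, 4, 3, 2, 3, 2, 3, 2, 1])

n = data.size
sd = np.std(data, ddof=1)   # sample SD (ddof=1 divides by N - 1)
se = sd / np.sqrt(n)        # standard error of the mean: SD / sqrt(N)

print(f"N = {n}, SD = {sd:.3f}, SE = {se:.3f}")
```

Because SE shrinks with the square root of N, collecting four times as many observations roughly halves it, while SD stays about the same.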

The confidence interval is calculated from SE, similar to how the range of values is calculated from Standard Deviation. To calculate the 95% confidence interval, we add/subtract 1.96 * SE to/from our mean value, or in proper notation: 95% CI = M ± (1.96 * SE). The wider the confidence interval, the less precise our estimate of the mean is.
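The calculation can be sketched for a single set of observations as follows. The 1.96 factor comes from the normal distribution; the Student's t critical value shown at the end is a common small-sample alternative that we add here for comparison and is not part of the recipe:

```python
import numpy as np
import scipy.stats as sc

# second set of observations from TEST_DATA in this recipe
data = np.array([5, 6, 5, 4, 5, 6, 7, 7, 6, 7, 7, 2,
                 8, 7, 6, 5, 5, 6, 7, 7, 7, 6, 5])

mean = data.mean()
se = sc.sem(data)  # SD / sqrt(N), with ddof=1 by default

# normal-approximation critical value, approximately 1.96
z = sc.norm.ppf(0.975)
half_width = z * se
ci_low, ci_high = mean - half_width, mean + half_width

# small-sample alternative (our addition): Student's t with N - 1 degrees of freedom
t_crit = sc.t.ppf(0.975, df=data.size - 1)
t_half_width = t_crit * se

print(f"95% CI (normal): {ci_low:.3f} .. {ci_high:.3f}")
print(f"half-width: normal {half_width:.3f}, t-based {t_half_width:.3f}")
```

The half_width value is what should be passed as yerr to errorbar, since yerr is a distance from the mean rather than an absolute position.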

We see that to show how reliable our estimate is, and to make that evident to our reader, we should display the confidence interval, which in turn is derived from the standard error; a small standard error indicates that our means are stable.
