Another very popular question regarding univariate data is, "How variable are the data points?" or "How spread out or dispersed are the observations?" To answer these questions, we have to measure the spread, or dispersion, of a data sample.
The simplest way to answer that question is to take the largest value in the dataset and subtract the smallest value from it. This will give you the range. However, this suffers from a problem similar to the issue with the mean. The range of salaries at the law firm will vary widely depending on whether the CEO is included in the set. Further, the range depends on just two values, the highest and the lowest, and therefore can't speak to the dispersion of the bulk of the dataset.
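For instance, the range can be computed in R as follows (the salaries vector here is a made-up example, not data from the book):

```r
# a hypothetical vector of salaries (in thousands of dollars);
# the last value is the CEO's
salaries <- c(45, 52, 60, 61, 75, 80, 3000)

# the range is the largest value minus the smallest
max(salaries) - min(salaries)

# equivalently, diff() on range(), which returns c(min, max)
diff(range(salaries))
```

Note how a single extreme value (the CEO's salary) completely dominates the result.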
One tactic that solves the first of these problems is to use the interquartile range.
What about measures of spread for categorical data?
The measures of spread that we talk about in this section are only applicable to numeric data. There are, however, measures of spread or diversity of categorical data. In spite of the usefulness of these measures, this topic goes unmentioned or blithely ignored in most data analysis and statistics texts. This is a long and venerable tradition that we will, for the most part, adhere to in this book. If you are interested in learning more about this, search for 'Diversity Indices' on the web.
Remember when we said that the median split a sorted dataset into two equal parts, and that it was the 50th percentile because 50 percent of the observations fell below its value? In a similar way, if you were to divide a sorted data set into four equal parts, or quartiles, the three values that make these divides would be the first, second, and third quartiles respectively. These values can also be called the 25th, 50th, and 75th percentiles. Note that the second quartile, the 50th percentile, and the median are all equivalent.
The interquartile range is the difference between the third and first quartiles. If you apply the interquartile range to a sample of salaries at the law firm that includes the CEO, the enormous salary will be discarded with the highest 25 percent of the data. However, this still only uses two values, and doesn't speak to the variability of the middle 50 percent.
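In R, the quartiles and the interquartile range can be computed directly. Using the same hypothetical salaries vector as before:

```r
# a hypothetical vector of salaries (in thousands of dollars)
salaries <- c(45, 52, 60, 61, 75, 80, 3000)

# the 25th, 50th, and 75th percentiles
# (the first, second, and third quartiles)
quantile(salaries, probs = c(0.25, 0.50, 0.75))

# the interquartile range: the third quartile minus the first
IQR(salaries)
```

Unlike the range, the result is unaffected by how extreme the CEO's salary is, because the top 25 percent of the data is discarded either way.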
Well, one way we can use all the data points to inform our spread metric is by subtracting the mean of the dataset from each of its elements. This will give us the deviations, or residuals, from the mean. If we add up all these deviations, we will arrive at the sum of the deviations from the mean. Try to find the sum of the deviations from the mean in this set: {1, 3, 5, 6, 7}.
If we try to compute this, we notice that the positive deviations are cancelled out by the negative deviations. In order to cope with this, we need to take the absolute value, or the magnitude of the deviation, and sum them.
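A quick check in R confirms that the raw deviations cancel out (up to floating-point error), whereas the absolute deviations do not:

```r
x <- c(1, 3, 5, 6, 7)

# the deviations sum to (effectively) zero
sum(x - mean(x))

# the sum of the absolute deviations does not
sum(abs(x - mean(x)))
```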
This is a great start, but note that this metric keeps increasing if we add more data to the set. Because of this, we may want to take the average of these deviations. This is called the average deviation.
For those having trouble following the description in words, the formula for average deviation from the mean is the following:

$$ \frac{1}{N} \sum_{i=1}^{N} \left| x_i - \mu \right| $$

where $\mu$ is the mean, $N$ is the number of elements of the sample, and $x_i$ is the ith element of the dataset. It can also be expressed in R as follows:
> sum(abs(x - mean(x))) / length(x)
Though average deviation is an excellent measure of spread in its own right, its use is commonly—and sometimes unfortunately—supplanted by two other measures.
Instead of taking the absolute value of each residual, we can achieve a similar outcome by squaring each deviation from the mean. This, too, ensures that each residual is positive (so that there is no cancelling out). Additionally, squaring the residuals has the sometimes desirable property of magnifying larger deviations from the mean, while being more forgiving of smaller deviations. The sum of the squared deviations is called (you guessed it!) the sum of squared deviations from the mean or, simply, the sum of squares. The average of the sum of squared deviations from the mean is known as the variance and is denoted by $\sigma^2$:

$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 $$
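The population variance of the earlier example set can be computed directly from this definition in R. (As we'll see shortly, R's built-in var() function uses a slightly different divisor, so we spell the formula out by hand here.)

```r
x <- c(1, 3, 5, 6, 7)

# population variance: the mean of the squared deviations
sum((x - mean(x))^2) / length(x)
```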
When we square each deviation, we also square our units. For example, if our dataset held measurements in meters, our variance would be expressed in terms of meters squared. To get back our original units, we have to take the square root of the variance:

$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $$
This new measure, denoted by σ, is the standard deviation, and it is one of the most important measures in this book.
Note that we switched from referring to the mean as $\bar{x}$ to referring to it as $\mu$. This was not a mistake. Remember that $\bar{x}$ was the sample mean, and $\mu$ represented the population mean. The preceding equations use $\mu$ to illustrate that these equations are computing the spread metrics on the population dataset, and not on a sample. If we want to describe the variance and standard deviation of a sample, we use the symbols $s^2$ and $s$ instead of $\sigma^2$ and $\sigma$ respectively, and our equations change slightly:

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
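R's built-in var() and sd() functions implement these sample versions, dividing by n-1:

```r
x <- c(1, 3, 5, 6, 7)

# sample variance and sample standard deviation (built-in)
var(x)
sd(x)

# these match the n-1 formulas written out by hand
sum((x - mean(x))^2) / (length(x) - 1)
sqrt(sum((x - mean(x))^2) / (length(x) - 1))
```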
Instead of dividing our sum of squares by the number of elements in the set, we are now dividing it by n-1. What gives?
To answer that question, we have to learn a little bit about populations, samples, and estimation.