Measures of central tendency and variability

The measures used in descriptive statistics include measures of central tendency and measures of variability.

A measure of central tendency is a single value that attempts to describe a dataset by specifying a central position within the data. The three most common measures of central tendency are the mean, median, and mode.

A measure of variability describes the spread, or dispersion, of the values in a dataset. Measures of variability include the range, variance, and standard deviation.

Measures of central tendency

Let's take a look at each of the measures of central tendency, along with illustrations, in the following sections.

The mean

The mean, or sample mean, is the most popular measure of central tendency. It is equal to the sum of all the values in the dataset divided by the number of values in the dataset. Thus, in a dataset of n values, the mean is calculated as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

We use $\bar{x}$ if the data values are from a sample and $\mu$ if the data values are from a population.

The sample mean and population mean are different. The sample mean is what is known as an unbiased estimator of the true population mean. By repeated random sampling of the population to calculate the sample mean, we can obtain a mean of sample means. We can then invoke the law of large numbers and the central limit theorem (CLT) and denote the mean of sample means as an estimate of the true population mean.
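To see this in action, here is a minimal simulation sketch (not part of the original session; the seed, the exponential population, and the sample sizes are arbitrary illustrative choices). It repeatedly draws random samples from a skewed population and compares the mean of the sample means with the population mean:

In [6]: import numpy as np
        np.random.seed(0)  # arbitrary seed, for reproducibility
        # a deliberately skewed population with mean ~10
        population = np.random.exponential(scale=10, size=100000)
        # draw 1,000 random samples of size 50 and record each sample mean
        sample_means = [np.random.choice(population, size=50).mean()
                        for _ in xrange(1000)]
        # the mean of the sample means lands close to the population mean
        population.mean(), np.mean(sample_means)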

The population mean is also referred to as the expected value of the population.

The mean, as a calculated value, is often not one of the values observed in the dataset. The main drawback of using the mean is that it is very susceptible to outlier values and to skew in the dataset. For additional information, refer to http://en.wikipedia.org/wiki/Sample_mean_and_sample_covariance, http://en.wikipedia.org/wiki/Law_of_large_numbers, and http://bit.ly/1bv7l4s.

The median

The median is the data value that divides the set of sorted data values into two halves: exactly half of the values lie below it and the other half above it. When the number of values in the dataset is even, the median is the average of the two middle values. It is less affected by outliers and skewed data than the mean.
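As a quick illustration (the values here are arbitrary and not from the original text), for an even-sized dataset the median is the average of the two middle values of the sorted data:

In [7]: vals = sorted([12, 5, 7, 20])   # sorted: [5, 7, 12, 20]
        (vals[1] + vals[2]) / 2.0       # average of the two middle values
Out[7]: 9.5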

The mode

The mode is the most frequently occurring value in the dataset. It is more commonly used for categorical data in order to know which category is most common. One downside to using the mode is that it is not unique. A distribution with two modes is described as bimodal, and one with many modes is denoted as multimodal. Here is an illustration of a bimodal distribution with modes at two and seven since they both occur four times in the dataset:

In [4]: import matplotlib.pyplot as plt
        %matplotlib inline
In [5]: plt.hist([7,0,1,2,3,7,1,2,3,4,2,7,6,5,2,1,6,8,9,7])
        plt.xlabel('x')
        plt.ylabel('Count')
        plt.title('Bimodal distribution')
        plt.show()
[Figure: histogram of the bimodal distribution, with modes at 2 and 7]

Computing measures of central tendency of a dataset in Python

To illustrate, let us consider the following dataset consisting of marks obtained by 15 pupils for a test scored out of 20:

In [18]: grades = [10, 10, 14, 18, 18, 5, 10, 8, 1, 12, 14, 12, 13, 1, 18]

The mean, median, and mode can be obtained as follows:

In [29]: # set the output precision to 3 decimal places
         %precision 3
Out[29]: u'%.3f'

In [30]: import numpy as np
         np.mean(grades)
Out[30]: 10.933

In [35]: %precision
         np.median(grades)
Out[35]: 12.0

In [24]: from scipy import stats
         stats.mode(grades)
Out[24]: (array([ 10.]), array([ 3.]))

Note that 18 also occurs three times in grades; when there is a tie, scipy.stats.mode returns only the smallest mode.

In [39]: import matplotlib.pyplot as plt
In [40]: plt.hist(grades)
         plt.title('Histogram of grades')
         plt.xlabel('Grade')
         plt.ylabel('Frequency')
         plt.show()
[Figure: histogram of grades]

To illustrate how the skewness of data or an outlier value can drastically affect the usefulness of the mean as a measure of central tendency, consider the following dataset that shows the wages (in thousands of dollars) of the staff at a factory:

In [45]: %precision 2
         salaries = [17, 23, 14, 16, 19, 22, 15, 18, 18, 93, 95]

In [46]: np.mean(salaries)
Out[46]: 31.82

Based on the mean value, we may make the assumption that the data is centered around the mean value of 31.82. However, we would be wrong. To see this, let's display an empirical distribution of the data using a bar plot:

In [59]: fig = plt.figure()
         ax = fig.add_subplot(111)
         # use one bin per unit of salary between the minimum and maximum
         plt.hist(salaries, bins=max(salaries) - min(salaries))
         ax.set_xlabel('Salary')
         ax.set_ylabel('# of employees')
         ax.set_title('Bar chart of salaries')
         plt.show()
[Figure: bar chart of salaries]

From the preceding bar plot, we can see that most of the salaries are far below 30K and no one is close to the mean of 32K. Now, if we take a look at the median, we see that it is a better measure of central tendency in this case:

In [47]: np.median(salaries)
Out[47]: 18.00

We can also take a look at a histogram of the data:

In [56]: plt.hist(salaries, bins=len(salaries))
         plt.title('Histogram of salaries')
         plt.xlabel('Salary')
         plt.ylabel('Frequency')
         plt.show()
[Figure: histogram of salaries]

Note

The histogram is actually a better representation of the data, as bar plots are generally used to represent categorical data while histograms are preferred for quantitative data, which is the case for the salaries data.

For more information on when to use histograms versus bar plots, refer to http://onforb.es/1Dru2gv.

If the distribution is symmetrical and unimodal (that is, it has only one mode), the three measures (mean, median, and mode) will be equal. This is not the case if the distribution is skewed: the mean and median will then differ from each other. With a negatively skewed distribution, the mean will be lower than the median, and vice versa for a positively skewed distribution. For example, in the positively skewed dataset [1, 2, 3, 4, 100], the mean is 22 while the median is only 3:

[Figure: positions of the mean, median, and mode for symmetric, negatively skewed, and positively skewed distributions]

The preceding figure is sourced from http://www.southalabama.edu/coe/bset/johnson/lectures/lec15_files/image014.jpg.

Measures of variability, dispersion, or spread

Another characteristic of a distribution that we measure in descriptive statistics is variability.

Variability specifies how much the data points are different from each other, or dispersed. Measures of variability are important because they provide an insight into the nature of the data that is not provided by the measures of central tendency.

As an example, suppose we conduct a study to examine how effective a pre-K education program is in lifting test scores of economically disadvantaged children. We can measure the effectiveness not only in terms of the average value of the test scores of the entire sample but also with the dispersion of the scores. Is it useful for some students and not so much for others? The variability of the data may help us identify some steps to be taken to improve the usefulness of the program.

Range

The simplest measure of dispersion is the range: the difference between the lowest and highest values in a dataset.

Range = highest value - lowest value
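As a quick illustration (not part of the original session; the prompt number is arbitrary), the range of the salaries dataset from earlier can be computed with NumPy's ptp (peak-to-peak) function, which is equivalent to max minus min:

In [48]: np.ptp(salaries)   # max(salaries) - min(salaries) = 95 - 14
Out[48]: 81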

Quartile

A more significant measure of dispersion is the quartile and the related interquartile range. A quartile is a value on the measurement scale below which 25, 50, or 75 percent of the scores in the sorted dataset fall. The quartiles are thus three points that split the dataset into four groups, each containing one-fourth of the data. To illustrate, suppose we have a dataset of 20 test scores that we rank as follows:

In [27]: import random
         random.seed(100)
         testScores = [random.randint(0, 100) for p in xrange(0, 20)]
         testScores
Out[27]: [14, 45, 77, 71, 73, 43, 80, 53, 8, 46, 4, 94, 95, 33, 31, 77, 20, 18, 19, 35]

In [28]: # data needs to be sorted for quartiles
         sortedScores = np.sort(testScores)
In [30]: rankedScores = {i+1: sortedScores[i] for i in xrange(len(sortedScores))}

In [31]: rankedScores
Out[31]:
{1: 4,
 2: 8,
 3: 14,
 4: 18,
 5: 19,
 6: 20,
 7: 31,
 8: 33,
 9: 35,
 10: 43,
 11: 45,
 12: 46,
 13: 53,
 14: 71,
 15: 73,
 16: 77,
 17: 77,
 18: 80,
 19: 94,
 20: 95}

The first quartile (Q1) lies between the fifth and sixth score, the second quartile (Q2) between the tenth and eleventh score, and the third quartile between the fifteenth and sixteenth score. Thus, we have (by using linear interpolation and calculating the midpoint):

Q1 = (19 + 20)/2 = 19.5
Q2 = (43 + 45)/2 = 44
Q3 = (73 + 77)/2 = 75

To see this in IPython, we can use the scipy.stats.mstats.mquantiles or numpy.percentile functions:

In [38]: from scipy.stats.mstats import mquantiles
         mquantiles(sortedScores)
Out[38]: array([ 19.45,  44.  ,  75.2 ])

In [40]: [np.percentile(sortedScores, perc) for perc in [25,50,75]]
Out[40]: [19.75, 44.0, 74.0]

The reason why the values don't match our previous calculations exactly is the different interpolation methods used. More information on the various methods of obtaining quartile values can be found at http://en.wikipedia.org/wiki/Quartile. The interquartile range is the first quartile subtracted from the third quartile (Q3 - Q1). It represents the middle 50 percent of the values in a dataset.
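As a quick check (a sketch, not in the original session; the prompt number is arbitrary), the interquartile range of the test scores follows directly from the percentiles computed above:

In [41]: q1, q3 = np.percentile(sortedScores, [25, 75])
         q3 - q1   # interquartile range: 74.0 - 19.75
Out[41]: 54.25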

For more information, refer to http://bit.ly/1cMMycN.

For more details on the scipy.stats.mstats.mquantiles and numpy.percentile functions, see the documentation at http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mquantiles.html and http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html.

Deviation and variance

A fundamental idea in the discussion of variability is the concept of deviation. Simply put, a deviation measures how far away a given value is from the mean of the distribution, that is, $x_i - \bar{x}$.

To quantify the deviation of a set of values, we take the sum of the squared deviations and normalize it by dividing by the size of the dataset; the result is referred to as the variance. We need to use the sum of squared deviations because the sum of the raw deviations around the mean is always 0, since the negative and positive deviations cancel each other out. The sum of squared deviations is defined as follows:

$$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

It can be shown that the preceding expression is equivalent to:

$$SS = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
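We can check this equivalence numerically (a quick sketch reusing the grades data from earlier; it is not part of the original session and the prompt number is arbitrary):

In [42]: x = np.array(grades, dtype=float)
         ss_direct = ((x - x.mean())**2).sum()            # sum of squared deviations
         ss_shortcut = (x**2).sum() - x.sum()**2/len(x)   # the equivalent form above
         np.allclose(ss_direct, ss_shortcut)
Out[42]: True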

Formally, the variance is defined as follows:

  • For sample variance, use the following formula:
    $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
  • For population variance, use the following formula:
    $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

The reason why the denominator is $n - 1$ for the sample variance instead of $n$ is that for sample variance, we wish to use an unbiased estimator. For more details on this, take a look at http://en.wikipedia.org/wiki/Bias_of_an_estimator.

The values of this measure are in squared units. This emphasizes the fact that what we have calculated as the variance is the squared deviation. Therefore, to obtain the deviation in the same units as the original points of the dataset, we must take the square root, and this gives us what we call the standard deviation. Thus, the standard deviation of a sample is given by using the following formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

However, for a population, the standard deviation is given by the following formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
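In Python, both forms are available through the ddof (delta degrees of freedom) argument of NumPy's var and std functions (a minimal sketch, not part of the original session; ddof=0, the default, gives the population formulas and ddof=1 the sample ones):

In [43]: np.var(grades)           # population variance, divides by N (~27.93)
         np.var(grades, ddof=1)   # sample variance, divides by n - 1 (~29.92)
In [44]: np.std(grades)           # population standard deviation (~5.28)
         np.std(grades, ddof=1)   # sample standard deviation (~5.47)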