Introductory statistics

Sometimes, numbers tell us more than pictures. As the name suggests, descriptive statistics describe the distribution of a variable, while inferential statistics tell you about the associations between variables. There is a plethora of possibilities for introductory statistics in R. However, before calculating any statistical values, let's quickly define some of the most popular measures of descriptive statistics.

The mean is the most common measure for determining the center of a distribution. It is also probably the most abused statistical measure. The mean does not mean anything without the standard deviation or some other measure, and it should never be used alone. Let me give you an example. Imagine there are two pairs of people. In the first pair, both people earn the same—let's say, $80,000 per year. In the second pair, one person earns $30,000 per year, while the other earns $270,000 per year. The mean salary for the first pair is $80,000, while the mean for the second pair is $150,000 per year. By just listing the mean, you could conclude that each person from the second pair earns more than either of the people in the first pair. However, you can clearly see that this would be a seriously incorrect conclusion.
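The two pairs from the example can be checked with a few lines of R. This is only a small sketch; the vector names pair1 and pair2 are made up for illustration:

```r
# Salaries from the example above, in dollars per year
# (pair1 and pair2 are hypothetical names)
pair1 <- c(80000, 80000)
pair2 <- c(30000, 270000)

# The means alone suggest that the second pair is better off
mean(pair1)   # 80000
mean(pair2)   # 150000

# The standard deviations show how different the two pairs really are
sd(pair1)     # 0
sd(pair2)     # about 169706
```

Reporting the standard deviation next to the mean makes the problem with the second pair visible immediately.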

The definition of the mean is simple; it is the sum of all values of a continuous variable divided by the number of cases, as shown in the following formula:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} v_i$$

The median is the value that splits the distribution into two halves. For a selected variable, the number of rows with a value lower than the median must equal the number of rows with a value greater than the median. If there is an odd number of rows, the median is the value in the middle row. If the number of rows is even, the median can be defined as the average of the values in the two middle rows (the financial median), the smaller of them (the lower statistical median), or the larger of them (the upper statistical median).
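The three definitions for an even number of rows can be sketched in R as follows. The data is hypothetical, and the indexing shown works only when the number of values is even:

```r
# Hypothetical values with an even number of cases
x <- c(7, 1, 5, 3)
s <- sort(x)                    # 1 3 5 7
n <- length(s)

financial_median <- median(x)   # average of the two middle values: 4
lower_median <- s[n / 2]        # smaller of the two middle values: 3
upper_median <- s[n / 2 + 1]    # larger of the two middle values: 5
```

Note that the built-in median() function returns the financial median for an even number of values.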

The range is the simplest measure of the spread; it is the plain distance between the maximum value and the minimum value that the variable takes.

A quick review: a variable is an attribute of an observation represented as a column in a table.

The formula for the range is:

$$R = v_{max} - v_{min}$$

You can split the distribution more—for example, you can split each half into two halves. This way, you get quartiles as three values that split the distribution into quarters. Let's generalize this splitting process. You start with sorting rows (cases, observations) on selected columns (attributes, variables). You define the rank as the absolute position of a row in your sequence of sorted rows. The percentile rank of a value is a relative measure that tells you what percentage of all (n) observations have a lower value than the selected value.

By splitting the observations into quarters, you get three percentiles (at 25%, 50%, and 75% of all rows). The values at those positions are important enough to have their own names: the quartiles. The second quartile is, of course, the median. The first one is called the lower quartile and the third one is known as the upper quartile. If you subtract the lower quartile (the first one) from the upper quartile (the third one), you get the formula for the inter-quartile range (IQR):

$$IQR = Q_3 - Q_1$$
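A short sketch in R shows the percentile rank, the quartiles, and the inter-quartile range for a small hypothetical vector. Keep in mind that quantile() interpolates between neighboring values by default, so for other data it might return values that do not appear in the data at all:

```r
# Hypothetical ages, already sorted for readability
age <- c(28, 31, 36, 40, 43, 47, 53, 60, 99)

# Percentile rank of the value 43: percentage of observations with a lower value
pct_rank <- 100 * mean(age < 43)          # 4 of 9 values are lower, about 44.4

# The three quartiles: 36, 43, and 53 for this data
q <- quantile(age, c(0.25, 0.50, 0.75))

# Inter-quartile range: upper quartile minus lower quartile
iqr_manual <- unname(q[3] - q[1])         # 17, the same as IQR(age)
```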

Let's suppose for a moment you have only one observation (n=1). This observation is also your sample mean, but there is no spread at all. You can calculate the spread only if n exceeds 1. Only (n-1) pieces of information help you calculate the spread, considering that the first observation is your mean. These pieces of information are called degrees of freedom. You can also think of degrees of freedom as the number of pieces of information that can vary. For example, imagine a variable that can take five different discrete states. You need to calculate the frequencies of four states only to know the distribution of the variable; the frequency of the last state is determined by the frequencies of the first four states you calculated, and they cannot vary because the cumulative percentage of all states must equal 100.

You can measure the distance between each value and the mean and call it the deviation. The sum of all deviations might look like a natural measure of how spread out your population is. However, some deviations are positive and others are negative, and they cancel each other out, so the total is always exactly zero. This is also why only (n-1) deviations are free; the last one is strictly determined by the requirement that the sum is zero. To avoid the cancellation, you can square the deviations. Here is the formula for the variance:

$$Var = \frac{1}{n-1}\sum_{i=1}^{n}(v_i - \mu)^2$$

This is the formula for the variance of a sample, used as an estimator for the variance of the population. Now, imagine that your data represents the complete population, so the mean is the true population mean rather than an estimate. Then, all the observations contribute to the variance calculation equally, and degrees of freedom make no sense. The variance for a population is defined in a similar way to the variance for a sample. You just use all n cases instead of n-1 degrees of freedom:

$$VarP = \frac{1}{n}\sum_{i=1}^{n}(v_i - \mu)^2$$

Of course, with large samples, both formulas return practically the same number. To compensate for having the deviations squared, you can take the square root of the variance. This is the definition of standard deviation (σ):

$$\sigma = \sqrt{Var}$$

Of course, you can use the same formula to calculate the standard deviation for the population, and the standard deviation of a sample as an estimator of the standard deviation for the population; just use the appropriate variance in the formula.
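The following sketch works through the deviations, both variances, and the standard deviation for a small hypothetical vector. It also shows that the built-in var() and sd() functions use the sample versions with n-1 degrees of freedom:

```r
# Hypothetical sample
v <- c(2, 4, 4, 5, 10)
n <- length(v)
dev <- v - mean(v)                    # deviations from the mean: -3 -1 -1 0 5

sum(dev)                              # the deviations cancel out: 0

var_sample <- sum(dev^2) / (n - 1)    # sample variance, n - 1 degrees of freedom: 9
var_pop    <- sum(dev^2) / n          # population variance, all n cases: 7.2

# Base R's var() and sd() return the sample (n - 1) versions
all.equal(var_sample, var(v))         # TRUE
all.equal(sqrt(var_sample), sd(v))    # TRUE
sqrt(var_pop)                         # population standard deviation
```

If you need the population versions in R, you have to calculate them yourself, as shown for var_pop.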

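You probably remember skewness and kurtosis from Chapter 2, Review of SQL Server Features for Developers. These two measures describe the skew and the peakedness of a distribution. The formulas for skewness and kurtosis, in a common sample-adjusted form, are:

$$Skew = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{v_i-\mu}{\sigma}\right)^3$$

$$Kurt = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{v_i-\mu}{\sigma}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$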

Where:

n: number of cases
v_i: i^th value
µ: mean
σ: standard deviation

These are the descriptive statistics measures for continuous variables. Note that some statisticians calculate kurtosis without the last subtraction, which approaches 3 for large samples; with that definition, a kurtosis around 3 indicates a distribution that is neither significantly peaked nor flattened compared to the normal distribution.
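As a minimal sketch, the two measures can be written as an R function that uses the symbols from the legend above. The exact correction factors are an assumption, because several variants of these formulas exist; this one includes the last subtraction for kurtosis. The function name skew_kurt_adj is hypothetical:

```r
# Sketch of sample-adjusted skewness and kurtosis (assumed variant)
skew_kurt_adj <- function(v) {
  n <- length(v)                      # number of cases
  mu <- mean(v)                       # mean
  sigma <- sd(v)                      # sample standard deviation
  z <- (v - mu) / sigma               # standardized deviations
  skew <- n / ((n - 1) * (n - 2)) * sum(z^3)
  kurt <- n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
  c(skewness = skew, kurtosis = kurt)
}

# Hypothetical symmetric data: skewness is 0, kurtosis is -1.2
res <- skew_kurt_adj(c(1, 2, 3, 4, 5))
res
```

The function needs at least four values, because the kurtosis formula divides by (n-3).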

For a quick overview of discrete variables, you use frequency tables. In a frequency table, you can show values, the absolute frequency of those values, absolute percentage, cumulative percentage, and a histogram of the absolute percentage.
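A frequency table of this kind can be put together with base R functions only. The following is a rough sketch with hypothetical data and column names; the text histogram is a crude stand-in that prints one asterisk for every 10 percent:

```r
# Hypothetical discrete variable with ordered levels
edu <- factor(c("High School", "Bachelors", "Bachelors",
                "High School", "Graduate Degree", "Bachelors"),
              levels = c("High School", "Bachelors", "Graduate Degree"),
              ordered = TRUE)

freq_abs <- table(edu)                   # absolute frequency
pct <- 100 * prop.table(freq_abs)        # absolute percentage
cum_pct <- cumsum(pct)                   # cumulative percentage

freq_table <- data.frame(
  Value = names(freq_abs),
  Frequency = as.vector(freq_abs),
  Percent = round(as.vector(pct), 1),
  CumPercent = round(as.vector(cum_pct), 1),
  Histogram = strrep("*", round(as.vector(pct) / 10))
)
freq_table
```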

One very simple way to calculate most of the measures introduced so far is by using the summary() function. You can feed it with a single variable or with a whole data frame. For a start, the following code re-reads the CSV file into a data frame and correctly orders the values of the Education variable. In addition, the code attaches the data frame to the search path:

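# Read the CSV file; backslashes in an R string must be doubled
TM = read.table("C:\\SQL2017DevGuide\\Chapter13_TM.csv",
                sep=",", header=TRUE,
                stringsAsFactors = TRUE);
# Add the columns of the data frame to the search path
attach(TM);
# Define the correct order of the Education values
Education = factor(Education, ordered=TRUE,
                   levels=c("Partial High School",
                            "High School","Partial College",
                            "Bachelors", "Graduate Degree"));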

Note that you might get a warning, "The following object is masked _by_ .GlobalEnv: Education". You get this warning if you didn't start a new session for this section. Remember, the Education variable was already ordered earlier in the code and therefore already exists in the global environment, which comes first on the search path; it hides, or masks, the newly read Education column. You can safely disregard this message and continue with defining the order of the Education values again.

The following code shows the simplest way to get a quick overview of descriptive statistics for the whole data frame:

summary(TM);

The partial results are here:

CustomerKey    MaritalStatus  Gender  TotalChildren  NumberChildrenAtHome
Min.   :11000  M:10011        F:9133  Min.   :0.000  Min.   :0.000
1st Qu.:15621  S: 8473        M:9351  1st Qu.:0.000  1st Qu.:0.000
Median :20242                         Median :2.000  Median :0.000
Mean   :20242                         Mean   :1.844  Mean   :1.004
3rd Qu.:24862                         3rd Qu.:3.000  3rd Qu.:2.000
Max.   :29483                         Max.   :5.000  Max.   :5.000

As mentioned, you can get a quick summary for a single variable as well. In addition, there are many functions that calculate a single statistic, for example, sd() to calculate the standard deviation. The following code calculates the summary for the Age variable, and then calls different functions to get the details. Note that the dataset was added to the search path. You should be able to recognize which statistic is calculated by which function from the function names and the results:

summary(Age); 
mean(Age); 
median(Age); 
min(Age); 
max(Age); 
range(Age); 
quantile(Age, 1/4); 
quantile(Age, 3/4); 
IQR(Age); 
var(Age); 
sd(Age);

Here are the results, with added labels for better readability:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28.00   36.00   43.00   45.41   53.00   99.00 
    
Mean  45.40981
Median      43
Min   28
Max   99
Range 28 99
25%   36 
75%   53 
IQR   17
Var   132.9251
StDev 11.52931

To calculate the skewness and kurtosis, you need to install an additional package. One possibility is the moments package. The following code installs the package, loads it in memory, and then calls the skewness() and kurtosis() functions from the package:

install.packages("moments"); 
library(moments); 
skewness(Age); 
kurtosis(Age);

The kurtosis() function from this package does not perform the last subtraction in the formula; therefore, a kurtosis of 3 means the distribution is about as peaked as a normal distribution. Here are the results:

0.7072522
2.973118

Another possibility to calculate the skewness and the kurtosis is to create a custom function. Creating your own function is really simple in R. Here is an example, together with a call:

skewkurt <- function(p){ 
  avg <- mean(p) 
  cnt <- length(p) 
  stdev <- sd(p) 
  skew <- sum((p-avg)^3/stdev^3)/cnt 
  kurt <- sum((p-avg)^4/stdev^4)/cnt-3 
  return(c(skewness=skew, kurtosis=kurt)) 
}; 
skewkurt(Age);

Note that this is a simple example that does not take into account all the details of the formulas and does not check for missing values. Nevertheless, here is the result. In this function, the kurtosis is calculated with the last subtraction in the formula, so it differs from the kurtosis returned by the moments package by approximately 3:

skewness     kurtosis 
0.70719483  -0.02720354
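If your data might contain missing values, one possible variant of the function above removes them before the calculation. This is only a sketch; the name skewkurt_na and the minimum-count check are assumptions added for illustration:

```r
# Variant of the custom function that drops missing values first
skewkurt_na <- function(p) {
  p <- p[!is.na(p)]                   # remove NA values
  cnt <- length(p)
  if (cnt < 2) stop("need at least two non-missing values")
  avg <- mean(p)
  stdev <- sd(p)
  skew <- sum((p - avg)^3 / stdev^3) / cnt
  kurt <- sum((p - avg)^4 / stdev^4) / cnt - 3
  c(skewness = skew, kurtosis = kurt)
}

# Hypothetical data with one missing value
res_na <- skewkurt_na(c(1, 2, NA, 3, 4, 5))
res_na
```

Without the first line of the function body, a single NA in the input would make both results NA.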

Before finishing with these introductory statistics, let's calculate some additional details for discrete variables. The summary() function returns absolute frequencies only. The table() function can be used for the same task. However, it is more powerful, as it can also do cross tabulation of two variables. You can also store the results in an object and pass this object to the prop.table() function, which calculates the proportions. The following code shows how to call the last two functions:

edt <- table(Education); 
edt; 
prop.table(edt);

Here are the results:

Partial High School      High School   Partial College       Bachelors 
               1581             3294              5064            5356 
    Graduate Degree 
               3189 
    
Partial High School      High School   Partial College       Bachelors 
         0.08553343      0.17820818         0.27396667      0.28976412 
    Graduate Degree 
         0.17252759
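The cross tabulation mentioned earlier works the same way; you just pass two variables to the table() function. Here is a small sketch with hypothetical data:

```r
# Hypothetical data: two discrete variables
gender  <- c("F", "M", "M", "F", "M", "F")
marital <- c("S", "M", "S", "M", "M", "M")

ct <- table(gender, marital)     # cross tabulation of the two variables
ct
prop.table(ct)                   # proportions of the grand total
prop.table(ct, margin = 1)       # proportions within each row (gender)
```

With margin = 1, the proportions in each row sum to 1; with margin = 2, the proportions in each column do.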

Of course, there is a package that includes a function that gives you a more condensed analysis of a discrete variable. The following code installs the descr package, loads it into memory, and calls the freq() function:

install.packages("descr"); 
library(descr); 
freq(Education);

Here are the results:

                    Frequency Percent Cum Percent
Partial High School      1581   8.553       8.553
High School              3294  17.821      26.374
Partial College          5064  27.397      53.771
Bachelors                5356  28.976      82.747
Graduate Degree          3189  17.253     100.000
Total                   18484 100.000
