Chapter 8. Probability Distributions, Covariance, and Correlation

In Chapter 6, Dimensionality Reduction with Principal Component Analysis, we discussed principal component analysis. In the previous chapter, we discovered association rules using the apriori algorithm in R. In this chapter, we will examine the following:

  • Probability distributions
  • A short introduction to descriptive statistics (mean and standard deviation)
  • Covariance and correlation, notably what they mean and how they are computed
  • How to perform correlation analysis in R

Probability distributions

In this section, we very briefly examine important distributions for common statistical problems with data consisting of quantities: the normal distribution and Student's t-distribution. We first introduce the idea of distributions with the discrete uniform distribution, and we conclude with the binomial distribution. We will try to be as non-technical as possible in this introduction to allow readers without statistical knowledge to follow easily; however, don't worry, we will be more technical later, when explaining how to build functions that estimate correlations and regression coefficients.

Introducing probability distributions

Here, we introduce the idea of distributions using discrete uniform and binomial distributions.

Discrete uniform distribution

You might remember that, in Chapter 2, Visualizing and manipulating data using R, we examined the outcomes of the roulette game. We showed that each of the 37 numbers (0 to 36) in European roulette has an equal probability of occurring, 1/37, which is approximately 0.02702. Each spin is therefore a draw from a discrete uniform distribution, in which every outcome is equally likely. Another example we have examined is rolling a die. We have shown that the probability of each number occurring is 1/6, which is approximately 0.16667. If we rolled a die a very large number of times, the histogram of the outcomes would show that each number occurred an approximately equal number of times. Let's examine this with the following code:

# simulate one million rolls of a fair six-sided die
rolls = sample(1:6, size = 1000000, replace = TRUE)
# use one bin per face so each of the six outcomes gets its own bar
hist(rolls, breaks = seq(0.5, 6.5, by = 1))

A histogram of a million die rolls
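We can also check this numerically by computing the relative frequency of each face from the rolls vector we just created; each of the six values should be close to 1/6, that is, approximately 0.1667:

# relative frequency of each face; all six values should be close to 1/6
round(table(rolls) / length(rolls), 4)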

The normal distribution

Of course, not all attributes will follow such a distribution (in fact, most do not). Imagine the height of adults; you do not see as many people measuring, say, 140 cm, 180 cm, or 200 cm. Some heights are much more common than others, right? The normal distribution is usually applied to attributes such as height. The normal distribution acknowledges that some values of an attribute are much more likely to occur than others: values close to the arithmetic mean. The further a value is from the mean, the less likely it is to occur under the normal distribution. In fact, around 68 percent of the observations should have values between the mean minus one standard deviation and the mean plus one standard deviation, and around 95 percent of observations should have values between the mean minus two standard deviations and the mean plus two standard deviations. It is important to know that the normal distribution assumes that the entire population is known, but it is widely used to analyze samples with a large number of observations.
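We can verify these two percentages with the pnorm() function, which returns the cumulative probability of the standard normal distribution up to a given value:

# probability mass within one and two standard deviations of the mean
pnorm(1) - pnorm(-1)   # approximately 0.68
pnorm(2) - pnorm(-2)   # approximately 0.95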

The following code plots the shape (the probability density function) of the standard normal distribution (also called the z distribution), which has a mean of 0 and a standard deviation of 1:

# probability density function of the standard normal (z) distribution
curve(dnorm(x, 0, 1), lwd = 2, xlim = c(-3, 3), xlab = "", ylab = "",
   main = "The standard normal distribution")

The following diagram (at the top of the frame) presents this plot. You will notice that this distribution is symmetrical.

The standard normal distribution (at the top of the frame) and a histogram of the heights of adults

Let's compare this shape to that of self-reported height (in inches). The data is from the Galton dataset of the HistData package:

install.packages("HistData")   # install once, if not already installed
library(HistData)
hist(Galton$parent, xlab = "Height",
   main = "Height of adults in inches")

The preceding diagram (bottom frame) presents this plot. We can see that the histogram of the height of adults is quite close to the shape of the normal distribution.
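To make the comparison more direct, we can redraw the histogram on a density scale and overlay a normal curve using the sample's mean and standard deviation. This is a quick sketch using the same Galton data:

# histogram on a density scale, with a fitted normal curve on top
hist(Galton$parent, freq = FALSE, xlab = "Height",
   main = "Height of adults in inches")
curve(dnorm(x, mean(Galton$parent), sd(Galton$parent)),
   lwd = 2, add = TRUE)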

Inspecting data visually is not always enough. The Shapiro-Wilk test (shapiro.test()) might be more informative. Let's create a vector that is not normally distributed (x1, drawn from a uniform distribution) and another that is normally distributed (x2). We then test whether each attribute is normally distributed or not:

x1 = runif(1000)   # 1,000 draws from a uniform distribution
x2 = rnorm(1000)   # 1,000 draws from a standard normal distribution
shapiro.test(x1)

The output for x1 is provided here:

        Shapiro-Wilk normality test
data:  x1
W = 0.957, p-value < 2.2e-16

Given the extremely low p-value, it would be almost impossible to obtain this result if the population the sample was drawn from were normally distributed. As the p-value is lower than 0.05 (the usual threshold), we can conclude that the data is not normally distributed, just as we designed it to be. Let's examine x2:

shapiro.test(x2)

The output for x2 is provided here:

        Shapiro-Wilk normality test
data:  x2
W = 0.9988, p-value = 0.7777

The p-value is very different in this case. The probability of obtaining this result, if the population the sample was drawn from were normally distributed, is about 78 percent. As the p-value is higher than 0.05, we conclude that the data does not differ significantly from a normal distribution.

The Student's t-distribution

The Student's t-distribution resembles the normal distribution but is used when there are few observations. What is important to know is that the sample size affects the shape of the t-distribution through the degrees of freedom. The degrees of freedom are the number of values that remain free to vary once the parameters estimated from the data are fixed. Let's take the example of a sample of 10 observations of an attribute a. If we know the mean of the a attribute, we only need nine observations to know the full sample, as we can rely on the mean and the nine known observations to infer the value of the remaining observation. The following plot shows the t-distribution for 14 (in black) and 199 (in gray) degrees of freedom, corresponding to sample sizes of 15 and 200:

# t-distributions with 14 and 199 degrees of freedom
curve(dt(x, 14), col = "black", lwd = 2, xlim = c(-3, 3), xlab = "",
   ylab = "", main = "The t distribution")
curve(dt(x, 199), col = "grey", lwd = 2, add = TRUE)

The t-distribution for 14 (in black) and 199 (in gray) degrees of freedom
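We can also see how the t-distribution approaches the standard normal distribution as the degrees of freedom increase by comparing their quantiles:

# value below which 97.5 percent of each distribution lies
qt(0.975, df = 14)    # approximately 2.14
qt(0.975, df = 199)   # approximately 1.97
qnorm(0.975)          # approximately 1.96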

The binomial distribution

Now imagine that we want to know the chance that a specific outcome (say, a six), which we will call a success, occurs a precise number of times when throwing a die repeatedly (for example, 100 times). This kind of question requires us to draw upon the binomial distribution.

We can compute this as follows: we first obtain the binomial coefficient, that is, the number of ways to choose n successes out of N throws, computed as N! / (n! * (N - n)!). We then multiply this coefficient by the probability of a single success raised to the power of the number of successes, and by one minus the probability of a single success raised to the power of the difference between the number of throws and the number of successes. In other words, the probability of exactly n successes in N throws is: P(n) = (N! / (n! * (N - n)!)) * p^n * (1 - p)^(N - n). In order to show this more practically, let's use R code to examine the probability that a six appears exactly n times (for n from 0 to 40) in 100 throws. We will rely on the choose() function to compute the binomial coefficient. Here is the code for computing and plotting the probabilities:

p = 1/6            # probability of rolling a six
N = 100            # number of throws
v = numeric(41)    # probabilities for 0 to 40 successes
for (n in 0:40) {
   v[n + 1] = choose(N, n) * p^n * (1 - p)^(N - n)
}
plot(0:40, v, type = "l", xlab = "Exact number of successes",
   ylab = "Probability")

The resulting binomial distribution is displayed here:

Binomial distribution for 100 throws of a die (p = 1/6)
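Note that base R can compute these probabilities directly with the dbinom() function, so we did not strictly need to apply the formula ourselves; the following should match the values in our v vector:

# the same probabilities using the built-in binomial density function
dbinom(0:40, size = 100, prob = 1/6)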

We can notice that the probability of getting an exact number of successes is highest when that number is around 16. It is almost impossible to get more than 30 successes in 100 throws of a fair die.
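Both observations can be checked with dbinom(). The first line finds the most likely number of successes; the second computes the probability of more than 30 successes:

# most probable number of sixes in 100 throws (subtract 1 because
# the first element of the vector corresponds to 0 successes)
which.max(dbinom(0:100, size = 100, prob = 1/6)) - 1   # 16
# probability of more than 30 sixes: a tiny number
sum(dbinom(31:100, size = 100, prob = 1/6))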

The importance of distributions

Probability distributions are important because the significance of the tests we perform is based upon them. Significance is tested by examining where the t or z values corresponding to the estimates (we will see how they are obtained later) lie on the distribution, with the corresponding degrees of freedom (for the t-distribution). Further, most statistical tests, such as regression and correlation, assume that the data is normally distributed. In most cases, a value must lie in the extreme 5 percent of the distribution to be considered significant. This is the approach we will rely on in this chapter.
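For example, the cut-off values delimiting the extreme 5 percent (2.5 percent in each tail) of a distribution can be obtained with the quantile functions:

# two-tailed 5 percent critical values
qnorm(c(0.025, 0.975))         # standard normal: about -1.96 and 1.96
qt(c(0.025, 0.975), df = 14)   # t with 14 df: wider, about -2.14 and 2.14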
