Understanding distributions and transformation

Understanding probability distributions is important because every statistical hypothesis test rests on assumptions about them. In linear regression analysis, for example, the errors are assumed to be normally distributed and the relationship between the variables is assumed to be linear. Hence, before moving to the stage of model formation, it is important to look at the shape of each variable's distribution and at the transformations that could bring it closer to the required form, so that further statistical techniques can be applied to the variables.

Normal probability distribution

The importance of the normal distribution rests largely on the Central Limit Theorem (CLT), which states that the means of all possible samples of size n drawn from a population with mean μ and variance σ² are approximately normally distributed with mean μ and variance σ²∕n as n increases towards infinity. Checking the normality of variables also helps to detect outliers, so that they do not unduly influence the prediction process. Outliers not only distort the predicted values but can also destabilize the predictive model. The following sample code and example show how to check normality graphically and interpret the result.
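
The CLT statement above can be illustrated with a small simulation. The following is a minimal sketch, not part of the original example: the exponential population, the sample size of 50, the number of repetitions, and the seed are all assumptions chosen only for illustration.

```r
# Simulate the sampling distribution of the mean for a skewed population.
set.seed(42)     # arbitrary seed, used only for reproducibility
n    <- 50       # sample size (assumed for illustration)
reps <- 10000    # number of repeated samples (assumed for illustration)

# Exponential population with rate 1: mean mu = 1 and variance sigma^2 = 1
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)               # should be close to mu = 1
var(sample_means)                # should be close to sigma^2 / n = 0.02
hist(sample_means, breaks = 40)  # roughly bell-shaped although the population is skewed
```

Even though the exponential population is strongly skewed, the histogram of the sample means should look approximately normal, with a spread that shrinks as n grows.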

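To assess whether a variable is normally distributed, we can compare summary statistics such as the mean, median, standard deviation, variance, and skewness. The Cars93 dataset comes from the MASS package, and skewness() is not part of base R; it is provided by add-on packages such as e1071 or moments: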

> mean(Cars93$Price)

[1] 19.50968

> median(Cars93$Price)

[1] 17.7

> sd(Cars93$Price)

[1] 9.65943

> var(Cars93$Price)

[1] 93.30458

> skewness(Cars93$Price)

[1] 1.483982

> ggplot(data = Cars93, aes(x = Price)) + geom_density(fill = "blue")

From the density plot, we can conclude that the price variable is positively skewed because of some outlier values on the right-hand side of the distribution. The mean of the price variable (19.51) is inflated and greater than the median (17.7) because the mean is sensitive to extreme values.
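
As a complement to the density plot, the sketch below computes a moment-based skewness by hand, compares the raw and log-transformed price, and draws a normal Q-Q plot. The helper name skew is hypothetical, and the log transform is only one possible choice for a right-skewed positive variable; add-on packages may apply a slightly different small-sample correction, so their results can differ a little from this helper.

```r
library(MASS)   # provides the Cars93 data set

# Hypothetical helper: third central moment divided by the
# 1.5th power of the second central moment.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew(Cars93$Price)        # positive value: long right tail
skew(log(Cars93$Price))   # closer to zero after the log transform

# Normal Q-Q plot: points that bend away from the reference line
# indicate a departure from normality.
qqnorm(Cars93$Price)
qqline(Cars93$Price)
```

If the transformed variable is closer to symmetric, it is usually a better candidate for techniques that assume normality.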

Now let's look at a case where the normal distribution can be used to answer a probability question.

Suppose the variable mileage per gallon on a highway is normally distributed with a mean of 29.08 and a standard deviation of 5.33. What is the probability that a new car would provide a mileage of more than 35?

> pnorm(35,mean(Cars93$MPG.highway),sd(Cars93$MPG.highway),lower.tail = FALSE)

[1] 0.1336708

Hence the required probability that a new car would provide a mileage of more than 35 is 13.37%. Because 35 lies above the mean, we need the area in the upper tail of the distribution, which is why lower.tail is set to false.
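
The same upper-tail probability can be reached in several equivalent ways. The sketch below uses the rounded mean and standard deviation quoted in the example, so its result can differ slightly from the console output above; the variable names are chosen only for illustration.

```r
mu    <- 29.08   # mean quoted in the example
sigma <- 5.33    # standard deviation quoted in the example

# P(X > 35) taken directly from the upper tail
p_upper <- pnorm(35, mean = mu, sd = sigma, lower.tail = FALSE)

# The same probability as the complement of the lower tail
p_compl <- 1 - pnorm(35, mean = mu, sd = sigma)

# The same probability after standardising to a z-score
z   <- (35 - mu) / sigma
p_z <- pnorm(z, lower.tail = FALSE)

c(p_upper, p_compl, p_z)   # all three values agree
```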

Binomial probability distribution

The binomial distribution is a discrete probability distribution. It describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes: success or failure, yes or no. For instance, in the Cars93 dataset, the variable Man.trans.avail records whether a manual transmission is available as yes or no.

Let's take an example to explain where the binomial distribution can be used. The probability that a car is defective because a specific component is not functioning is 0.1 (10%). You have 93 cars manufactured. What is the probability that at most 1 defective car is found in the current lot of 93:

> pbinom(1,93,prob = 0.1)

[1] 0.0006293772

So the required probability that at most 1 defective car is found in a lot of 93 cars is about 0.0006, which is very low, given that the probability of a defective part is 0.10.
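
Note that pbinom(q, ...) returns the cumulative probability P(X ≤ q), so "at most" and "at least" questions need different calls. The following sketch, with variable names chosen only for illustration, shows both for the lot of 93 cars.

```r
n <- 93    # cars in the lot
p <- 0.1   # probability that a single car is defective

# P(X <= 1): at most one defective car
p_at_most_1 <- pbinom(1, size = n, prob = p)

# The same value built from the individual probabilities P(X = 0) + P(X = 1)
p_by_hand <- sum(dbinom(0:1, size = n, prob = p))

# P(X >= 1): at least one defective car, as the complement of P(X = 0)
p_at_least_1 <- 1 - dbinom(0, size = n, prob = p)

c(p_at_most_1, p_by_hand, p_at_least_1)
```

With a 10% defect rate and 93 cars, finding at least one defective car is almost certain, while finding at most one is very unlikely.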

Poisson probability distribution

The Poisson distribution is used for count data. Given the average rate at which an event occurs in a fixed interval of time or space, it gives the probability of observing any particular number of events in that interval.

Let's take an example. Suppose that, on average, 200 customers visit a particular e-commerce website every minute. Find the probability that more than 250 customers visit the same website in a minute:

> ppois(250,200,lower.tail = FALSE)

[1] 0.0002846214

Hence, the required probability is about 0.0003, so such an event is very rare. Apart from the aforementioned common probability distributions, which are used very frequently, there are many other distributions that can be used in rare situations.
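
For the Poisson example above, the wording of the question decides which function and cutoff to use. The sketch below, with variable names chosen only for illustration, contrasts "more than 250", "at least 250", and "exactly 250" visitors.

```r
lambda <- 200   # average number of visitors per minute, from the example

# P(X > 250): more than 250 visitors
p_more_than_250 <- ppois(250, lambda, lower.tail = FALSE)

# P(X >= 250): at least 250 visitors, so the cutoff moves down to 249
p_at_least_250 <- ppois(249, lambda, lower.tail = FALSE)

# P(X = 250): exactly 250 visitors
p_exactly_250 <- dpois(250, lambda)

c(p_more_than_250, p_at_least_250, p_exactly_250)
```

The "at least" probability equals the "more than" probability plus the probability of exactly 250 visitors.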

