Probability distributions

R makes it easy to plot and get statistical information on many probability distributions. A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence. The following table summarizes many of the common probability distributions available in R:

Probability distribution    R name
------------------------    --------
Beta                        beta
Binomial                    binom
Cauchy                      cauchy
Chi-squared                 chisq
Exponential                 exp
F                           f
Gamma                       gamma
Geometric                   geom
Hypergeometric              hyper
Logistic                    logis
Lognormal                   lnorm
Negative Binomial           nbinom
Normal                      norm
Poisson                     pois
Student t                   t
Uniform                     unif
Tukey                       tukey
Weibull                     weibull
Wilcoxon                    wilcox

You can also get this summary in R by entering help("distributions"). For additional probability distributions, and the packages that provide them, you can consult the CRAN distributions page at http://cran.r-project.org/web/views/Distributions.html.

For each probability distribution, you can obtain the probability mass function (for discrete distributions) or the probability density function (for continuous distributions) by adding the d prefix, the cumulative distribution function by adding the p prefix, and the quantile function by adding the q prefix to the R name shown in the previous table. You can also generate random numbers from these probability distributions by adding the r prefix to the R name. For example, you can use qnorm() to call the quantile function for a normal distribution and rpois() to generate random numbers from a Poisson distribution.
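As a minimal sketch of the prefix convention described above, the following applies all four prefixes to the binomial distribution. The parameter values (10 trials with a success probability of 0.5) and the variable names are arbitrary choices made for this illustration:

```r
# Illustrative parameters (assumed for this example only)
n <- 10    # number of trials
p <- 0.5   # probability of success on each trial

d3    <- dbinom(3, size = n, prob = p)    # d prefix: P(X = 3), the mass function
p3    <- pbinom(3, size = n, prob = p)    # p prefix: P(X <= 3), the cumulative distribution function
q.med <- qbinom(0.5, size = n, prob = p)  # q prefix: smallest x with P(X <= x) >= 0.5
r5    <- rbinom(5, size = n, prob = p)    # r prefix: five random draws

# The cumulative probability is the running sum of the mass function
cdf.check <- all.equal(p3, sum(dbinom(0:3, size = n, prob = p)))
print(c(d3 = d3, p3 = p3, q.med = q.med))
print(cdf.check)
```

The same pattern applies to the other R names in the table, although the parameter names differ from one distribution to the next.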

For the 0.65 quantile of a normal distribution with a mean of 7.5 and standard deviation of 4, we would enter:

> qnorm(0.65, mean=7.5, sd=4)
[1] 9.041282

To generate seven random numbers from a Poisson distribution with a lambda equal to 4, we would enter:

> rpois(7, lambda=4)
[1] 2 3 5 4 6 3 5
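Because rpois() draws random numbers, the values you get will generally differ from the output shown above. If you need reproducible draws, one common approach is to fix the state of the random number generator with set.seed() first. The seed value 42 below is an arbitrary choice for this sketch:

```r
# Fix the random number generator state; 42 is an arbitrary example seed
set.seed(42)
first <- rpois(7, lambda = 4)

# Resetting the same seed reproduces exactly the same draws
set.seed(42)
second <- rpois(7, lambda = 4)

same.draws <- identical(first, second)
print(same.draws)
```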

Now, let's consider a more detailed example using probability distribution functions to solve a particular problem. Say the average number of liters of water consumed per day for children under the age of 12 has a normal distribution with a mean of 7.5 and a standard deviation of 1.5. Since the 68–95–99.7 rule (also known as the three-sigma rule or empirical rule) states that 99.7 percent of values drawn from a normal distribution fall within three standard deviations of the mean, we can approximate the interval to use for the x values in our plot, as follows:

> ld.mean <- 7.5
> ld.sd <- 1.5
> ld.mean+3*ld.sd
[1] 12
> ld.mean-3*ld.sd
[1] 3

So, from these calculations, we can use an interval of [0, 16], which comfortably covers the range from 3 to 12 where almost all randomly generated values will fall:

> x <- seq(0, 16, length=100)
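The percentages quoted by the 68–95–99.7 rule can be checked directly with pnorm(). The following sketch computes the probability of falling within one, two, and three standard deviations of the mean, first for a standard normal distribution and then for the mean of 7.5 and standard deviation of 1.5 used in the code above. The variable names k, within.k, and within.3 are introduced here for illustration:

```r
# Probability within k standard deviations of the mean (standard normal)
k <- 1:3
within.k <- pnorm(k) - pnorm(-k)
print(round(within.k, 4))   # 0.6827 0.9545 0.9973

# The same check for the water consumption example
ld.mean <- 7.5
ld.sd <- 1.5
within.3 <- pnorm(ld.mean + 3 * ld.sd, mean = ld.mean, sd = ld.sd) -
  pnorm(ld.mean - 3 * ld.sd, mean = ld.mean, sd = ld.sd)
print(within.3)
```

The result does not depend on the mean or standard deviation, which is why the rule holds for any normal distribution.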

Next, we will use the dnorm() function, along with our mean and standard deviation, to return the density curve for average liters of water consumed per day for children under the age of 12:

> nd.height <- dnorm(x, mean = 7.5, sd = 1.5)

Now, we can plot the normal density curve in R using the plot() function. We will set type = "l" in the plot() function to graph a line instead of points, as shown in the following command:

> plot(x, nd.height, type = "l", xlab = "Liters per day", ylab = "Density", main = "Liters of water drunk by school children < 12 years old")

The graph for this normal curve is shown in the following plot:

[Figure: normal density curve; x axis "Liters per day", y axis "Density"]

Suppose we want to evaluate the probability of a child drinking less than 4 liters of water per day. We can get this probability by measuring the area under the curve to the left of 4 with pnorm(), which evaluates the cumulative distribution function, as shown in the following code. Since we want the area to the left of 4, we set lower.tail=TRUE (the default) in pnorm(); to measure the area to the right of 4 instead, we would set lower.tail=FALSE:

> pnorm(4, mean = 7.5, sd = 1.5, lower.tail = TRUE)
[1] 0.009815329
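To make the role of lower.tail concrete, the following sketch computes both tails at 4 and confirms that they sum to 1. It also uses qnorm() to show that the quantile function inverts pnorm(). The names p.less, p.more, and x.back are introduced for this example:

```r
p.less <- pnorm(4, mean = 7.5, sd = 1.5)                      # area to the left of 4
p.more <- pnorm(4, mean = 7.5, sd = 1.5, lower.tail = FALSE)  # area to the right of 4

# The two tails together cover the whole distribution
print(p.less + p.more)

# qnorm() maps the lower-tail probability back to the original value
x.back <- qnorm(p.less, mean = 7.5, sd = 1.5)
print(x.back)
```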

We can plot the cumulative distribution function for x, as follows:

> ld.cdf <- pnorm(x, mean = 7.5, sd = 1.5, lower.tail = TRUE)
> plot(x, ld.cdf, type = "l", xlab = "Liters per day", ylab = "Cumulative Probability")

The result is shown in the following graph:

[Figure: cumulative distribution curve; x axis "Liters per day", y axis "Cumulative Probability"]

We can also show the probability of a child drinking more than 8 liters of water per day on our normal curve by setting lower and upper boundaries and then coloring in that area using the polygon() function. By looking at our cumulative distribution function plot (shown in the previous diagram), we can see that the probability of a child drinking more than 15 liters per day approaches zero, so we can set our upper limit to 15.

Plot the normal curve using the plot() function, as follows:

> plot(x, nd.height, type = "l", xlab = "Liters per day", ylab = "Density")

Set the lower and upper limits, as follows:

> ld.lower <- 8
> ld.upper <- 15

Get all values of x that fall between 8 and 15:

> i <- x >= ld.lower & x <= ld.upper  # returns a logical vector

Now, we can highlight the area under the curve corresponding to the probability of a child drinking more than 8 liters of water in red with the polygon() function:

> polygon(c(ld.lower, x[i], ld.upper), c(0, nd.height[i], 0), col = "red")
> abline(h = 0, col = "gray")

Calculate the probability of a child drinking more than 8 liters of water per day, rounded to two decimal places:

> pb <- round(pnorm(8, mean = 7.5, sd = 1.5, lower.tail = FALSE), 2)
> pb
[1] 0.37

Use the paste() function to create a character vector that will concatenate the pb value to our text:

> pb.results <- paste("Cumulative probability of a child drinking > 8L/day", pb, sep=": ")

Add the pb.results text as the title of our plot:

> title(pb.results)

The result is shown in the following graph:

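[Figure: normal density curve with the area between 8 and 15 shaded in red, titled with the probability of a child drinking > 8L/day]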
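The shaded region stops at 15, whereas the probability in the title is the full upper tail beyond 8. As a rough check that the two agree, the following sketch integrates the density numerically over [8, 15] with integrate() and compares the result with pnorm(). The names area, exact, and upper.tail are introduced for this illustration:

```r
# Numerically integrate the normal density over the shaded interval
area <- integrate(dnorm, lower = 8, upper = 15, mean = 7.5, sd = 1.5)

# The same area from the cumulative distribution function
exact <- pnorm(15, mean = 7.5, sd = 1.5) - pnorm(8, mean = 7.5, sd = 1.5)

# The full upper tail beyond 8, as used for the plot title
upper.tail <- pnorm(8, mean = 7.5, sd = 1.5, lower.tail = FALSE)

print(c(integrated = area$value, exact = exact, upper.tail = upper.tail))
```

The three values agree to well within the two decimal places shown in the title, because the probability beyond 15 (five standard deviations above the mean) is negligible.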