Bootstrapping

The principle of (nonparametric) bootstrapping is to create a number K of samples of size N drawn with replacement from the original sample, where N is the original sample size. The parameters are estimated for each sample separately. This allows computing their confidence intervals, a measure of the variability of the parameters. Apart from making deviations from normal distributions less problematic, bootstrapping is useful for samples with a small number of observations (fewer than 100), as is the case with ours.

We will discuss bootstrapping in Chapter 14, Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML, but let's have a sneak peek now! Bootstrapping is easily performed using several functions in R, for instance the boot() function in the boot package (we sketch its use at the end of this section). But let's have a little fun and perform the bootstrapping ourselves, 2,000 times. We will first generate the samples and obtain the estimates. We then display the estimates for the first six samples (rounded to the third decimal):

ret = data.frame(matrix(nrow = 0, ncol = 6))
set.seed(567)
for (i in 1:2000) {
   # draw a bootstrap sample of 40 observations, with replacement
   data = nurses[sample(nrow(nurses), 40, replace = TRUE), ]
   # refit the regression on the bootstrap sample
   model_i <- lm(Commit ~ Exhaus + Depers + Accompl,
      data = data)
   # store the coefficients, the R-squared, and the F statistic
   ret = rbind(ret, c(coef(model_i), summary(model_i)$r.squared,
      summary(model_i)$fstatistic[1]))
}
names(ret) = c("Intercept", "Exhaus", "Depers",
   "Accomp", "R2", "F")
round(head(ret), 3)

The following is the output. As you can see, the values of the parameters all differ, because, as we mentioned, they are based on different samples. Yet they are not far apart from one another:

[Output: the first six rows of ret, with the bootstrapped coefficients, R-squared, and F values]

Using the same seed number as before, we can see which observations of the original data were selected for, say, the first sample:

set.seed(567)
sample(nrow(nurses), 40, replace = TRUE)

In the following output, we can see that the 1st and 21st observations of the original dataset appear three times; the 2nd, 3rd, 11th, 12th, and several others appear twice; the 6th, 9th, 10th, and others appear once; and, finally, observations 4, 5, 7, and others do not appear in this sample at all:

[1] 30 36 26 20 11 10  3 21 24 22 14 11 15 24  1  3 21 30  2 12
[21]  6 21  9 16  1 34 37 34 16 39 22 29  2 38 29 38 37 28 12  1
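
Rather than counting occurrences by eye, we can let R tabulate them. Using the same seed reproduces the sample above, and table() counts how often each observation was drawn:

set.seed(567)
table(sample(nrow(nurses), 40, replace = TRUE))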

As we mentioned, generating several samples in this fashion allows you to get a sense of the variability of the parameters, and confidence intervals are a good way to quantify this variability. So let's compute the 95 percent confidence intervals based on the data we just generated. The formula to compute a confidence interval for a mean is:

$CI_{95\%} = M \pm z \cdot \frac{SD}{\sqrt{K}}$

Here M is the mean of the bootstrapped estimates, SD is their standard deviation, K is the number of samples (2,000 in our case), and z is the value of the standard normal distribution at the 97.5th percentile (1 - 0.05/2 = 0.975). It is obtained with the following line of code, which returns approximately 1.959964 (rounded to 1.96 in the computation below):

qnorm(0.975)

So here we go:

CIs = data.frame(matrix(nrow = ncol(ret), ncol = 2))
for (j in 1:ncol(ret)) {
   M = mean(ret[,j])
   SD = sd(ret[,j])
   # 1.96 is qnorm(0.975) rounded; 2000 is the number of samples
   lowerb = M - (1.96 * (SD / sqrt(2000)))
   upperb = M + (1.96 * (SD / sqrt(2000)))
   CIs[j,1] = round(lowerb, 3)
   CIs[j,2] = round(upperb, 3)
}
names(CIs) = c("95% C.I. lower bound", "95% C.I. upper bound")
rownames(CIs) = colnames(ret)
CIs

The resulting confidence intervals are provided here:

[Output: the CIs data frame, with the lower and upper 95 percent confidence bounds for each parameter]

The confidence intervals encompass all the values between the lower and upper bounds. We can see that no confidence interval contains 0, meaning that, at the 95 percent confidence level, the reported values are statistically different from 0 (more correctly put, if the true population value of a parameter were 0, there would be at most a 5 percent chance of obtaining an interval that excludes 0, as these do). So we conclude that the bootstrapped coefficients are different from 0, as is the multiple R-squared value.
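
If you prefer to check this programmatically rather than by eye, a quick test on the CIs data frame we just built flags any interval that contains 0 (every row should return FALSE):

# TRUE if the interval for a parameter contains 0
CIs[,1] <= 0 & CIs[,2] >= 0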

As you might have noticed, the value against which to compare the confidence interval for F is not 0, but a critical value that depends upon the degrees of freedom. We computed this value earlier, and it was 2.866266. As the confidence interval for F does not include this value, we can be confident that the bootstrapped model predicts a significant part of the variance.
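
If you want to recompute that critical value, it comes from the F distribution. Assuming, as in the bootstrap samples above, 40 observations and 3 predictors (hence 3 and 36 degrees of freedom), the following returns 2.866266:

# critical F value at alpha = .05, with df1 = 3 and df2 = 40 - 3 - 1 = 36
qf(0.95, 3, 36)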

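As a preview of the boot package approach mentioned at the beginning of this section, here is a minimal sketch of how the same resampling could be delegated to boot() and boot.ci(); the statistic function simply refits the model on the resampled rows and returns its coefficients (the details will differ in Chapter 14):

library(boot)
# statistic function: refit the model on the rows selected by boot()
boot_fn <- function(data, indices) {
   coef(lm(Commit ~ Exhaus + Depers + Accompl, data = data[indices, ]))
}
set.seed(567)
boot_out <- boot(nurses, boot_fn, R = 2000)
# percentile confidence interval for the second statistic (the Exhaus coefficient)
boot.ci(boot_out, type = "perc", index = 2)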