Models for count data

Logistic regression can handle only binary responses. If you have count data instead, such as the number of deaths or failures in a given period of time, or in a given geographical area, you can use Poisson or negative binomial regression. Count responses are particularly common when working with aggregated data, which records the number of events falling into each of several categories.

Poisson regression

Poisson regression models are generalized linear models with the logarithm as the link function, and they assume that the response has a Poisson distribution. The Poisson distribution takes only non-negative integer values, which makes it appropriate for count data, such as the number of events occurring over a fixed period of time, especially when the events are relatively rare, for example, the number of hard drive failures per day.
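A quick simulation illustrates these properties (a sketch with an arbitrarily chosen rate of 2 events per day): the distribution produces only non-negative integers, and, as we will rely on later, its mean and variance coincide:

```r
# The Poisson distribution produces non-negative integer counts whose
# mean and variance both equal the rate parameter lambda
set.seed(42)
x <- rpois(1e5, lambda = 2)
mean(x)                      # close to 2
var(x)                       # also close to 2
all(x == floor(x) & x >= 0)  # TRUE: integer counts only
```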

In the following example, we will use the Hard Drive Data Sets for the year 2013. The dataset was downloaded from https://docs.backblaze.com/public/hard-drive-data/2013_data.zip, but we polished and simplified it a bit. Each record in the original database corresponds to a daily snapshot of one drive. The failure variable, our main point of interest, can be either zero (if the drive is OK) or one (on the last day of a hard drive before failing).

Let's try to determine which factors affect the appearance of a failure. The potential predictive factors are the following:

  • model: The manufacturer-assigned model number of the drive
  • capacity_bytes: The drive capacity in bytes
  • age_month: The average drive age in months
  • temperature: The hard disk drive temperature
  • PendingSector: A logical value indicating the occurrence of unstable sectors (waiting for remapping on the given hard drive, on the given day)

We aggregated the original dataset by these variables, where the freq variable denotes the number of records in the given category. It's time to load this final, cleansed, and aggregated dataset:

> dfa <- readRDS('SMART_2013.RData')

Take a quick look at the number of failures by model:

> (ct <- xtabs(~model+failure, data=dfa))
              failure
model             0    1    2    3    4    5    8
  HGST          136    1    0    0    0    0    0
  Hitachi      2772   72    6    0    0    0    0
  SAMSUNG       125    0    0    0    0    0    0
  ST1500DL001    38    0    0    0    0    0    0
  ST1500DL003   213   39    6    0    0    0    0
  ST1500DM003    84    0    0    0    0    0    0
  ST2000DL001    51    4    0    0    0    0    0
  ST2000DL003    40    7    0    0    0    0    0
  ST2000DM001    98    0    0    0    0    0    0
  ST2000VN000    40    0    0    0    0    0    0
  ST3000DM001   771  122   34   14    4    2    1
  ST31500341AS 1058   75    8    0    0    0    0
  ST31500541AS 1010  106    7    1    0    0    0
  ST32000542AS  803   12    1    0    0    0    0
  ST320005XXXX  209    1    0    0    0    0    0
  ST33000651AS  323   12    0    0    0    0    0
  ST4000DM000   242   22   10    2    0    0    0
  ST4000DX000   197    1    0    0    0    0    0
  TOSHIBA       126    2    0    0    0    0    0
  WDC          1874   27    1    2    0    0    0

Now, let's get rid of those hard drive models that didn't have any failures by removing all rows from the preceding table where every column except the first contains only zeros:

> dfa <- dfa[dfa$model %in% names(which(rowSums(ct) - ct[, 1] > 0)),]

To get a quick overview of the number of failures, let's plot histograms on a log scale by model number, with the help of the ggplot2 package:

> library(ggplot2)
> ggplot(rbind(dfa, data.frame(model='All', dfa[, -1] )), 
+   aes(failure)) + ylab("log(count)") + 
+   geom_histogram(binwidth = 1, drop=TRUE, origin = -0.5)  + 
+   scale_y_log10() + scale_x_continuous(breaks=c(0:10)) + 
+   facet_wrap( ~ model, ncol = 3) +
+   ggtitle("Histograms by manufacturer") + theme_bw()

Now, it's time to fit a Poisson regression model to the data, using the model number as the predictor. The model can be fitted using the glm function with family = poisson. With its default log link, the expected log-count is modeled.

In the database, each observation corresponds to a group with a varying number of hard drives. As we need to handle the different group sizes, we pass the logarithm of the group size as an offset:

> poiss.base <- glm(failure ~ model, offset = log(freq),
+   family = 'poisson', data = dfa)
> summary(poiss.base)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7337  -0.8052  -0.5160  -0.3291  16.3495  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.0594     0.5422  -9.331  < 2e-16 ***
modelHitachi        1.7666     0.5442   3.246  0.00117 ** 
modelST1500DL003    3.6563     0.5464   6.692 2.20e-11 ***
modelST2000DL001    2.5592     0.6371   4.017 5.90e-05 ***
modelST2000DL003    3.1390     0.6056   5.183 2.18e-07 ***
modelST3000DM001    4.1550     0.5427   7.656 1.92e-14 ***
modelST31500341AS   2.7445     0.5445   5.040 4.65e-07 ***
modelST31500541AS   3.0934     0.5436   5.690 1.27e-08 ***
modelST32000542AS   1.2749     0.5570   2.289  0.02208 *  
modelST320005XXXX  -0.4437     0.8988  -0.494  0.62156    
modelST33000651AS   1.9533     0.5585   3.497  0.00047 ***
modelST4000DM000    3.8219     0.5448   7.016 2.29e-12 ***
modelST4000DX000  -12.2432   117.6007  -0.104  0.91708    
modelTOSHIBA        0.2304     0.7633   0.302  0.76279    
modelWDC            1.3096     0.5480   2.390  0.01686 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22397  on 9858  degrees of freedom
Residual deviance: 17622  on 9844  degrees of freedom
AIC: 24717

Number of Fisher Scoring iterations: 15

First, let's interpret the coefficients. The model number is a discrete predictor, so it enters the model as a set of dummy variables. The reference category is not shown in the output by default, but we can query it at any time:

> contrasts(dfa$model, sparse = TRUE)
HGST         . . . . . . . . . . . . . .
Hitachi      1 . . . . . . . . . . . . .
ST1500DL003  . 1 . . . . . . . . . . . .
ST2000DL001  . . 1 . . . . . . . . . . .
ST2000DL003  . . . 1 . . . . . . . . . .
ST3000DM001  . . . . 1 . . . . . . . . .
ST31500341AS . . . . . 1 . . . . . . . .
ST31500541AS . . . . . . 1 . . . . . . .
ST32000542AS . . . . . . . 1 . . . . . .
ST320005XXXX . . . . . . . . 1 . . . . .
ST33000651AS . . . . . . . . . 1 . . . .
ST4000DM000  . . . . . . . . . . 1 . . .
ST4000DX000  . . . . . . . . . . . 1 . .
TOSHIBA      . . . . . . . . . . . . 1 .
WDC          . . . . . . . . . . . . . 1

So, it turns out that the reference category is HGST, and each dummy variable compares a model with the HGST hard drives. For example, the coefficient of Hitachi is 1.77, so the expected log-count for Hitachi drives is about 1.77 greater than that for HGST drives. Or, you can exponentiate it when speaking about ratios instead of differences:

> exp(1.7666)
[1] 5.850926

So, the expected number of failures for Hitachi drives is 5.85 times greater than for HGST drives. In general, the interpretation is: a one-unit increase in X multiplies the expected count by exp(b).
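As a sanity check of this multiplicative interpretation, note that for a Poisson model with a single binary factor, the exponentiated coefficient is exactly the ratio of the two group means. A minimal sketch with made-up counts:

```r
# Made-up counts in two groups: mean(A) = 2, mean(B) = 10
d <- data.frame(group = rep(c('A', 'B'), each = 3),
                count = c(1, 2, 3, 5, 10, 15))
fit <- glm(count ~ group, family = poisson, data = d)
exp(coef(fit)['groupB'])  # 5, i.e. mean(B) / mean(A) = 10 / 2
```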

Similar to logistic regression, let's determine the significance of the model. To do this, we compare the present model to the null model without any predictors by looking at the difference between the null deviance and the residual deviance. We expect this difference to be large, and the corresponding chi-squared test to be significant:

> library(lmtest)
> lrtest(poiss.base)
Likelihood ratio test

Model 1: failure ~ model
Model 2: failure ~ 1
  #Df LogLik  Df  Chisq Pr(>Chisq)    
1  15 -12344                          
2   1 -14732 -14 4775.8  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And it seems that the model is significant, but we should also try to determine whether any of the model assumptions might fail.
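The same statistic can be reproduced by hand from the deviances reported in the summary above, as the drop in deviance follows a chi-squared distribution with degrees of freedom equal to the number of extra parameters:

```r
# Null deviance minus residual deviance, on 9858 - 9844 = 14 degrees of freedom
chisq <- 22397 - 17622
chisq                                       # 4775, matching lrtest up to rounding
pchisq(chisq, df = 14, lower.tail = FALSE)  # practically zero
```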

Just as with the linear and logistic regression models, we have an independence assumption: Poisson regression assumes the events to be independent, meaning that the occurrence of one failure does not make another more or less likely. In the case of drive failures, this assumption is reasonable. Another important assumption comes from the fact that the Poisson distribution has an equal mean and variance, so our model assumes that the variance, conditioned on the predictor variables, is approximately equal to the mean.

To decide whether this assumption holds, we can compare the residual deviance to its degrees of freedom. For a well-fitting model, their ratio should be close to one. Unfortunately, the reported residual deviance is 17622 on 9844 degrees of freedom, so their ratio is well above one, which suggests that the variance is much greater than the mean. This phenomenon is called overdispersion.
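This diagnostic is a one-liner on any fitted glm object. As a sketch, the following simulates counts that are deliberately over-dispersed (negative binomial instead of Poisson) and shows the tell-tale ratio:

```r
# Fit a Poisson model to counts whose variance greatly exceeds their mean;
# the deviance-to-degrees-of-freedom ratio then ends up well above 1
set.seed(1)
y <- rnbinom(500, mu = 3, size = 0.5)  # variance = 3 + 3^2 / 0.5 = 21
fit <- glm(y ~ 1, family = poisson)
deviance(fit) / df.residual(fit)       # well above 1: overdispersion
```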

Negative binomial regression

In such cases, the negative binomial distribution can be used to model an over-dispersed count response. It is a generalization of the Poisson model, as it has an extra parameter to capture the over-dispersion. In other words, the Poisson and negative binomial models are nested: the former is a special case of the latter.
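The extra parameter (called theta in glm.nb, or size in R's d/p/q/rnbinom functions) controls how fast the variance grows with the mean: Var(Y) = mu + mu^2 / theta, which approaches the Poisson case as theta tends to infinity. A quick simulated check with arbitrary values:

```r
# With mu = 4 and theta = 2, the variance should be 4 + 4^2 / 2 = 12
set.seed(7)
y <- rnbinom(1e5, mu = 4, size = 2)
mean(y)  # close to 4
var(y)   # close to 12, well above the mean
```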

In the following output, we use the glm.nb function from the MASS package to fit a negative binomial regression to our drive failure data:

> library(MASS)
> model.negbin.0 <- glm.nb(failure ~ model,
+   offset = log(freq), data = dfa)

To compare this model's performance to the Poisson model, we can use the likelihood ratio test, since the two models are nested. The negative binomial model shows a significantly better fit:

> lrtest(poiss.base,model.negbin.0)
Likelihood ratio test

Model 1: failure ~ model
Model 2: failure ~ model
  #Df LogLik Df Chisq Pr(>Chisq)    
1  15 -12344                        
2  16 -11950  1 787.8  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This result clearly suggests choosing the negative binomial model.

Multivariate non-linear models

So far, the only predictor in our model was the model name, but we have other potentially important information about the drives as well, such as capacity, age, and temperature. Now let's add these to the model, and determine whether the new model is better than the original one.

Furthermore, let's check the importance of PendingSector as well. In short, we define a two-step model building procedure with the nested models; hence we can use likelihood ratio statistics to test whether the model fit has significantly increased in both steps:

> model.negbin.1 <- update(model.negbin.0, . ~ . + capacity_bytes + 
+   age_month + temperature)
> model.negbin.2 <- update(model.negbin.1, . ~ . + PendingSector)
> lrtest(model.negbin.0, model.negbin.1, model.negbin.2)
Likelihood ratio test

Model 1: failure ~ model
Model 2: failure ~ model + capacity_bytes + age_month + temperature
Model 3: failure ~ model + capacity_bytes + age_month + temperature + 
    PendingSector
  #Df LogLik Df  Chisq Pr(>Chisq)    
1  16 -11950                         
2  19 -11510  3 878.91  < 2.2e-16 ***
3  20 -11497  1  26.84  2.211e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both of these steps are significant, so it was worth adding each predictor to the model. Now, let's interpret the best model:

> summary(model.negbin.2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7147  -0.7580  -0.4519  -0.2187   9.4018  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -8.209e+00  6.064e-01 -13.537  < 2e-16 ***
modelHitachi       2.372e+00  5.480e-01   4.328 1.50e-05 ***
modelST1500DL003   6.132e+00  5.677e-01  10.801  < 2e-16 ***
modelST2000DL001   4.783e+00  6.587e-01   7.262 3.81e-13 ***
modelST2000DL003   5.313e+00  6.296e-01   8.440  < 2e-16 ***
modelST3000DM001   4.746e+00  5.470e-01   8.677  < 2e-16 ***
modelST31500341AS  3.849e+00  5.603e-01   6.869 6.49e-12 ***
modelST31500541AS  4.135e+00  5.598e-01   7.387 1.50e-13 ***
modelST32000542AS  2.403e+00  5.676e-01   4.234 2.29e-05 ***
modelST320005XXXX  1.377e-01  9.072e-01   0.152   0.8794    
modelST33000651AS  2.470e+00  5.631e-01   4.387 1.15e-05 ***
modelST4000DM000   3.792e+00  5.471e-01   6.931 4.17e-12 ***
modelST4000DX000  -2.039e+01  8.138e+03  -0.003   0.9980    
modelTOSHIBA       1.368e+00  7.687e-01   1.780   0.0751 .  
modelWDC           2.228e+00  5.563e-01   4.006 6.19e-05 ***
capacity_bytes     1.053e-12  5.807e-14  18.126  < 2e-16 ***
age_month          4.815e-02  2.212e-03  21.767  < 2e-16 ***
temperature       -5.427e-02  3.873e-03 -14.012  < 2e-16 ***
PendingSectoryes   2.240e-01  4.253e-02   5.267 1.39e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.8045) family taken to be 1)

    Null deviance: 17587  on 9858  degrees of freedom
Residual deviance: 12525  on 9840  degrees of freedom
AIC: 23034

Number of Fisher Scoring iterations: 1

              Theta:  0.8045 
          Std. Err.:  0.0525 

 2 x log-likelihood:  -22993.8850.

Each predictor is significant, with the exception of a few model-type contrasts. For example, Toshiba doesn't differ significantly from the reference category, HGST, when controlling for age, temperature, and so on.

The interpretation of the negative binomial regression parameters is similar to that of the Poisson model. For example, the coefficient of age_month is 0.048, which shows that a one-month increase in age increases the expected log-count of failures by 0.048. Or, you can opt for using exponentials as well:

> exp(data.frame(exp_coef = coef(model.negbin.2)))
                      exp_coef
(Intercept)       2.720600e-04
modelHitachi      1.071430e+01
modelST1500DL003  4.602985e+02
modelST2000DL001  1.194937e+02
modelST2000DL003  2.030135e+02
modelST3000DM001  1.151628e+02
modelST31500341AS 4.692712e+01
modelST31500541AS 6.252061e+01
modelST32000542AS 1.106071e+01
modelST320005XXXX 1.147622e+00
modelST33000651AS 1.182098e+01
modelST4000DM000  4.436067e+01
modelST4000DX000  1.388577e-09
modelTOSHIBA      3.928209e+00
modelWDC          9.283970e+00
capacity_bytes    1.000000e+00
age_month         1.049329e+00
temperature       9.471743e-01
PendingSectoryes  1.251115e+00

So, it seems that an extra month of lifetime increases the expected number of failures by 4.9 percent, and a larger capacity also increases the number of failures. On the other hand, temperature shows a reversed effect: the exponent of the coefficient is 0.947, which says that each additional degree of warmth decreases the expected number of failures by 5.3 percent.
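These percentages follow directly from the exponentiated coefficients: a coefficient b translates into a 100 * (exp(b) - 1) percent change in the expected count:

```r
# Percent changes implied by the age_month and temperature coefficients
100 * (exp(0.04815) - 1)   # about +4.9 percent per extra month of age
100 * (1 - exp(-0.05427))  # about -5.3 percent per extra degree
```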

The effect of the model name can be judged on the basis of comparisons to the reference category, which is HGST in our case. One may want to change this reference, for example, to the most common drive, WDC. This can easily be done by changing the order of the factor levels in hard drive models, or simply by defining the reference category via the extremely useful relevel function:

> dfa$model <- relevel(dfa$model, 'WDC')

Now, let's verify that HGST has indeed replaced WDC in the coefficients list. Instead of the lengthy output of summary, we will use the tidy function from the broom package, which can extract the most important features of different statistical models (for the model summary, take a look at the glance function):

> model.negbin.3 <- update(model.negbin.2, data = dfa)
> library(broom)
> format(tidy(model.negbin.3), digits = 4)
                term   estimate std.error statistic    p.value
1        (Intercept) -5.981e+00 2.173e-01 -27.52222 9.519e-167
2          modelHGST -2.228e+00 5.563e-01  -4.00558  6.187e-05
3       modelHitachi  1.433e-01 1.009e-01   1.41945  1.558e-01
4   modelST1500DL003  3.904e+00 1.353e-01  28.84295 6.212e-183
5   modelST2000DL001  2.555e+00 3.663e-01   6.97524  3.054e-12
6   modelST2000DL003  3.085e+00 3.108e-01   9.92496  3.242e-23
7   modelST3000DM001  2.518e+00 9.351e-02  26.92818 1.028e-159
8  modelST31500341AS  1.620e+00 1.069e-01  15.16126  6.383e-52
9  modelST31500541AS  1.907e+00 1.016e-01  18.77560  1.196e-78
10 modelST32000542AS  1.751e-01 1.533e-01   1.14260  2.532e-01
11 modelST320005XXXX -2.091e+00 7.243e-01  -2.88627  3.898e-03
12 modelST33000651AS  2.416e-01 1.652e-01   1.46245  1.436e-01
13  modelST4000DM000  1.564e+00 1.320e-01  11.84645  2.245e-32
14  modelST4000DX000 -1.862e+01 1.101e+03  -0.01691  9.865e-01
15      modelTOSHIBA -8.601e-01 5.483e-01  -1.56881  1.167e-01
16    capacity_bytes  1.053e-12 5.807e-14  18.12597  1.988e-73
17         age_month  4.815e-02 2.212e-03  21.76714 4.754e-105
18       temperature -5.427e-02 3.873e-03 -14.01175  1.321e-44
19  PendingSectoryes  2.240e-01 4.253e-02   5.26709  1.386e-07

Note

Use the broom package to extract model coefficients, measures of model fit, and other metrics in a form that can be passed on to, for example, ggplot2.

The effect of temperature suggests that the higher the temperature, the lower the number of hard drive failures. However, everyday experiences show a very different picture, for example, as described at https://www.backblaze.com/blog/hard-drive-temperature-does-it-matter. Google engineers found that temperature was not a good predictor of failure, while Microsoft and the University of Virginia found that it had a significant effect. Disk drive manufacturers suggest keeping disks at cooler temperatures.

So, let's take a closer look at this interesting question, with temperature as a predictor of drive failure. First, let's classify temperature into six equal-sized categories, and then draw a bar plot presenting the mean number of failures per category. Note that we have to take the different group sizes into account, so we will weight by freq; and as we are doing some data aggregation, it's the right time to convert our dataset into a data.table object (the cut2 function used below comes from the Hmisc package):

> library(data.table)
> library(Hmisc)
> dfa <- data.table(dfa)
> dfa[, temp6 := cut2(temperature, g = 6)]
> temperature.weighted.mean <- dfa[, .(wfailure = 
+     weighted.mean(failure, freq)), by = temp6] 
> ggplot(temperature.weighted.mean, aes(x = temp6, y = wfailure)) +  
+     geom_bar(stat = 'identity') + xlab('Categorized temperature') +
+     ylab('Weighted mean of disk faults') + theme_bw()

The assumption of a linear relation is clearly not supported. The bar plot suggests entering temperature into the model in this categorized form instead of as the original continuous variable. To see which model is actually better, let's compare them! Since they are not nested, we have to use the AIC, which strongly supports the categorized version:

> model.negbin.4 <- update(model.negbin.0, .~. + capacity_bytes +
+   age_month + temp6 + PendingSector, data = dfa)
> AIC(model.negbin.3,model.negbin.4)
               df      AIC
model.negbin.3 20 23033.88
model.negbin.4 24 22282.47

Well, it was really worth categorizing temperature! Now, let's check the other two continuous predictors as well. Again, we will use freq as a weighting factor:

> weighted.means <- rbind(
+     dfa[, .(l = 'capacity', f = weighted.mean(failure, freq)),
+         by = .(v = capacity_bytes)],
+     dfa[, .(l = 'age', f = weighted.mean(failure, freq)),
+         by = .(v = age_month)])

As in the previous plots, we will use ggplot2 to plot the distribution of these discrete variables, but instead of a bar plot, we will use a step chart to overcome the fixed-width issue of bar charts:

> ggplot(weighted.means, aes(x = v, y = f)) + geom_step() +
+   facet_grid(. ~ l, scales = 'free_x') + theme_bw() +
+   ylab('Weighted mean of disk faults') + xlab('')

The relations are, again, clearly not linear. The case of age is particularly interesting: there seem to be some highly risky periods in the hard drives' lifetime. Now, let's force R to use capacity as a nominal variable (it has only five values, so there is no real need to categorize it), and let's classify age into eight equally sized categories:

> dfa[, capacity_bytes := as.factor(capacity_bytes)]
> dfa[, age8 := cut2(age_month, g = 8)]
> model.negbin.5 <- update(model.negbin.0, .~. + capacity_bytes +
+   age8 + temp6 + PendingSector, data = dfa)

According to the AIC, the last model with the categorized age and capacity is much better, and is the best fitting model so far:

> AIC(model.negbin.5, model.negbin.4)
               df      AIC
model.negbin.5 33 22079.47
model.negbin.4 24 22282.47

If you look at the parameter estimates, you can see that three of the four capacity dummy variables differ significantly from the reference:

> format(tidy(model.negbin.5), digits = 3)
                          term estimate std.error statistic   p.value
1                  (Intercept)  -6.1648  1.84e-01 -3.34e+01 2.69e-245
2                    modelHGST  -2.4747  5.63e-01 -4.40e+00  1.10e-05
3                 modelHitachi  -0.1119  1.21e-01 -9.25e-01  3.55e-01
4             modelST1500DL003  31.7680  7.05e+05  4.51e-05  1.00e+00
5             modelST2000DL001   1.5216  3.81e-01  3.99e+00  6.47e-05
6             modelST2000DL003   2.1055  3.28e-01  6.43e+00  1.29e-10
7             modelST3000DM001   2.4799  9.54e-02  2.60e+01 5.40e-149
8            modelST31500341AS  29.4626  7.05e+05  4.18e-05  1.00e+00
9            modelST31500541AS  29.7597  7.05e+05  4.22e-05  1.00e+00
10           modelST32000542AS  -0.5419  1.93e-01 -2.81e+00  5.02e-03
11           modelST320005XXXX  -2.8404  7.33e-01 -3.88e+00  1.07e-04
12           modelST33000651AS   0.0518  1.66e-01  3.11e-01  7.56e-01
13            modelST4000DM000   1.2243  1.62e-01  7.54e+00  4.72e-14
14            modelST4000DX000 -29.6729  2.55e+05 -1.16e-04  1.00e+00
15                modelTOSHIBA  -1.1658  5.48e-01 -2.13e+00  3.33e-02
16 capacity_bytes1500301910016 -27.1391  7.05e+05 -3.85e-05  1.00e+00
17 capacity_bytes2000398934016   1.8165  2.08e-01  8.73e+00  2.65e-18
18 capacity_bytes3000592982016   2.3515  1.88e-01  1.25e+01  8.14e-36
19 capacity_bytes4000787030016   3.6023  2.25e-01  1.60e+01  6.29e-58
20                 age8[ 5, 9)  -0.5417  7.55e-02 -7.18e+00  7.15e-13
21                 age8[ 9,14)  -0.0683  7.48e-02 -9.12e-01  3.62e-01
22                 age8[14,19)   0.3499  7.24e-02  4.83e+00  1.34e-06
23                 age8[19,25)   0.7383  7.33e-02  1.01e+01  7.22e-24
24                 age8[25,33)   0.5896  1.14e-01  5.18e+00  2.27e-07
25                 age8[33,43)   1.5698  1.05e-01  1.49e+01  1.61e-50
26                 age8[43,60]   1.9105  1.06e-01  1.81e+01  3.59e-73
27                temp6[22,24)   0.7582  5.01e-02  1.51e+01  8.37e-52
28                temp6[24,27)   0.5005  4.78e-02  1.05e+01  1.28e-25
29                temp6[27,30)   0.0883  5.40e-02  1.64e+00  1.02e-01
30                temp6[30,33)  -1.0627  9.20e-02 -1.15e+01  7.49e-31
31                temp6[33,50]  -1.5259  1.37e-01 -1.11e+01  1.23e-28
32            PendingSectoryes   0.1301  4.12e-02  3.16e+00  1.58e-03

The three larger capacities are more likely to cause failures, but the trend is not linear. The effect of age also does not seem to be linear. In general, aging increases the number of failures, but there are some exceptions. For example, drives are significantly more likely to have a failure in the first (reference) age group than in the second one. This finding is plausible, since drives have a higher failure rate at the beginning of their operation. The effect of temperature suggests that the middle temperatures (22-30 degrees Celsius) are more likely to be associated with failures than low or high temperatures. Remember that each effect is controlled for every other predictor.
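As before, exponentiating any of these coefficients yields a rate ratio against the reference category. For instance, taking the estimate of the oldest age group from the table above:

```r
# Drives aged 43-60 months vs. the youngest (reference) age group
exp(1.9105)  # about 6.8 times more expected failures, other predictors fixed
```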

It would also be important to judge the effect-size of different predictors, comparing them to each other. As a picture is worth a thousand words, let's summarize the coefficients with the confidence intervals in one plot.

First, we have to extract the significant terms from the model:

> tmnb5 <- tidy(model.negbin.5)
> str(terms <- tmnb5$term[tmnb5$p.value < 0.05][-1])
 chr [1:22] "modelHGST" "modelST2000DL001" "modelST2000DL003" ...

Then, let's identify the confidence intervals of the coefficients using the confint function and the good old plyr package:

> library(plyr)
> ci <- ldply(terms, function(t) confint(model.negbin.5, t))

Unfortunately, this resulting data frame is not yet complete: we need to add the term names, and let's also extract the grouping variables via a simple regular expression:

> names(ci) <- c('min', 'max')
> ci$term <- terms
> ci$variable <- sub('[A-Z0-9\\]\\[,() ]*$', '', terms, perl = TRUE)
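To see what this substitution keeps, here is the same (properly escaped) pattern applied to a few sample terms: it strips a trailing run of capital letters, digits, brackets, commas, parentheses, and spaces, that is, the factor level, leaving only the variable name prefix:

```r
# Sample coefficient names from the model above
terms <- c('modelHGST', 'age8[ 5, 9)', 'temp6[22,24)',
           'capacity_bytes2000398934016', 'PendingSectoryes')
sub('[A-Z0-9\\]\\[,() ]*$', '', terms, perl = TRUE)
# "model" "age" "temp" "capacity_bytes" "PendingSectoryes"
```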

And now we have the confidence intervals of the coefficients in a nicely formatted dataset, which can be easily plotted by ggplot:

> ggplot(ci, aes(x = factor(term), color = variable)) + 
+     geom_errorbar(aes(ymin = min, ymax = max)) + xlab('') +
+     ylab('Coefficients (95% conf.int)') + theme_bw() + 
+     theme(axis.text.x = element_text(angle = 90, hjust = 1),
+         legend.position = 'top')

It can easily be seen that although each predictor is significant, the sizes of their effects differ strongly. For example, PendingSector has just a slight effect on the number of failures, but age, capacity, and temperature have a much stronger effect, and the hard drive model is the predictor that best differentiates the number of failures.

As we mentioned in the Logistic regression section, different pseudo R-squared measures are available for nonlinear models as well. We again warn you to use these metrics with reservations. Anyway, in our case, they uniformly suggest the model's explanatory power to be pretty good:

> PseudoR2(model.negbin.6)
        McFadden     Adj.McFadden        Cox.Snell       Nagelkerke 
       0.3352654        0.3318286        0.4606953        0.5474952 
McKelvey.Zavoina           Effron            Count        Adj.Count 
              NA        0.1497521        0.9310444       -0.1943522 
             AIC    Corrected.AIC 
   12829.5012999    12829.7044941 