Discrete predictors

So far, we have seen only the simple case where both the response and the predictor variables are continuous. Now, let's generalize the model a bit and enter a discrete predictor. Take the usair data and add x5 (precipitation: the average number of wet days per year) as a predictor with three categories (low, middle, and high levels of precipitation), using 30 and 45 as the cut-points. The research question is how these precipitation groups are associated with the SO2 concentration. The association is not necessarily linear, as the following plot shows:

> plot(y ~ x5, data = usair, cex.lab = 1.5)
> abline(lm(y ~ x5, data = usair), col = 'red', lwd = 2.5, lty = 1)
> abline(lm(y ~ x5, data = usair[usair$x5<=45,]),
+   col = 'red', lwd = 2.5, lty = 3)
> abline(lm(y ~ x5, data = usair[usair$x5 >=30, ]),
+   col = 'red', lwd = 2.5, lty = 2)
> abline(v = c(30, 45), col = 'blue', lwd = 2.5)
> legend('topleft', lty = c(1, 3, 2, 1), lwd = rep(2.5, 4),
+   legend = c('y ~ x5', 'y ~ x5 | x5<=45','y ~ x5 | x5>=30',
+     'Critical zone'), col = c('red', 'red', 'red', 'blue'))
[Figure: scatterplot of SO2 (y) against precipitation (x5), with the three fitted regression lines and the critical zone between the cut-points 30 and 45]

The cut-points 30 and 45 were more or less ad hoc. A more principled way to define optimal cut-points is to use a regression tree. There are various implementations of classification and regression trees in R; a commonly used function is rpart from the package of the same name. A regression tree follows an iterative process that splits the data into partitions and then continues splitting each partition into smaller groups. In each step, the algorithm selects the best split on the continuous precipitation scale, where the best cut-point minimizes the sum of the squared deviations from the group-level SO2 means:

> library(partykit)
> library(rpart)
> plot(as.party(rpart(y ~ x5, data = usair)))
[Figure: regression tree of SO2 (y) by precipitation (x5), with box plots of the SO2 distribution in the terminal nodes]

The interpretation of the preceding result is rather straightforward: if we are looking for two groups that differ the most in SO2, the optimal cut-point is a precipitation level of 45.34; if we are looking for three groups, we have to split the second group further at the cut-point of 30.91, and so on. The four box plots describe the SO2 distribution in the four partitions. These results confirm our previous assumption: we have three precipitation groups that strongly differ in their SO2 concentration.
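
If you prefer to read the cut-points off the fitted object instead of the plot, you can also inspect the rpart fit directly. A minimal sketch (the splits matrix stores the numeric cut-points for x5 in its index column):

> fit <- rpart(y ~ x5, data = usair)
> print(fit)    # split rules with the cut-points and node means
> fit$splits    # the 'index' column holds the numeric cut-points for x5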

Tip

Take a look at Chapter 10, Classification and Clustering, for more details and examples on decision trees.

The following scatterplot also shows that the three groups differ markedly from each other. The SO2 concentration seems to be highest in the middle group, while the other two groups are very similar:

> library(Hmisc)
> usair$x5_3 <- cut2(usair$x5, c(30, 45))
> plot(y ~ as.numeric(x5_3), data = usair, cex.lab = 1.5,
+   xlab = 'Categorized annual rainfall (x5)', xaxt = 'n')
> axis(1, at = 1:3, labels = levels(usair$x5_3))
> lines(tapply(usair$y, usair$x5_3, mean), col = 'red', lwd = 2.5, lty = 1)
> legend('topright', legend = 'Linear prediction', col = 'red',
+   lty = 1, lwd = 2.5)
[Figure: SO2 (y) plotted against the categorized annual rainfall (x5_3), with the group means connected by a red line]

Now, let us refit our linear regression model by adding the three-category precipitation variable to the predictors. Technically, this is done by adding two dummy variables (learn more about this type of variable in Chapter 10, Classification and Clustering) for the second and third groups, as shown in the following table:

 

Categories        Dummy variables
                  first    second
low (0-30)          0        0
middle (30-45)      1        0
high (45+)          0        1
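
You do not have to construct these dummies by hand: R derives them automatically from the factor. As a quick check (a minimal sketch, assuming x5_3 has already been created with cut2 as above), you can inspect the coding R will use:

> contrasts(usair$x5_3)                      # treatment (dummy) coding of the factor
> head(model.matrix(~ x5_3, data = usair))   # the dummy columns as they enter the model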

In R, you can run this model using the glm (Generalized Linear Models) function; with its default Gaussian family it fits the same model as a classic linear regression, and the factor predictor is automatically expanded into the dummy variables described above:

> summary(glmmodel.1 <- glm(y ~ x2 + x3 + x5_3, data = usair[-31, ]))
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-26.926   -4.780    1.543    5.481   31.280  

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       14.07025    5.01682   2.805  0.00817 ** 
x2                 0.05923    0.01210   4.897 2.19e-05 ***
x3                -0.03459    0.01172  -2.952  0.00560 ** 
x5_3[30.00,45.00) 13.08279    5.10367   2.563  0.01482 *  
x5_3[45.00,59.80]  0.09406    6.17024   0.015  0.98792    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 139.6349)

    Null deviance: 17845.9  on 39  degrees of freedom
Residual deviance:  4887.2  on 35  degrees of freedom
AIC: 317.74

Number of Fisher Scoring iterations: 2

The second group (wet days between 30 and 45) has an average SO2 concentration that is about 13.1 units higher than that of the first group, controlling for the other two predictors (x2 and x3). The difference is statistically significant.

In contrast, the third group differs only slightly from the first group (about 0.09 units higher), and this difference is not significant. The three group means thus follow a reversed U-shaped curve. Note that if you had used precipitation in its original continuous form, you would implicitly have assumed a linear relation and would not have discovered this shape. Another important thing to note is that the U-shaped curve here describes the partial association (controlled for x2 and x3), while the crude association, presented in the preceding scatterplot, showed a very similar picture.

The regression coefficients are interpreted as differences between group means, with each group compared to the omitted category (the first one). This is why the omitted category is usually referred to as the reference category, and this way of entering discrete predictors is called reference-category coding. In general, if you have a discrete predictor with n categories, you have to define (n-1) dummies. Of course, if other contrasts are of interest, you can easily modify the model by entering dummies that refer to another choice of (n-1) categories.
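
For example, if you want the middle precipitation group to serve as the reference category instead, a minimal sketch is to re-order the factor levels with relevel and refit the same model:

> usair$x5_3b <- relevel(usair$x5_3, ref = levels(usair$x5_3)[2])
> summary(glm(y ~ x2 + x3 + x5_3b, data = usair[-31, ]))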

Note

If you fit a linear regression with discrete predictors, the regression slopes are the differences in the group means. If you also have other predictors, then the group-mean differences are controlled for these predictors. Remember, the key feature of multiple regression models is that they model partial two-way associations, holding the other predictors fixed.
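
You can verify this yourself: in a model that contains only the discrete predictor, the intercept equals the mean of the reference group and the slopes equal the differences from it. A quick sketch:

> coef(glm(y ~ x5_3, data = usair))    # intercept plus group differences
> tapply(usair$y, usair$x5_3, mean)    # the raw group means for comparison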

You can go further by entering any other type and any number of predictors. If you have an ordinal predictor, it is your decision whether to enter it in its original form, assuming a linear relation, or to form dummies and enter each of them, allowing any type of relation. If you have no background knowledge on how to make this decision, you can try both solutions and compare how well the models fit.
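
As a rough sketch of such a comparison (using the same predictors as before), you can fit precipitation once as a continuous term and once as the three-category factor, and compare the two models, for example, by AIC:

> fit.lin <- glm(y ~ x2 + x3 + x5,   data = usair[-31, ])   # linear relation assumed
> fit.cat <- glm(y ~ x2 + x3 + x5_3, data = usair[-31, ])   # dummies, any shape allowed
> AIC(fit.lin, fit.cat)   # the lower AIC indicates the better-fitting model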
