Chapter 6. Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth)

Linear regression models, which we covered in the previous chapter, can handle continuous responses that have a linear association with the predictors. In this chapter, we will extend these models to responses that follow other distributions. But before getting our hands dirty with generalized linear models, we need to pause and discuss regression models in general.

The modeling workflow

First, some words about the terminology. Statisticians call the Y variable the response, the outcome, or the dependent variable. The X variables are often called the predictors, the explanatory variables, or the independent variables. Some of the predictors are of main interest to us, while others are added only because they are potential confounders. Continuous predictors are sometimes called covariates.

The generalized linear model (GLM) is a generalization of linear regression. A GLM (fitted in R with the glm function from the stats package) relates the predictors to the response variable via a link function, and allows the magnitude of the variance of each measurement to be a function of its predicted value.
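As a minimal sketch of these two ingredients, the following fits a Poisson model with glm. The choice of the built-in warpbreaks dataset and of the Poisson family is ours, made purely for illustration; it is not taken from the chapter:

```r
# Illustrative assumption: warpbreaks (built-in) holds counts of warp breaks,
# so a Poisson family with a log link is a natural first attempt.
fit <- glm(breaks ~ wool + tension, data = warpbreaks,
           family = poisson(link = "log"))

# Coefficients are on the scale of the link function (log of the expected count)
coef(fit)

# The family object stores the link and the variance function;
# for the Poisson family the variance equals the predicted mean
fit$family$link
fit$family$variance(c(1, 5, 10))
```

Swapping the family argument (for example, to binomial or Gamma) changes both the link and the assumed mean-variance relationship.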

Whatever regression model you use, the main question is, "in what form can we add continuous predictors to the model?" If the relationship between the response and the predictor does not meet the model assumptions, you can transform the variable in some way. For example, a logarithmic or quadratic transformation in a linear regression model is a very common way to solve the problem of non-linear relationships between the independent and dependent variables via linear formulas.
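A short sketch of both transformations, assuming the built-in mtcars data and the mpg and hp columns as an arbitrary example pair:

```r
# Illustrative assumption: model fuel consumption (mpg) by horsepower (hp)
fit_lin  <- lm(mpg ~ hp, data = mtcars)            # untransformed predictor
fit_log  <- lm(mpg ~ log(hp), data = mtcars)       # logarithmic transformation
fit_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)  # quadratic term; I() protects ^

# All three models are still linear in their coefficients
sapply(list(linear = fit_lin, log = fit_log, quadratic = fit_quad),
       function(m) summary(m)$r.squared)
```

Note that I() is needed around hp^2 because ^ has a special meaning inside R formulas.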

Or, you can transform the continuous predictor into a discrete one by subdividing its range in a proper way. When choosing the classes, one of the best options is to follow some convention, like choosing 18 as a cut-point in the case of age. Or you can follow a more technical way, for example, by categorizing the predictor into quantiles. An advanced way to go about this process would be to use some classification or regression trees, on which you will be able to read more in Chapter 10, Classification and Clustering.
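Both ways of categorizing can be sketched with the cut function. The age vector below is simulated for illustration only, and the labels are our own:

```r
set.seed(42)
age <- sample(5:80, 200, replace = TRUE)  # simulated ages, not real data

# Conventional cut-point: 18 (left-closed intervals, so 18 counts as adult)
age_conv <- cut(age, breaks = c(-Inf, 18, Inf), right = FALSE,
                labels = c("under 18", "18 and over"))

# Technical cut-points: quartiles of the observed values
age_q <- cut(age, breaks = quantile(age, probs = seq(0, 1, 0.25)),
             include.lowest = TRUE)

table(age_conv)
table(age_q)
```

Setting include.lowest = TRUE keeps the minimum value from falling outside the first interval and becoming NA.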

Discrete predictors can be added to the model as dummy variables using reference category coding, as we have seen in the previous chapter for linear regression models.
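As a brief reminder of what reference category coding looks like, here is a sketch using the cyl column of the built-in mtcars data (our choice of example). R creates the dummy variables automatically once the predictor is a factor:

```r
d <- transform(mtcars, cyl = factor(cyl))  # levels "4", "6", "8"; "4" is the reference

# The design matrix holds one dummy column per non-reference level
mm <- model.matrix(~ cyl, data = d)
head(mm)

# relevel() changes the reference category, here to 8 cylinders
d$cyl8ref <- relevel(d$cyl, ref = "8")
coef(lm(mpg ~ cyl8ref, data = d))
```

Each coefficient is then read as the difference from the reference category.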

But how do we actually build a model? We have compiled a general workflow to answer this question:

  1. First, fit the model with the main predictors and all the relevant confounders, and then reduce the number of confounders by dropping the non-significant ones. There are some automatic procedures (such as backward elimination) for this.

    Note

    The given sample size limits the number of predictors. A rule of thumb for the required sample size is that you should have at least 20 observations per predictor.

  2. Decide whether to use the continuous variables in their original or categorized form.
  3. Try to achieve a better fit by testing for non-linear relationships, if they are pragmatically relevant.
  4. Finally, check the model assumptions.
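The first step of this workflow can be sketched with the step function from the stats package. Keep in mind two assumptions: step eliminates terms by AIC rather than by significance tests, so it is a stand-in for the p-value-based procedure described above, and treating wt as the main predictor in mtcars is purely our illustrative choice:

```r
# Full model: main predictor (wt, by assumption) plus candidate confounders
full <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Observations per predictor, to compare against the rule of thumb in the note
nrow(mtcars) / (length(coef(full)) - 1)

# AIC-based backward elimination; the lower scope keeps wt in the model
reduced <- step(full, scope = list(lower = ~ wt),
                direction = "backward", trace = 0)
formula(reduced)

# For the last step, plot(reduced) draws the standard diagnostic plots
```

With only 32 rows, this toy example falls well short of the rule of thumb, which is one reason to treat its result with caution.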

And how do we find the best model? Is it as simple as the better the fit, the better the model? Unfortunately not. Our aim is to find the best fitting model, but with as few predictors as possible. A good model fit and a low number of independent variables work against each other.

As we have seen earlier, entering new predictors into a linear regression model never decreases the value of R-squared (in practice it almost always increases it), and this may result in an over-fitted model. Overfitting means that the model describes the sample with its random noise, instead of the underlying data-generating process. Overfitting occurs, for example, when we have too many predictors in the model for its sample size to accommodate.
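This effect is easy to reproduce. In the following sketch, two predictors of pure random noise (simulated, and named noise1 and noise2 by us) are added to a model on the built-in mtcars data:

```r
set.seed(1)
d <- mtcars
d$noise1 <- rnorm(nrow(d))  # simulated noise, unrelated to mpg by construction
d$noise2 <- rnorm(nrow(d))

base  <- lm(mpg ~ wt, data = d)
noisy <- lm(mpg ~ wt + noise1 + noise2, data = d)

# R-squared cannot go down when predictors are added
c(base = summary(base)$r.squared, noisy = summary(noisy)$r.squared)

# Adjusted R-squared penalizes the extra terms and may decrease
c(base = summary(base)$adj.r.squared, noisy = summary(noisy)$adj.r.squared)
```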

Consequently, the best model gives the desired level of fit with as few predictors as possible. The Akaike information criterion (AIC) is one measure that takes into account both fit and parsimony; lower values indicate a better trade-off. We highly recommend using it when comparing different models, which is very easy via the AIC function from the stats package.
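A minimal sketch of such a comparison, with three nested candidate models on the built-in mtcars data (the predictors are chosen arbitrarily for illustration):

```r
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)

# AIC() accepts several fitted models and returns one row per model,
# with the number of estimated parameters (df) and the AIC value
aic <- AIC(m1, m2, m3)
aic

# Name of the model with the lowest AIC
rownames(aic)[which.min(aic$AIC)]
```

The models being compared should be fitted to the same observations of the same response; otherwise the AIC values are not comparable.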
