Chapter 5. Building Models (authored by Renata Nemeth and Gergely Toth)

"All models should be as simple as possible... but no simpler."

– Attributed to Albert Einstein

"All models are wrong... but some are useful."

– George Box

Having loaded and transformed the data, we now turn to building statistical models. Models are representations of reality, and, as the preceding quotations emphasize, they are always simplified representations. Although you can't possibly take everything into account, you should be aware of what to include in and what to exclude from a good model that provides meaningful results.

In this chapter, regression models are introduced through linear regression and standard modeling practice. Generalized Linear Models (GLM) extend linear regression by allowing the response variable to follow other distributions; they are covered in Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth). In all, we will discuss the three most well-known regression models:

  • Linear regression for continuous outcomes (birth weight measured in grams)
  • Logistic regression for binary outcomes (low birth weight versus normal birth weight)
  • Poisson regression for count data (number of low birth weight infants per year or per country)

Although there are many other regression models, such as Cox regression, which we will not discuss here, the logic of building and interpreting them is similar. So, after reading this chapter, you should be able to understand those as well.

By the end of this chapter, you will have learned the most important things about regression models: how to avoid confounding, how to fit a model, how to interpret it, and how to choose the best model among the many different options.

The motivation behind multivariate models

If you would like to measure the strength of association between a response and a predictor, you can choose a simple two-way association measure, such as correlation or the odds ratio, depending on the nature of your data. But, if your aim is to model a complex mechanism by taking into account other predictors as well, you will need regression models.
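As a small illustration of a two-way association measure, the following sketch computes an odds ratio from a 2x2 table. The counts and the function name `odds_ratio` are hypothetical and were chosen only for this example; they do not come from any real data set.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio of a 2x2 table: odds of the outcome among the exposed
    divided by the odds of the outcome among the unexposed."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Hypothetical counts (made up for illustration):
# rows = exposure (for example, smoking mother: yes / no),
# columns = outcome (low birth weight: yes / no).
low_bw_odds_ratio = odds_ratio(30, 70, 15, 85)
print(f"Odds ratio: {low_bw_odds_ratio:.2f}")
```

An odds ratio of 1 would mean no association between exposure and outcome; values above 1 mean that the outcome is more likely among the exposed. Note that this measure looks at one predictor only and cannot account for other variables.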

As Ben Goldacre, the evidence-based columnist for The Guardian, explains in his brilliant TED talk, the strong association between olive oil consumption and young-looking skin does not imply that olive oil is beneficial to our skin. When modeling a complex association structure, we should also control for other predictors, such as smoking status or physical activity, because those who consume more olive oil are more likely to live a healthy life in general, so it may not be the olive oil itself that prevents skin wrinkles. In short, lifestyle is likely to confound the relationship between the variables of interest, making it appear that there might be causality when in fact there is none.

Note

A confounder is a third variable that biases (increases or decreases) the association we are interested in. The confounder is always associated with both the response and the predictor.

If we examine the association between olive oil and skin wrinkles again while fixing the smoking status, that is, by building separate models for smokers and non-smokers, the association may vanish. Holding the confounders fixed is the main idea behind controlling for confounding via regression models.
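The idea of fixing the confounder can be sketched with a tiny data set. The numbers below are hypothetical and were constructed by hand so that olive oil consumption and the wrinkle score are unrelated within each smoking group; the variable names and the helper `pearson` are assumptions of this example as well.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: smokers consume less olive oil and have more wrinkles.
oil_smokers = [1, 2, 3, 1, 2, 3]
wrinkles_smokers = [6, 6, 6, 8, 8, 8]
oil_nonsmokers = [4, 5, 6, 4, 5, 6]
wrinkles_nonsmokers = [2, 2, 2, 4, 4, 4]

# Crude association: ignore smoking status and pool everyone.
r_pooled = pearson(oil_smokers + oil_nonsmokers,
                   wrinkles_smokers + wrinkles_nonsmokers)

# Stratified association: hold smoking status fixed.
r_smokers = pearson(oil_smokers, wrinkles_smokers)
r_nonsmokers = pearson(oil_nonsmokers, wrinkles_nonsmokers)

print(f"pooled: {r_pooled:.2f}")
print(f"smokers only: {r_smokers:.2f}")
print(f"non-smokers only: {r_nonsmokers:.2f}")
```

In the pooled data, the correlation is strongly negative (about -0.79), which would suggest that olive oil protects against wrinkles. Within each smoking group, however, the correlation is zero by construction: the crude association is produced entirely by smoking status.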

Regression models in general are intended to measure the association between a response and a predictor while controlling for other predictors. Potential confounders are entered into the model as predictors, and the regression coefficient of the predictor of interest (the partial coefficient) measures the effect adjusted for the confounders.
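A minimal sketch of this adjustment, assuming a small hand-made data set in which smokers consume less olive oil and have more wrinkles: ordinary least squares is fitted once with olive oil as the only predictor and once with smoking status added. The data, the variable names, and the helpers `solve_linear_system` and `fit_ols` are hypothetical and written only for this illustration; in practice you would use the model-fitting functions of your statistical software.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so that the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(predictors, response):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: the first six subjects smoke, the last six do not.
oil = [1, 2, 3, 1, 2, 3, 4, 5, 6, 4, 5, 6]
smoker = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
wrinkles = [6, 6, 6, 8, 8, 8, 2, 2, 2, 4, 4, 4]

# Crude model: wrinkles ~ oil
crude = fit_ols([oil], wrinkles)
# Adjusted model: wrinkles ~ oil + smoker
adjusted = fit_ols([oil, smoker], wrinkles)

print(f"crude oil coefficient: {crude[1]:.3f}")
print(f"adjusted oil coefficient: {adjusted[1]:.3f}")
print(f"smoker coefficient: {adjusted[2]:.3f}")
```

In the crude model, the coefficient of olive oil is clearly negative (about -1.03). Once smoking status enters the model, the partial coefficient of olive oil drops to essentially zero and the smoker coefficient absorbs the difference between the two groups, which is how this data set was constructed.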
