Chapter 3. Linear Models

This chapter covers linear models, which are probably the most commonly used statistical methods to study the relationships between variables. The generalized linear model section delves into a bit more detail than typical R books, discussing the nature of link functions and canonical link functions. We will cover the following topics:

  • Linear regression
  • Linear model fits
  • Analysis of variance models
  • Generalized linear models
  • Link functions and canonical link functions
  • Generalized additive models
  • Principal components analysis
  • Clustering
  • Discriminant analysis

An overview of statistical modeling

In order to explore the relationship between data and a set of experimental conditions, we often rely on statistical modeling. One of the strengths of R is that it lets you fit your data to a variety of models, which you can easily optimize using several built-in functions and arguments. Although picking the best model to represent your data can be overwhelming, it is important to remember the principle of parsimony when choosing a model: you should only include an explanatory variable in a model if it significantly improves the fit of the model. Therefore, our ideal model will try to fulfill most of the criteria in this list:

  • Contain n-1 parameters instead of n parameters
  • Contain k-1 explanatory variables instead of k variables
  • Be linear instead of curved
  • Not contain interactions between factors

In other words, we can simplify our model by removing non-significant interaction terms and explanatory variables, and by grouping together factor levels that do not differ from one another or add any new information to the model. In this chapter, we will go over the steps to fit your data to linear models, and in the following chapter, we will go over nonlinear methods to explore your data.
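A quick way to check whether an extra explanatory variable earns its place is to compare nested models with the anova() function. Here is a minimal sketch using simulated data; the variables x1, x2, and y are invented for illustration:

> # Simulated example: y truly depends on x1 only
> set.seed(1)
> x1 <- rnorm(50); x2 <- rnorm(50)
> y <- 2 + 3*x1 + rnorm(50)
> full <- lm(y ~ x1 + x2)
> reduced <- lm(y ~ x1)
> anova(reduced, full)  # keep x2 only if this F-test is significant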

Model formulas

Before we begin, we will go over the model formula and the conventions used in R for statistical modeling. This formula is then passed as an argument to the function that fits the model. A detailed overview of statistical modeling in R is available in Chapter 7 of Michael J. Crawley's book Statistics: An Introduction Using R. Briefly, the basic structure of a model in R is specified as follows:

response_variable ~ explanatory_variable

The tilde (~) symbol signifies that the response variable is modeled as a function of the explanatory variable. The right side of the model formula specifies the following points:

  • The number of continuous or categorical explanatory variables and their attributes
  • Potential interactions between the explanatory variables
  • Nonlinear terms (if required)
  • The offset or error terms (in some special cases)
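
To make this concrete, here is a minimal example using R's built-in cars dataset, where stopping distance is modeled as a function of speed:

> fit <- lm(dist ~ speed, data = cars)  # dist is the response, speed the explanatory variable
> summary(fit)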

It is important to remember that both the response and explanatory variables can appear as transformations, powers, or polynomials. Also, mathematical symbols have a different meaning in model formulas than they do in arithmetic expressions. Let's take a look at the symbols used for statistical modeling in R in the following table:

Symbol   Explanation
+        Adds this explanatory variable
-        Deletes this explanatory variable
*        Includes these explanatory variables and their interactions
/        Nests these explanatory variables
|        Denotes a condition; for example, y ~ x | z reads y as a function of x given z
:        Denotes an interaction; for example, A:B indicates the two-way interaction between A and B
I()      Overrides the interpretation of a model symbol to use it as an arithmetic operator

Here are some examples to help you understand how these symbols are used in statistical modeling.

For example, y ~ 1/x fits x nested within the intercept (which expands to 1 + x, a straight line in x), whereas y ~ I(1/x) will fit 1/x as an explanatory variable.
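This difference is easy to verify with a small simulation; the data below are made up for illustration:

> set.seed(10)
> x <- runif(20, 1, 10)
> y <- 5 + 2/x + rnorm(20, sd = 0.1)
> coef(lm(y ~ 1/x))     # expands to 1 + x, so this fits a straight line in x
> coef(lm(y ~ I(1/x)))  # I() protects the arithmetic and fits 1/x directly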

For categorical variables A and B, y ~ A/B means fit A, plus B nested within A.

This formula can also be written as follows:

y ~ A + A:B

y ~ A + B %in% A
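
You can confirm that these formulas describe the same model by comparing their fitted values; the factors A and B and the response y below are simulated for illustration:

> A <- gl(2, 8)                      # a factor with 2 levels
> B <- gl(4, 2, 16)                  # a factor with 4 levels
> y <- rnorm(16)
> m1 <- lm(y ~ A/B)
> m2 <- lm(y ~ A + A:B)
> m3 <- lm(y ~ A + B %in% A)
> all.equal(fitted(m1), fitted(m2))  # TRUE
> all.equal(fitted(m1), fitted(m3))  # TRUE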

You can also specify the level of interactions to include using the ^ operator. For example, by writing (A+B+C)^3, you are telling R to fit all the main effects and all interactions up to order 3. In other words, (A+B+C)^3 can also be written as A*B*C or A+B+C+A:B+A:C+B:C+A:B:C. If you want to exclude the three-way interaction from A*B*C, you can write (A+B+C)^2, which means fit all the main effects and the two-way interactions but not the three-way interaction. Therefore, (A+B+C)^2 can also be written as A*B*C - A:B:C. The following examples summarize these model formulas and their interpretation in R:

To fit A, plus B nested within A:

y ~ A/B
y ~ A + A:B
y ~ A + B %in% A

To fit all the main effects and all interactions:

(A+B+C)^3
A*B*C
A+B+C+A:B+A:C+B:C+A:B:C

To fit all the main effects and the two-way interactions, excluding the three-way interaction:

(A+B+C)^2
A*B*C - A:B:C
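
One way to check these equivalences yourself, assuming three hypothetical two-level factors A, B, and C, is to compare the design matrices that R generates for each formula:

> d <- expand.grid(A = gl(2, 1), B = gl(2, 1), C = gl(2, 1))
> X1 <- model.matrix(~ (A + B + C)^2, d)
> X2 <- model.matrix(~ A*B*C - A:B:C, d)
> identical(colnames(X1), colnames(X2))  # TRUE: both drop only A:B:C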

Interactions between explanatory variables

Another thing to bear in mind when writing model formulas is that categorical variables are fitted differently from continuous variables. For example, y ~ A*B means evaluate the A and B main effect means and the A:B interaction mean (that is, A + B + A:B). The number of interaction terms for A:B is (a - 1)(b - 1) for the two categorical variables, where a and b are the number of factor levels of A and B, respectively. So, if factor A has two levels and factor B has four levels, R will estimate (2-1)(4-1) = 3 parameters for the A:B interaction. In contrast, if x and z are continuous variables, then y ~ x*z tells R to fit x + z + x:z, where the x:z interaction behaves like a new variable computed from the point-wise product of the two vectors x and z, as if it had been explicitly calculated with xz.prod <- x*z. Therefore, y ~ x*z can be rewritten as y ~ x + z + xz.prod.
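
A short simulation shows that the two formulations give identical fits; x, z, and y here are invented:

> set.seed(42)
> x <- rnorm(30); z <- rnorm(30)
> y <- 1 + x + z + 0.5*x*z + rnorm(30)
> xz.prod <- x*z                 # the point-wise product
> coef(lm(y ~ x*z))              # fits x + z + x:z
> coef(lm(y ~ x + z + xz.prod))  # same estimates; x:z is just the product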

In cases where we have interactions between categorical and continuous variables, such as y ~ A*x, where x is a continuous variable and A is a categorical variable with n levels, R will fit n regression equations and estimate 2n parameters from the data: n slopes and n intercepts.
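
For instance, with a hypothetical two-level factor A and a continuous variable x, R estimates 2 × 2 = 4 parameters, reported by default as a baseline intercept and slope plus their differences for the second level:

> A <- gl(2, 15, labels = c("ctrl", "trt"))
> x <- runif(30)
> y <- ifelse(A == "ctrl", 1 + 2*x, 3 - x) + rnorm(30, sd = 0.2)
> coef(lm(y ~ A*x))  # (Intercept), Atrt, x, and Atrt:x -> 2 intercepts, 2 slopes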

Error terms

You can also include an error function as part of a model formula when there is nesting or pseudoreplication. For example, to model a three-factorial experiment with categorical variables A, B, and C, with three plot sizes and three different error variances (one for each plot size), you would write y ~ A*B*C + Error(A/B/C).
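
The following is a minimal simulated sketch of the Error() syntax. Note that a hypothetical block term is added at the top of the nesting here, purely so that each error stratum retains residual degrees of freedom; the factors and the response are made up:

> d <- expand.grid(block = gl(4, 1), A = gl(2, 1), B = gl(2, 1), C = gl(2, 1))
> d$y <- rnorm(nrow(d))  # placeholder response
> summary(aov(y ~ A*B*C + Error(block/A/B), data = d))  # one ANOVA table per error stratum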

The intercept as parameter 1

The null model has just one parameter, a constant, and indicates that y does not depend on any of the explanatory variables provided. The formula to fit your data to the null model is y ~ 1, which simply fits the overall mean of y. However, removing parameter 1 from a model with a categorical variable has a different effect. For example, if genotype is a three-level categorical variable, the model formula y ~ genotype - 1 will report the mean for each genotype level in the summary table, rather than an intercept (the first-level mean) plus differences.

When fitting a linear model, y ~ x - 1 specifies a line through the origin. In other words, by removing parameter 1, we are forcing the regression line through the origin.
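
Both behaviors are easy to see with simulated data; genotype, x, and y below are invented:

> genotype <- gl(3, 10, labels = c("WT", "mutA", "mutB"))
> y <- rnorm(30, mean = rep(c(5, 7, 6), each = 10))
> coef(lm(y ~ genotype))      # intercept plus differences from the first level
> coef(lm(y ~ genotype - 1))  # one mean per genotype level
> x <- runif(30)
> coef(lm(y ~ x - 1))         # a single slope: the line is forced through the origin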

Updating a model

You can easily modify a model in R by using the update() function. The . character on the right-hand side of the tilde stands for the model formula as it is. So, if your original model is defined as model <- lm(y ~ A*B*C), you can remove the A:B interaction term as follows:

> model2 <- update(model, ~ . - A:B)  # no need to repeat the response variable y
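
Assuming the hypothetical model objects above, you can then test whether the simplification is justified:

> anova(model2, model)  # F-test for dropping the A:B interaction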

Note

For more details and information on the formula class in R, you can consult the help page by entering ?formula at the R prompt.
