Linear regression

Regression analysis is a statistical method used to estimate the relationship between continuous variables. Linear regression is the simplest and most frequently used type of regression analysis. The aim of linear regression is to describe the response variable y through a linear combination of one or more explanatory variables x1, x2, x3, …, xp. In other words, the explanatory variables are weighted with constants and then summed. For example, the simplest linear model is y = a + bx, where the two parameters a and b are the intercept and slope, respectively. The model formula for this relationship in R is y ~ x. Note that the parameters are left out of the formula; R estimates them for you. So, if our linear model were y = a + bx + cz, then our model formula would be y ~ x + z.
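To illustrate the formula syntax, here is a minimal sketch using simulated data (the variable names and values are made up for this illustration); note that the parameters never appear in the formula:

```r
# Simulated data: y depends linearly on x and z plus a little noise
set.seed(1)
x <- 1:10
z <- runif(10)
y <- 2 + 3 * x - 1.5 * z + rnorm(10, sd = 0.1)

# y ~ x fits y = a + b*x; a and b are estimated, not written in the formula
fit1 <- lm(y ~ x)

# y ~ x + z fits y = a + b*x + c*z
fit2 <- lm(y ~ x + z)
coef(fit2)  # named estimates for (Intercept), x, and z
```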

Plotting a slope

Before going into a detailed example of linear regression analysis, I think it's worth going over how to plot a slope (that is, the change in y divided by the change in x). Say you wanted to plot the slope of the line through the Cartesian coordinates (3, 7) and (4, 4). The values for x that we need to plot are 3 and 4, and the values for y are 7 and 4. Let's take a look at the following line of code:

> plot(c(3,4), c(7,4), ylab="y", xlab="x", main="Slope from coordinates (3, 7) and (4, 4)", ylim=c(0,10), xlim=c(0,10))

Note

The plot arguments do not take Cartesian coordinates but a vector for the x coordinates and a vector for the y coordinates.

To draw the change in y and the corresponding change in x, we use the lines() function as follows:

> lines(c(3,3), c(7,4))
> lines(c(3,4), c(4,4))

We can add "delta y" and "delta x" labels to the plot using the text() function, whose first two arguments specify where the text should be placed, that is, at x = 2 and y = 5.5 for delta y and at x = 3.5 and y = 3.5 for delta x. Let's take a look at this in the following lines of code:

>text(2,5.5, "delta y") 
>text(3.5,3.5, "delta x")

#To plot a red line of width=3 between the two coordinates
> lines(c(3,4), c(7,4), col="red", lwd=3)

To calculate the slope b, we divide the change in y by the corresponding change in x, which in this case is (4-7)/(4-3), as follows:

> (4-7)/(4-3)
[1] -3

Then, we can calculate the intercept by solving for a in the equation y = a + bx. Since we know that y = 4 and x = 4 from our coordinates (4, 4), and we also know that our slope is -3, we can solve for a using a = y - bx. Let's take a look at this in the following lines of code:

> 4 - (-3)*(4)
[1] 16

Now we can add the slope to our plot using the abline() function, where the first argument specifies the intercept and the second one specifies the slope as follows:

> abline(16, -3)

The result is shown in the following plot:

Plotting a slope
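The slope and intercept calculations above generalize to any pair of points. As a small sketch (the line_through helper function is hypothetical, not part of base R):

```r
# Hypothetical helper: slope and intercept of the line through two points
line_through <- function(x1, y1, x2, y2) {
  b <- (y2 - y1) / (x2 - x1)  # slope: change in y over change in x
  a <- y1 - b * x1            # intercept: solve a = y - b*x
  c(intercept = a, slope = b)
}

line_through(3, 7, 4, 4)  # intercept 16, slope -3, as in the worked example
```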

Now, let's look at an example of how to perform linear regression in R. We will use data generated from a quantitative real-time polymerase chain reaction (PCR) experiment to create a standard curve to quantify GAPDH expression. In this example, we expect the relationship between the quantity of RNA and the threshold cycle (Ct) values to be linear. The Ct values were measured in triplicate and stored in the columns A1, A2, and A3. From these data we can get the maximum likelihood estimates for the slope and intercept of our standard curve by performing regression analysis.

First, we load the data into a data frame and check its structure to make sure that the RNA_ng, A1, A2, and A3 columns are stored as numeric vectors, so that R will treat them as continuous variables in our linear regression analysis. Let's take a look at this in the following lines of code:

> gapdh.qPCR <- read.table(header=TRUE, text='
 GAPDH  RNA_ng  A1  A2  A3
std_curve  50  16.5  16.7  16.7
std_curve  10  19.3  19.2  19
std_curve  2  21.7  21.5  21.2
std_curve  0.4  24.5  24.1  23.5
std_curve  0.08  26.7  27  26.5
std_curve  0.016  36.5  36.4  37.2
 ')
> str(gapdh.qPCR)
'data.frame':  6 obs. of  5 variables:
 $ GAPDH : Factor w/ 1 level "std_curve": 1 1 1 1 1 1
 $ RNA_ng: num  50 10 2 0.4 0.08 0.016
 $ A1    : num  16.5 19.3 21.7 24.5 26.7 36.5
 $ A2    : num  16.7 19.2 21.5 24.1 27 36.4
 $ A3    : num  16.7 19 21.2 23.5 26.5 37.2

Since A1, A2, and A3 are replicate measurements of a single variable, Ct_Value, let's transform our gapdh.qPCR data frame from wide to long format using the melt() function from the reshape2 package. Let's take a look at this in the following lines of code:

> library("reshape2")
> gapdh.qPCR <- melt(gapdh.qPCR, id.vars=c("GAPDH", "RNA_ng"), value.name="Ct_Value")
> str(gapdh.qPCR)
'data.frame':  18 obs. of  4 variables:
 $ GAPDH   : Factor w/ 1 level "std_curve": 1 1 1 1 1 1 1 1 1 1 ...
 $ RNA_ng  : num  50 10 2 0.4 0.08 0.016 50 10 2 0.4 ...
 $ variable: Factor w/ 3 levels "A1","A2","A3": 1 1 1 1 1 1 2 2 2 2 ...
 $ Ct_Value: num  16.5 19.3 21.7 24.5 26.7 36.5 16.7 19.2 21.5 24.1 ...

By using the attach() function, we can simplify subsequent commands by referring to the gapdh.qPCR columns by name only (for example, RNA_ng) instead of using the long form (gapdh.qPCR$RNA_ng) or including the data argument (data=gapdh.qPCR). Let's take a look at the following lines of code:

> attach(gapdh.qPCR)
> names(gapdh.qPCR)
[1] "GAPDH"    "RNA_ng"   "variable" "Ct_Value"
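If you prefer not to modify the search path (attach() can mask other objects with the same names), with() is a common alternative that evaluates an expression with the data frame's columns in scope. A minimal sketch using a cut-down version of the data:

```r
# A few rows of the standard curve data, enough to illustrate with()
gapdh.small <- data.frame(RNA_ng = c(50, 10, 2),
                          Ct_Value = c(16.5, 19.3, 21.7))

# Columns are visible inside the expression; no attach()/detach() needed
with(gapdh.small, mean(Ct_Value))
```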

Now we can easily plot the relationship between our Ct values and the quantity of RNA by plotting Ct_Value as a function of RNA_ng as follows:

> #plots two graphs side by side
> par(mfrow=c(1,2)) 
> plot(RNA_ng, Ct_Value)

As you can see, if we plot the data as is, it seems to follow a curve, and we would like to avoid fitting a more complicated model in order to follow the principle of parsimony. So let's log-transform the RNA_ng explanatory variable as follows:

> plot(log(RNA_ng), Ct_Value)

The result is shown in the following plot:

Ct_Value plotted against RNA_ng (left) and log(RNA_ng) (right)

Now, our data is ready to fit a linear model using the lm() function as follows:

> lm(Ct_Value ~ log(RNA_ng))

Call:
lm(formula = Ct_Value ~ log(RNA_ng))

Coefficients:
(Intercept)  log(RNA_ng)  
      23.87        -2.23

R reports the intercept a and slope b as coefficients. In this model, y = a + bx, the maximum likelihood estimate for a is 23.87 and b is -2.23.
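Rather than reading the estimates off the printed output, you can also extract them programmatically with coef(). A sketch with toy data standing in for the qPCR example:

```r
# Toy data roughly following y = a + b*x
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)

fit <- lm(y ~ x)
a <- coef(fit)[["(Intercept)"]]  # intercept estimate
b <- coef(fit)[["x"]]            # slope estimate
```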

If we would like to get more information on our model, such as the residual standard error or the adjusted R-squared value, we can save the model as an object and use the summary() function to print these details and more. Let's take a look at this in the following lines of code:

> model <- lm(Ct_Value ~ log(RNA_ng))
> summary(model)

Call:
lm(formula = Ct_Value ~ log(RNA_ng))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0051 -1.7165 -0.1837  1.4992  4.1063 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  23.8735     0.5382   44.36  < 2e-16 ***
log(RNA_ng)  -2.2297     0.1956  -11.40 4.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.282 on 16 degrees of freedom
Multiple R-squared:  0.8903,  Adjusted R-squared:  0.8835 
F-statistic: 129.9 on 1 and 16 DF,  p-value: 4.328e-09
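The summary object is a list, so the individual quantities printed above can also be extracted by name (a sketch with toy data in place of the qPCR data):

```r
# Toy model; summary() returns a list whose components can be accessed directly
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)
s <- summary(lm(y ~ x))

s$adj.r.squared  # adjusted R-squared
s$sigma          # residual standard error
s$coefficients   # matrix of estimates, std. errors, t values, and p values
```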

You can also see the Anova table for the analysis using the summary.aov() function as follows:

> summary.aov(model)
            Df Sum Sq Mean Sq F value   Pr(>F)    
log(RNA_ng)  1  676.1   676.1   129.9 4.33e-09 ***
Residuals   16   83.3     5.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can check that your data meet the assumptions of a linear model, namely that the variance is constant and the errors are normally distributed, by plotting the model object as follows:

> par(mfrow=c(2,2)) 
> plot(model)

The result is shown in the following plot:

Diagnostic plots for the fitted linear model

In the preceding plot, the four graphs allow you to inspect your model in more detail. The first plot shows the residuals against the fitted values. For your model to be valid, the residuals (or error terms) must have a mean of zero and constant variance, which means they should be scattered randomly around zero. In this example, the errors seem to be scattered randomly. Of course, with so few points it's not so clear, but ideally you want to avoid situations where the residuals increase or decrease with the fitted values, since such patterns indicate that the errors may not have constant variance.

The second plot is a normal Q-Q plot, which allows us to check that the errors are normally distributed. Ideally, your points should fall along a straight line; deviations from it indicate skewed or heavy-tailed distributions.

The third plot is a variant of the first, except that it shows the square root of the standardized residuals against the fitted values. Again, you want to avoid trends in the data. For example, we would have a problem if the points formed a triangle in which the scatter of the residuals increases with the fitted values.

Finally, the last plot shows the residuals against leverage, with contours of Cook's distance, for each value of the response variable. Cook's distance highlights the points with the greatest influence on the parameter estimates. In this case, the point with the most influence on our model is observation 18. To see the values for this point, we just need to use its index, which is 18, as follows:

> RNA_ng[18]
[1] 0.016
> Ct_Value[18]
[1] 37.2
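Instead of reading the influential point off the plot, its index can also be found numerically with cooks.distance() (a sketch with toy data in which the fifth point is a deliberate outlier):

```r
# Toy data: the fifth point deviates strongly from the linear trend
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 20)

fit <- lm(y ~ x)
d <- cooks.distance(fit)  # one Cook's distance per observation
which.max(d)              # index of the most influential point
```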

Say we wanted to see the effect of removing this point on the adjusted R-squared value by updating our model and getting the summary as follows:

> model2 <- update(model, subset=(Ct_Value !=37.2))
> summary(model2)

Call:
lm(formula = Ct_Value ~ log(RNA_ng), subset = (Ct_Value != 37.2))

Residuals:
   Min     1Q Median     3Q    Max 
-2.373 -1.422 -0.470  1.033  4.275 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  23.6135     0.4970   47.51  < 2e-16 ***
log(RNA_ng)  -2.0825     0.1878  -11.09 1.26e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.047 on 15 degrees of freedom
Multiple R-squared:  0.8913,  Adjusted R-squared:  0.8841 
F-statistic:   123 on 1 and 15 DF,  p-value: 1.259e-08

As you can see, by removing the (0.016, 37.2) point from the model, we only slightly increase the adjusted R-squared value from 0.8835 to 0.8841. We can also look at the effect of this change on the slope of the curve by using the following lines of code:

> model2

Call:
lm(formula = Ct_Value ~ log(RNA_ng), subset = (Ct_Value != 37.2))

Coefficients:
(Intercept)  log(RNA_ng)  
     23.613       -2.083

As you can see, the slope went from -2.23 to -2.083. Here, we might decide simply to note that the point was influential, or to gather more data. However, if removing the data point significantly improved the model fit, we would leave the point out and use that model instead.
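In practice, the point of a standard curve is to estimate the RNA quantity of an unknown sample from its measured Ct value. A sketch using the coefficients fitted above (the Ct value of 20.5 for the unknown sample is hypothetical):

```r
# Coefficients from the fitted standard curve: Ct = a + b * log(RNA_ng)
a <- 23.87
b <- -2.23

# Hypothetical Ct measured for an unknown sample
ct_unknown <- 20.5

# Invert the model to solve for the quantity: RNA_ng = exp((Ct - a) / b)
rna_est <- exp((ct_unknown - a) / b)
rna_est  # estimated RNA quantity in ng
```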
