Regression analysis is a statistical method used to estimate the relationship among continuous variables. Linear regression is the simplest and most frequently used type of regression analysis. The aim of linear regression is to describe the response variable y through a linear combination of one or more explanatory variables x1, x2, x3, …, xp. In other words, the explanatory variables are weighted with constants and then summed. For example, the simplest linear model is y = a + bx, where the two parameters a and b are the intercept and slope, respectively. The model formula for this relationship in R is y ~ x. Note that the parameters a and b are left out of the formula; R estimates them for us. So, if our linear model were y = a + bx + cz, our model formula would be y ~ x + z.
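As a quick illustration of this formula notation, here is a minimal sketch using simulated data. The variable names and the coefficients 2, 3, and -1.5 are made up for illustration; the point is that the parameters never appear in the formula, only the variables do:

```r
# Simulated example: y = 2 + 3*x - 1.5*z plus a little noise
# (variable names and coefficients are made up for illustration)
set.seed(1)
x <- rnorm(50)
z <- rnorm(50)
y <- 2 + 3 * x - 1.5 * z + rnorm(50, sd = 0.1)

# The formula y ~ x + z corresponds to the model y = a + b*x + c*z;
# lm() estimates a, b, and c for us
fit <- lm(y ~ x + z)
coef(fit)  # estimates should be close to 2, 3, and -1.5
```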
Before going into a detailed example of linear regression analysis, I think it's worth going over how to plot a slope (or the change in y divided by the change in x). Say you wanted to plot a slope given the Cartesian coordinates (3, 7) and (4, 4). The values for x that we need to plot are 3 and 4 and the values for y are 7 and 4. Let's take a look at the following lines of code:
> plot(c(3,4), c(7,4), ylab="y", xlab="x", main="Slope from coordinates (3, 7) and (4, 4)", ylim=c(0,10), xlim=c(0,10))
To draw the change in y and the consequent change in x, we use the lines() function as follows:
> lines(c(3,3), c(7,4))
> lines(c(3,4), c(4,4))
We can add "delta y" and "delta x" to the plot by using the text() function, where the first two arguments specify the location where the text should be added, that is, at x = 2 and y = 5.5 for delta y and at x = 3.5 and y = 3.5 for delta x. Let's take a look at this in the following lines of code:
> text(2, 5.5, "delta y")
> text(3.5, 3.5, "delta x")
> # To plot a red line of width=3 between the two coordinates
> lines(c(3,4), c(7,4), col="red", lwd=3)
To calculate the slope b, we divide the change in y by the corresponding change in x, which in this case is (4-7)/(4-3), as follows:
> (4-7)/(4-3)
[1] -3
Then, we can calculate the intercept by solving for a in the y = a + bx equation. Since we know that y = 4 and x = 4 from our coordinates (4, 4), and we also know that our slope is -3, we can solve for a using the a = y - bx equation. Let's take a look at this in the following lines of code:
> 4 - (-3)*(4)
[1] 16
Now we can add the slope to our plot using the abline() function, where the first argument specifies the intercept and the second one specifies the slope, as follows:
> abline(16, -3)
The result is shown in the following plot:
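As a sanity check, a straight line through exactly two points is fit perfectly by least squares, so passing the same two coordinates to lm() recovers the intercept and slope we computed by hand:

```r
# lm() fit through the two coordinates (3, 7) and (4, 4);
# a line through two points fits exactly, so the coefficients
# match the hand calculation: intercept 16, slope -3
x <- c(3, 4)
y <- c(7, 4)
coef(lm(y ~ x))
```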
Now, let's look at an example to perform linear regression in R. We will use data generated from a quantitative real-time polymerase chain reaction (PCR) experiment to create a standard curve to quantify GAPDH expression. In this example, we expect the relationship between the quantity of RNA and the threshold cycle values (Ct) to be linear. The Ct values were measured in triplicate and stored in columns A1, A2, and A3. From this data we can get the maximum likelihood estimates for the slope and intercept of our standard curve by performing regression analysis.
First, we load the data into a data frame and check its structure to make sure the data in the RNA_ng, A1, A2, and A3 columns are stored as numeric vectors so that R will treat them as continuous variables in our linear regression analysis. Let's take a look at this in the following lines of code:
> gapdh.qPCR <- read.table(header=TRUE, text='
GAPDH      RNA_ng  A1    A2    A3
std_curve  50      16.5  16.7  16.7
std_curve  10      19.3  19.2  19
std_curve  2       21.7  21.5  21.2
std_curve  0.4     24.5  24.1  23.5
std_curve  0.08    26.7  27    26.5
std_curve  0.016   36.5  36.4  37.2
')
> str(gapdh.qPCR)
'data.frame': 6 obs. of 5 variables:
 $ GAPDH : Factor w/ 1 level "std_curve": 1 1 1 1 1 1
 $ RNA_ng: num  50 10 2 0.4 0.08 0.016
 $ A1    : num  16.5 19.3 21.7 24.5 26.7 36.5
 $ A2    : num  16.7 19.2 21.5 24.1 27 36.4
 $ A3    : num  16.7 19 21.2 23.5 26.5 37.2
Since A1, A2, and A3 can be considered levels of one explanatory variable, Ct_Value, let's transform our gapdh.qPCR data frame from wide to long format using the melt() function from the reshape2 package. Let's take a look at this in the following lines of code:
> library("reshape2")
> gapdh.qPCR <- melt(gapdh.qPCR, id.vars=c("GAPDH", "RNA_ng"), value.name="Ct_Value")
> str(gapdh.qPCR)
'data.frame': 18 obs. of 4 variables:
 $ GAPDH   : Factor w/ 1 level "std_curve": 1 1 1 1 1 1 1 1 1 1 ...
 $ RNA_ng  : num  50 10 2 0.4 0.08 0.016 50 10 2 0.4 ...
 $ variable: Factor w/ 3 levels "A1","A2","A3": 1 1 1 1 1 1 2 2 2 2 ...
 $ Ct_Value: num  16.5 19.3 21.7 24.5 26.7 36.5 16.7 19.2 21.5 24.1 ...
By using the attach() function, we can refer to the gapdh.qPCR columns by name only (for example, RNA_ng) instead of writing them out the long way (gapdh.qPCR$RNA_ng) or having to include the data argument (data=gapdh.qPCR) in each call. Let's take a look at the following lines of code:
> attach(gapdh.qPCR)
> names(gapdh.qPCR)
[1] "GAPDH"    "RNA_ng"   "variable" "Ct_Value"
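Note that attach() can cause confusion if an object with the same name as one of the columns already exists in your workspace. An alternative, shown here as a minimal sketch with a small stand-in data frame, is to wrap individual calls in with(), which evaluates an expression inside the data frame without attaching it:

```r
# Minimal sketch with a small stand-in data frame;
# with() lets us reference columns by name for a single expression,
# with no attach()/detach() bookkeeping
df <- data.frame(RNA_ng   = c(50, 10, 2),
                 Ct_Value = c(16.5, 19.3, 21.7))
with(df, cor(log(RNA_ng), Ct_Value))  # a strongly negative correlation
```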
Now we can easily plot the relationship between our Ct values and the quantity of RNA by plotting Ct_Value as a function of RNA_ng as follows:
> # plots two graphs side by side
> par(mfrow=c(1,2))
> plot(RNA_ng, Ct_Value)
As you can see, if we plot the data as is, it seems to follow a curved model. Rather than fit a more complex curve, we follow the principle of parsimony and log transform the RNA_ng explanatory variable as follows:
> plot(log(RNA_ng), Ct_Value)
The result is shown in the following plot:
Now, our data is ready to fit a linear model using the lm() function as follows:
> lm(Ct_Value ~ log(RNA_ng))

Call:
lm(formula = Ct_Value ~ log(RNA_ng))

Coefficients:
(Intercept)  log(RNA_ng)
      23.87        -2.23
R reports the intercept a and slope b as coefficients. In this model, y = a + bx, the maximum likelihood estimate for a is 23.87 and b is -2.23.
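If you want to use these estimates in later calculations rather than read them off the printout, coef() returns them as a named vector. The following sketch rebuilds the long-format data directly (the same 18 values as in the melted data frame above) so that it runs on its own:

```r
# Long-format data assembled from the triplicate columns above
gapdh.qPCR <- data.frame(
  RNA_ng   = rep(c(50, 10, 2, 0.4, 0.08, 0.016), times = 3),
  Ct_Value = c(16.5, 19.3, 21.7, 24.5, 26.7, 36.5,   # A1
               16.7, 19.2, 21.5, 24.1, 27.0, 36.4,   # A2
               16.7, 19.0, 21.2, 23.5, 26.5, 37.2))  # A3

model <- lm(Ct_Value ~ log(RNA_ng), data = gapdh.qPCR)

# coef() returns the estimates as a named numeric vector
round(coef(model), 2)  # intercept 23.87, slope -2.23
```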
If we would like to get more information on our model, such as the residual standard error or the adjusted R-squared value, we can save the model as an object and use the summary() function to print these details and more. Let's take a look at this in the following lines of code:
> model <- lm(Ct_Value ~ log(RNA_ng))
> summary(model)

Call:
lm(formula = Ct_Value ~ log(RNA_ng))

Residuals:
    Min      1Q  Median      3Q     Max
-3.0051 -1.7165 -0.1837  1.4992  4.1063

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.8735     0.5382   44.36  < 2e-16 ***
log(RNA_ng)  -2.2297     0.1956  -11.40 4.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.282 on 16 degrees of freedom
Multiple R-squared:  0.8903, Adjusted R-squared:  0.8835
F-statistic: 129.9 on 1 and 16 DF,  p-value: 4.328e-09
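Since the purpose of a standard curve is to estimate the RNA quantity in an unknown sample from its measured Ct value, we can also invert the fitted model: from Ct = a + b*log(RNA) we get RNA = exp((Ct - a)/b). The sketch below rebuilds the model so it runs on its own, and the Ct value of 20 is a made-up measurement for illustration:

```r
# Rebuild the standard-curve model (same data as in the example)
gapdh.qPCR <- data.frame(
  RNA_ng   = rep(c(50, 10, 2, 0.4, 0.08, 0.016), times = 3),
  Ct_Value = c(16.5, 19.3, 21.7, 24.5, 26.7, 36.5,
               16.7, 19.2, 21.5, 24.1, 27.0, 36.4,
               16.7, 19.0, 21.2, 23.5, 26.5, 37.2))
model <- lm(Ct_Value ~ log(RNA_ng), data = gapdh.qPCR)

# Invert Ct = a + b*log(RNA) to estimate the quantity for a new sample
a <- coef(model)[[1]]
b <- coef(model)[[2]]
ct_unknown <- 20                      # hypothetical measured Ct value
rna_est <- exp((ct_unknown - a) / b)  # estimated quantity in ng
rna_est                               # roughly 5.7 ng on this fit
```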
You can also see the ANOVA table for the analysis using the summary.aov() function as follows:
> summary.aov(model)
            Df Sum Sq Mean Sq F value   Pr(>F)
log(RNA_ng)  1  676.1   676.1   129.9 4.33e-09 ***
Residuals   16   83.3     5.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can check that your data meets the assumptions of a linear model, namely that the variance is constant and the errors are normally distributed, by plotting the model object as follows:
> par(mfrow=c(2,2))
> plot(model)
The result is shown in the following plot:
In the preceding plot, the four graphs allow you to inspect your model in more detail. The first plot shows the residuals against the fitted values. For your model to be valid, the residuals (or error terms) must have a mean of zero and constant variance, which means that the residuals should be scattered randomly around zero. In this example, the errors do seem to be scattered randomly. Of course, with so few points it's not entirely clear, but ideally you want to avoid situations where the residuals increase or decrease with the fitted values, since such patterns indicate that the errors may not have constant variance. The second plot is a qqnorm plot, which allows us to check that the errors are normally distributed; ideally, the points should fall along a straight line, with no sign of a skewed distribution. The third plot is a repeat of the first, except that it shows the square root of the absolute standardized residuals against the fitted values. Again, you want to avoid trends in the data; for example, we would have a problem if the points formed a triangle in which the scatter of the residuals increases with the fitted values. Finally, the last plot shows the standardized residuals against leverage, with contours of Cook's distance, which highlights the points with the greatest influence on the parameter estimates. In this case, the point with the most influence on our model is observation 18. To see the values for this point, we just need to use its index as follows:
> RNA_ng[18]
[1] 0.016
> Ct_Value[18]
[1] 37.2
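Instead of reading the index off the diagnostic plot, you can compute the Cook's distances directly with cooks.distance(). This sketch rebuilds the data and model so it runs on its own:

```r
# Same long-format data and model as in the example
gapdh.qPCR <- data.frame(
  RNA_ng   = rep(c(50, 10, 2, 0.4, 0.08, 0.016), times = 3),
  Ct_Value = c(16.5, 19.3, 21.7, 24.5, 26.7, 36.5,
               16.7, 19.2, 21.5, 24.1, 27.0, 36.4,
               16.7, 19.0, 21.2, 23.5, 26.5, 37.2))
model <- lm(Ct_Value ~ log(RNA_ng), data = gapdh.qPCR)

# cooks.distance() returns one value per observation;
# the largest one identifies the most influential point
cd <- cooks.distance(model)
which.max(cd)  # observation 18, the (0.016, 37.2) point
```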
Say we want to see the effect of removing this point on the adjusted R-squared value. We can update our model and get the summary as follows:
> model2 <- update(model, subset=(Ct_Value != 37.2))
> summary(model2)

Call:
lm(formula = Ct_Value ~ log(RNA_ng), subset = (Ct_Value != 37.2))

Residuals:
   Min     1Q Median     3Q    Max
-2.373 -1.422 -0.470  1.033  4.275

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.6135     0.4970   47.51  < 2e-16 ***
log(RNA_ng)  -2.0825     0.1878  -11.09 1.26e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.047 on 15 degrees of freedom
Multiple R-squared:  0.8913, Adjusted R-squared:  0.8841
F-statistic: 123 on 1 and 15 DF,  p-value: 1.259e-08
As you can see, by removing the (0.016, 37.2) point from the model, we only slightly increase the adjusted R-squared from 0.8835 to 0.8841. We can also look at the effect of this change on the slope of the curve by using the following lines of code:
> model2

Call:
lm(formula = Ct_Value ~ log(RNA_ng), subset = (Ct_Value != 37.2))

Coefficients:
(Intercept)  log(RNA_ng)
     23.613       -2.083
As you can see, the slope went from -2.23 to -2.083. Here, we might decide to simply note that the point was influential or gather more data. However, if removing the data point significantly improved the model fit, we would leave the point out and use that model instead.