Extending the linear framework

As discussed in the previous chapter, the basic idea underlying linear regression is that the value of a dependent variable can be predicted by the following equation describing a line:

Y = B0 + B1X1 + B2X2 + ... + BnXn

Here, the dependent variable Y has a linear relationship with a set of X values (that is, X values that are all raised to the power of 1). Of course, the various X values themselves can be nonlinear functions of other predictor variables; thus, by performing linear regression on nonlinear transformations of predictor variables, we will be able to model nonlinear relationships between variables.
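To make this concrete, here is a minimal sketch using simulated data (the variable names are purely illustrative and not part of the chapter's dataset). The relationship between y and x is nonlinear, but because y is linear in the transformed predictor log(x), ordinary linear regression still applies:

set.seed(1)
# Simulated data: y is a linear function of log(x) plus noise
x <- runif(100, 1, 10)
y <- 2 + 3 * log(x) + rnorm(100, sd = 0.2)
# Linear regression on a nonlinear transformation of the predictor
fit.log <- lm(y ~ log(x))
summary(fit.log)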

Polynomial regression

The simplest way to extend the linear framework to nonlinear relationships is through polynomial regression. The idea here is that some of the predictor variables are squared or cubed, and the squares or cubes of these predictor variables are themselves treated as distinct predictors. For example, let's say that we want to fit a second degree polynomial, as shown by the equation:

Y = B0 + B1X + B2X^2

This is not a linear regression formula, but we can make it one by declaring a new variable X2 and setting it equal to X^2, as shown in the following equation:

Y = B0 + B1X1 + B2X2

In the preceding equation, X1 = X and X2 = X^2.

This way, we simply solve for B0, B1, and B2 as we would in typical least squares linear regression.
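As a quick check of this idea, the following minimal sketch uses simulated data (the variable names are illustrative). Fitting the quadratic directly with I() and fitting an ordinary multiple regression on a manually created squared column give the same coefficient estimates:

set.seed(2)
x <- seq(0, 10, length.out = 50)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(50)
x2 <- x^2                               # the new variable equal to x squared
fit.direct <- lm(y ~ x + I(x^2))        # quadratic term declared inline
fit.substituted <- lm(y ~ x + x2)       # same model after the substitution
coef(fit.direct)
coef(fit.substituted)                   # the estimates match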

Performing a polynomial regression in R

As can be seen in the previous plots, the relationship between height and age is clearly nonlinear. For starters, we will just fit a line to the data to see what we get:

fit.linear <- lm(height ~ age)
summary(fit.linear)

Based on the R-squared value of 0.304, it might at first appear that a linear model fits the data well. We know that this is not true because visual inspection of the data, prior to model fitting, showed a nonlinear relationship. The fitted line clearly misses the key features of the data: the rapid change in height in youth followed by a tapering off of height in older age. Thus, it is a poor model of height versus age, as the following plot shows:

plot(age, height, pch = 16, col = 'gray', xlab = 'Age', ylab = 'Height', main = 'Height vs Age')
points(age, fit.linear$fitted, pch = 16, cex = 0.1)

The result is shown in the following plot:

[Plot: Height vs Age, with the linear model's fitted values overlaid on the data]

This can also be seen using visual methods for regression diagnostics. Here, we will focus on the residuals versus fitted plots, which we will compare subsequent models against, as shown in the following code:

plot(fit.linear, which = 1)
[Plot: Residuals versus Fitted values for the linear model]

Here, we see large residuals across the range of fitted values, with a clear pattern: very negative, then very positive, followed by somewhat negative residuals.

How can we then fit a model in the absence of a theoretical formula relating these two variables? The answer is polynomial regression. This is in fact a kind of linear regression, but the regression is performed on a combination of basis functions of X, where each basis function raises X to a successively higher power.

Unfortunately, we don't know in advance what degree of polynomial is required, so we will start with low-order polynomials and work our way up. We use the I() function to tell R that the expression inside the parentheses should be evaluated arithmetically, as written, rather than being interpreted as part of the formula syntax, as follows:

fit.quadratic <- lm(height ~ age + I(age^2))
plot(fit.quadratic, which = 1)
fit.cubic <- lm(height ~ age + I(age^2)+ I(age^3))
plot(fit.cubic, which = 1)
fit.quartic <- lm(height ~ age + I(age^2)+ I(age^3)+ I(age^4))
plot(fit.quartic, which = 1)
fit.quintic <- lm(height ~ age + I(age^2)+ I(age^3)+ I(age^4)+ I(age^5))
plot(fit.quintic, which = 1)
fit.sextic <- lm(height ~ age + I(age^2)+ I(age^3)+ I(age^4)+ I(age^5)+ I(age^6))
plot(fit.sextic, which = 1)
fit.septic <- lm(height ~ age + I(age^2)+ I(age^3)+ I(age^4)+ I(age^5) + I(age^6)+ I(age^7))
plot(fit.septic, which = 1)

For the sake of saving space, we don't show all of the plots; however, as can be seen in the following diagram, with each higher order polynomial, the residuals versus fitted plot improves:

[Plot: Residuals versus Fitted values for successive higher-order polynomial fits]

The approach of fitting high-order polynomials might be difficult for many readers to stomach, for a variety of reasons. To begin with, it can just seem intuitively unreasonable. Additionally, once you get to a fourth-order polynomial, it is not clear whether higher-order polynomials are any better, yet we still had to go to at least a fourth-order polynomial in our modeling exercise. It is generally recommended to use polynomials of no more than cubic order to avoid the problem of nonconvergence.
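One way to make the question of whether a higher-order term helps more concrete is to compare the nested fits formally. This is a minimal sketch, assuming the polynomial fits created above are still in the workspace; anova() performs an F-test for each additional term:

# Compare the nested polynomial models; each row tests the added term
anova(fit.linear, fit.quadratic, fit.cubic, fit.quartic, fit.quintic)
# Rows with large p-values suggest the extra term adds little beyond the
# lower-order fit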

Nonconvergence is a problem in many estimation algorithms, and a problem with high-order polynomials in particular. Practically speaking, this problem can result in a polynomial that produces wildly fluctuating interpolated values. This can generally be avoided if no term in the polynomial is raised to a power higher than three. The problem lies in the distribution of interpolation points. Try the following code illustrating Runge's phenomenon (a prototypical example of the nonconvergence problem), as an example:

First, we use data that has equally spaced interpolation points, as shown in the following code:

runge <- function(x) {return(1/(1+x^2))}
x<-seq(-5,5, 0.5)
y <- runge(x)
plot(y~x)
fit.runge <- lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5)+I(x^6))
lines(fit.runge$fitted ~ x)

The circles represent the actual points of the function, which one's intuition might suggest follow a simple bell-shaped curve, but the polynomial curve fitted to these points oscillates wildly at the edges.

We can reduce this fluctuation by placing more interpolation points at the extremes of the interpolation interval, that is, by using data with a high density of points at the extremes and only a few points in the middle, as follows:

x2<-c(seq(-5,-4.05, 0.05),seq(-4, 4, 1), seq(4.05, 5, 0.05))
y2 <- runge(x2)
fit.runge.2 <- lm(y2~x2+I(x2^2)+I(x2^3)+I(x2^4)+I(x2^5)+I(x2^6))
lines(fit.runge.2$fitted ~ x2, col = 'red')

This technique helps to deal with nonconvergence in made-up data, in which we get to choose the interpolation points. Unfortunately, with real data, we don't get to choose where our interpolation points lie, so this solution can't be used.

The truth is that the relationship that we graphically see does not look like any common polynomial, and as such, it is really a bit unfair to apply polynomial regression to height versus age for all ages. A better approach might be to acknowledge that there are likely to be two distinct regimes; one in which people are growing and maturing and the other in which their growth has leveled off and might be declining. Let's start by breaking the sample into two age groups:

  • Those up to the age of 18, whom we expect to be growing
  • Those over the age of 18, for whom we expect growth has completed

The following code is used to split the sample:

detach(body.measures)
youths <- which(body.measures$age %in% c(2:18))
adults <- which(body.measures$age %in% c(19:80))
body.measures.youths <- body.measures[youths,]
body.measures.adults <- body.measures[adults,]
attach(body.measures.youths)
plot(height ~ age)

The result is shown in the following plot:

[Plot: Height versus Age for youths aged 2 to 18]

We can understand from the previous plot that there is a curvilinear relationship between age and height. Furthermore, a visual inspection shows that this sort of curve looks like the kind that might be amenable to polynomial modeling. Let's have a look at the following code:

fit.cubic.youths <- lm(height ~ age + I(age^2)+ I(age^3))
plot(fit.cubic.youths, which = 1)

Here, we see a nice plot of residuals versus fitted, with residuals simply showing random errors rather than an organized pattern of misfit. Let's look at how the predicted model looks in comparison to the data:

plot(age, height, pch = 16, col = 'gray', xlab = 'Age', ylab = 'Height', main = 'Height vs Age (in youths)')
points(fit.cubic.youths$fitted ~ age, pch = 16, cex = 1)

The result is shown in the following plot:

[Plot: Height vs Age (in youths), with the cubic model's fitted values overlaid on the data]

We can also look at the standard model parameters and the proportion of variance explained: this cubic model has an R-squared of 0.93, an excellent value. We can use the same summary command as we did in linear regression (explained in Chapter 3, Linear Models):

summary(fit.cubic.youths)

We could then repeat the same procedure for those aged 19 and above:

detach(body.measures.youths)
attach(body.measures.adults)
fit.cubic.adults <- lm(height ~ age + I(age^2)+ I(age^3))
plot(age, height, pch = 16, col = 'gray', xlab = 'Age', ylab = 'Height', main = 'Height vs Age (in adults)')
points(fit.cubic.adults$fitted ~ age, pch = 16, cex = 1)

The result is shown in the following plot:

[Plot: Height vs Age (in adults), with the cubic model's fitted values overlaid on the data]

We have, in effect, created two different functions for predicting height: one for those aged 18 and under and one for those aged 19 and older. This piece-wise approach is the basic idea behind splines.
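To see this piece-wise idea in code, here is a minimal sketch, assuming fit.cubic.youths and fit.cubic.adults from above are still in the workspace; the helper name height.from.age is purely illustrative:

# Route each age to the model fitted for its regime (18 and under vs 19 and over)
height.from.age <- function(new.age) {
  ifelse(new.age <= 18,
         predict(fit.cubic.youths, newdata = data.frame(age = new.age)),
         predict(fit.cubic.adults, newdata = data.frame(age = new.age)))
}
height.from.age(c(10, 18, 19, 40))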

Spline regression

To fit curved relationships without high-degree polynomials, we can use splines. A spline is essentially a piece-wise function in which the regression formula changes based on the X value. There are a large number of approaches to splines, and fortunately, R has more sophisticated approaches than our previous one.

Each point where there is a break in the regression function is called a knot. Our previous approach had a single knot located between 18 and 19 years. In our piece-wise regression, the estimated value at 18 and the estimated value at 19 could be completely different. True splines stipulate that the fitted function is continuous at each knot.
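Before turning to the function we will use, here is a minimal sketch of a true regression spline, assuming the splines package that ships with base R and the full body.measures data frame used earlier in this chapter. The bs() function builds a cubic B-spline basis with a single knot at age 18, so unlike our two separate fits, the fitted curve is continuous at the knot:

library(splines)                                    # ships with base R
# Cubic B-spline basis with one knot at age 18
fit.bspline <- lm(height ~ bs(age, knots = 18, degree = 3),
                  data = body.measures)
summary(fit.bspline)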

As hinted at the beginning of this chapter, the role of nonlinear methods is debatable, and as we have progressed, we have started to use models in a less theory-driven and more exploratory manner. In line with this progression, splines make use of a parametric framework, but they break the data into pieces to do so, so it takes more than just a few parameters to summarize the fitted function.

One of the most straightforward functions for fitting splines in R is the smooth.spline function, available in base R. To demonstrate this, we will go back to the full dataset, as follows:

detach(body.measures.adults)
attach(body.measures)
fit.spline.smooth <- smooth.spline(height ~ age, nknots = 4)
plot(age, height, pch = 16, col = 'gray', xlab = 'Age', ylab = 'Height', main = 'Height vs Age')
lines(fit.spline.smooth, pch = 16)

The result is shown in the following plot:

[Plot: Height vs Age with the fitted smoothing spline curve]

This gives us a smooth representation of the relationship that appears to capture the quirks in the data. However, it does not tell us much about the mathematical properties of that relationship, making it best suited for exploratory analyses.
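If we do want numeric predictions from the fitted smoothing spline, predict() returns them for any new ages; this is a minimal sketch, and the ages chosen here are arbitrary:

# Predicted heights at a few arbitrary ages; the result is a list with
# components x (the ages) and y (the predicted heights)
predict(fit.spline.smooth, x = c(10, 18, 40, 65))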
