Nonparametric nonlinear methods

We will begin by taking a moment to consider linear models in a very general context, since many nonlinear models can be seen as generalizations of their linear counterparts.

We traditionally see the linear model expressed as an equation giving a variable Y in terms of a linear predictor term, $\beta X$, as follows:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

The equation is the classic algebraic formula describing a line. However, we can describe regression, linear or otherwise, in a more general framework. What if we don't actually know the formula relating Y and X to one another? We can still predict the Y value for a given X value without any idea of the algebraic relationship between Y and X, so long as we rely on the observed X and Y values, simply taking the mean of the observed Y values at any given X value:

$$\hat{Y}(x) = E\left[Y \mid X = x\right] = \operatorname{mean}\{\, y_i : x_i = x \,\}$$

The previous formula is a kind of point-wise regression, specifically point-wise mean regression. We define a function $\hat{Y}(x)$, which is equal to the expected value of Y for a given value of X, and this expected value is equal to the mean of all observed Y values for that particular value of X. The estimated value of Y for one particular X value is completely independent of the estimated value of Y for all other X values. We can implement a function to do this kind of regression in R relatively easily, as follows:

pointwise.regression <- function(x, y) {
  # All integer X values from the minimum to the maximum observed x
  X <- min(x):max(x)
  # Preallocate the vector of mean Y values
  Y <- vector('numeric', length(X))
  for (i in X) {
    # Mean of the observed y values at this particular x value
    Y[i - min(x) + 1] <- mean(y[x == i])
  }
  return.frame <- data.frame(X, Y)
  return(return.frame)
}

This function takes two vectors as inputs, x and y; extracts the mean value of y for each integer value in the range of x; and returns a data frame with each X value and its corresponding mean Y value. (Note the capitalization of x versus X and y versus Y in the function.)

Tip

There are a number of ways to use loops in R to iteratively construct data objects with multiple members (for example, vectors, data frames, or matrices). In general, the slowest way to do this is to lengthen the data object in each iteration, because this requires reallocating the data object in memory at every pass to accommodate the larger object. A much faster method is to create a data object of the required length up front and then insert the appropriate values into their positions, as the following sketch illustrates.
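
As a small illustration (the vector and loop here are hypothetical, not part of the height versus age example), compare growing a vector inside a loop with preallocating it; wrapping each loop in system.time() will show the difference:

n <- 100000

# Slow: the vector is reallocated and copied on every iteration
v <- numeric(0)
for (i in 1:n) {
  v <- c(v, i^2)
}

# Fast: allocate the full length once, then fill positions in place
v <- numeric(n)
for (i in 1:n) {
  v[i] <- i^2
}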

Using our point-wise regression function, we can create a plot of the expected (mean) height for each age value. By plotting lines based on these data points, R will connect each of the estimated points with a short straight line segment and we will, in a sense, have a nonparametric curve describing the trend in the data, as follows:

expected.height <- pointwise.regression(age, height)
plot(age, height, pch = 16, col = 'gray', xlab = 'Age', ylab = 'Height', main = 'Height vs Age')
lines(expected.height)

The result is shown in the following plot:

(Figure: Height vs Age scatterplot with the point-wise regression curve overlaid)

We can, of course, easily extend our point-wise regression function to output a data frame with 95 percent confidence intervals for the estimated Y values, based on the t-distribution or a distribution of our choice:

pointwise.confint <- function(x, y) {
  X <- min(x):max(x)
  # Preallocate a list with one element per X value (note: vector('list', n),
  # not list(...), which would create a list of only two elements)
  Y.list <- vector('list', length(X))
  for (i in X) {
    # t.test provides the mean and its 95 percent confidence interval
    t.temp <- t.test(y[x == i])
    Y.list[[i - min(x) + 1]] <- c(t.temp$estimate[[1]], t.temp$conf.int[1], t.temp$conf.int[2])
  }
  # Bind the per-X results into a matrix with one row per X value
  Y.mat <- do.call('rbind', Y.list)
  return.frame <- data.frame(cbind(X, Y.mat))
  names(return.frame) <- c('X', 'Y', 'Lower.Y', 'Upper.Y')
  return(return.frame)
}

As we will discuss further, there are other, often more useful, methods to model nonlinear relationships in data, but the pointwise.confint function is useful because it is easily extensible; the elements of Y.list can be filled with the output of any set of functions the user would like, as in the following sketch.
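
For example, here is a sketch of one possible extension (a hypothetical pointwise.quantiles function, not from the text), filling the list with medians and empirical 2.5th and 97.5th percentiles instead of t-based intervals; like the original, it assumes every integer X value in the range has at least one observation:

pointwise.quantiles <- function(x, y) {
  X <- min(x):max(x)
  Y.list <- vector('list', length(X))
  for (i in X) {
    # Median and empirical 95 percent interval instead of a t-based interval
    Y.list[[i - min(x) + 1]] <- quantile(y[x == i], probs = c(0.5, 0.025, 0.975))
  }
  Y.mat <- do.call('rbind', Y.list)
  return.frame <- data.frame(cbind(X, Y.mat))
  names(return.frame) <- c('X', 'Y', 'Lower.Y', 'Upper.Y')
  return(return.frame)
}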

Point-wise regression is a term not often used within data analysis, but the approach is quite commonly used as a visual tool, going under names such as box and whisker plots or candlestick charts. R contains a base function for drawing box and whisker plots:

boxplot(height ~ age, xlab = 'Age', ylab = 'Height', main = 'Height vs Age')

The result is shown in the following plot:

(Figure: Box and whisker plot of Height vs Age)

The box and whisker plot has a rectangular box at each X value. The line through the middle of that box is the median, and the bottom and top of the box are the values of the first and third quartiles respectively. The whiskers, by default, extend to the most extreme data points lying within 1.5 times the interquartile range (that is, 1.5 times the length of the box) of the box, and the individual points beyond the whiskers represent outliers.
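
The underlying numbers can be inspected directly with the base boxplot.stats function; for example, for the heights observed at a single age (age 30 here, chosen arbitrarily):

stats.30 <- boxplot.stats(height[age == 30])
stats.30$stats   # lower whisker, first quartile, median, third quartile, upper whisker
stats.30$out     # points flagged as outliers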

Tip

The boxplot command in R can take the add = TRUE argument, which adds the boxplots to an already existing plot. However, the boxplots will not line up with the existing plot unless the X values of the pre-existing plot start at one. If they start at some other value, then the at argument must be passed to boxplot, as follows:

plot(height ~ age, col = 'gray', pch = 16, xlim = c(2, 80), xlab = '', ylab = '', xaxt = 'n')
boxplot(height ~ age, xlab = 'Age', ylab = 'Height', main = 'Height vs Age', add = TRUE, at = c(2:80))

The most notable feature of this point-wise approach to value estimation, or of the use of box and whisker plots, is that it treats the estimated Y value for each X value as completely independent of the Y values for all other X values. Of course, this is likely to ignore important information: a 50-year-old male who is 5 feet tall today might have been 5'1" last year, but he was almost certainly not 6 feet tall. As such, it makes sense to model the Y value at any given X as having some relationship with adjacent X values as well. We do this by assuming a function, w, to describe the relationship between the observed data points. This adds an additional assumption to the model by requiring that the estimated Y value for each X value be tied somehow to the estimated Y values for other X values.

$$\hat{Y}(x) = \sum_{i=1}^{n} w(x, x_i)\, y_i$$

If we define w as follows (note the variance term in the denominator), we get a line:

$$w(x, x_i) = \frac{1}{n} + \frac{(x - \bar{x})(x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$
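
To see that these weights really do reproduce a straight line, here is a minimal sketch on simulated data (the simulated x.sim and y.sim variables and the linear.weights helper are our own illustration, not from the text), comparing the weighted-sum prediction with the prediction from lm():

set.seed(1)
x.sim <- rnorm(100)
y.sim <- 2 + 3 * x.sim + rnorm(100)

linear.weights <- function(x0, x) {
  # 1/n plus a covariance-style term, with the variance term in the denominator
  1 / length(x) + (x0 - mean(x)) * (x - mean(x)) / sum((x - mean(x))^2)
}

x0 <- 1.5
sum(linear.weights(x0, x.sim) * y.sim)                        # weighted-sum prediction
predict(lm(y.sim ~ x.sim), newdata = data.frame(x.sim = x0))  # should match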

In the next sections, we will discuss the use of weighting functions that relate Y at a given X to neighboring X values as well.

Point-wise regression will capture every kink in the direction of the relationship between two variables. This is good if we think every kink is important, but more often than not these kinks reflect sample-specific errors, which is why weighting functions are useful.

Kernel regression

Kernel smoothing regression is a flexible method to fit a smooth curve to nonlinear data. Instead of estimating the expected Y value at each X value as a function of that particular X alone, it uses a weighted distribution of surrounding points. How these surrounding points are weighted is determined by the particular kernel chosen.

As discussed in the earlier section on theory, regression can be thought of as a weighted sum of the observed Y values, with the weights given by a function w. In the case of kernel regression, the Nadaraya-Watson estimator can be expressed as follows:

$$\hat{Y}(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)}$$

In the preceding formula, K is a kernel function and h is the bandwidth (that is, the smoothing parameter). For example, to apply Gaussian kernel smoothing, we use the probability density function of the normal distribution as K. The bandwidth can be either fixed or based on nearest neighbors. A fixed bandwidth uses all points within the defined range, while a nearest-neighbor bandwidth uses a fixed proportion of the sample closest to the point of interest to create estimates. For example, if we used a kernel that weighted all values equally on the height versus age data, then a fixed bandwidth of 10 would be tantamount to estimating the height at a given age as the mean height for all ages within 10 years of the age of interest. Alternatively, if we applied the same kernel with a nearest-neighbor bandwidth of 0.25, then we would be estimating the height of someone at a given age as the mean height of the 25 percent of the sample closest to the age of interest.
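
To make the formula concrete, here is a minimal hand-rolled sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the nw.estimate function and the bandwidth value are our own illustration, not from the text; also note that ksmooth, used below, rescales its bandwidth argument internally, so the numbers will not match ksmooth exactly for the same nominal bandwidth):

nw.estimate <- function(x0, x, y, h) {
  # Gaussian kernel weights for each observation, centered on x0
  weights <- dnorm((x0 - x) / h)
  # Weighted mean of the observed Y values
  sum(weights * y) / sum(weights)
}

# For example, the estimated height at age 30 with a bandwidth of 10
nw.estimate(30, age, height, h = 10)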

To apply kernel regression based on the Nadaraya-Watson formula in R, the simplest method is to use the built-in R function ksmooth, which uses fixed window bandwidths. This might be thought of as a variant of the point-wise regression from the earlier section, in which the Y value is estimated not from a single point but from a group of adjacent points:

smooth.height <- ksmooth(age, height, bandwidth = 10, kernel = 'normal')

There are two notable arguments that we explicitly define here in addition to the arguments referencing the data: the bandwidth (h in the previous equation) and the kernel function (K in the previous equation). The bandwidth argument tells R how wide the "local" area is over which we want to regress data points. The ksmooth function requires manually declaring this bandwidth; larger bandwidths mean smoother curves and less chance of modeling noise, while smaller bandwidths can result in rougher curves. The choice of an appropriate bandwidth is obviously an important concern. Here, a bandwidth of 10 has been arbitrarily chosen, but methods for choosing "optimal" bandwidths have been described and have implementations in R, which we will discuss shortly.

The kernel argument here is declared as "normal". The other kernel option is "box" (often called a "rectangular" kernel elsewhere). A normal kernel means that R weights the data points according to a Gaussian distribution centered on the point of interest, while a box kernel gives all points within the bandwidth the same weight and all points outside the bandwidth a weight of zero.

Let's take a look at how a curve generated with kernel regression fits the data. We will also make the underlying scatterplot a little cleaner (gray points), so that we can actually see the curve:

plot(age, height, xlab = 'Age', ylab = 'Height', main = 'Height vs Age in Males', col = 'gray', pch = 16)
lines(smooth.height, col = 'red')

The result is shown in the following plot:

(Figure: Height vs Age in Males with the kernel-smoothed curve, bandwidth = 10)

The curve in the previous plot appears to fit the data relatively well but might not completely meet our needs. As indicated earlier, one of the reasons to use nonlinear methods is that a better fit to the data is required than what a linear model can provide, and this curve appears to over-estimate height for those close to the minimum age. This is caused by excess smoothing, which we can ameliorate by decreasing the bandwidth:

plot(age, height, xlab = 'Age', ylab = 'Height', main = 'Height vs Age', col = 'gray', pch = 16)
smooth.height <- ksmooth(age, height, bandwidth = 2, kernel = 'normal')
lines(smooth.height, col = 'red')

The result is shown in the following plot:

(Figure: Height vs Age with the kernel-smoothed curve, bandwidth = 2)

After a few tries, a bandwidth of about two seems to match the data better, but now the curve appears to model noise in the data as well, and the small decline in height that we previously saw at advanced ages is no longer so evident. Before discussing what can be done about this, it is worth taking a moment to think about what we are doing.

We know that children start out at a low height, grow until the end of puberty, and then remain the same height for most of their adult lives, with perhaps a small loss of height as they reach advanced ages. We are trying to match a curve to our a-priori expectations about height and age. Notably, we have some expectations about the relationship between these two variables in general, but we also have some expectations about how the relationship behaves within very narrow windows. The ksmooth function, unfortunately, only allows us to adjust the overall behavior of the kernel estimator, which means that we are forced to choose between matching the curve to our general expectations or to our very local expectations; we are unable to do both.

Kernel weighted local polynomial fitting

We previously discussed the R command ksmooth, which implements the Nadaraya-Watson kernel regression estimator; it is a fast and simple way to identify trends in data with no a-priori knowledge or assumptions about the algebraic form of those trends. However, if we wish to do kernel smoothing regression that allows us to incorporate specific expectations about the relationship between the two variables for particular ranges of values, we will need to turn to the locpoly command in the KernSmooth package.

The locpoly command extends traditional kernel regression; rather than simply using a weighted mean of values as ksmooth does, it uses kernel-weighted local polynomial regression. We will discuss polynomial regression later in this chapter, but for now we will point out that polynomial regression is basically like linear regression, except that the fitted curve includes quadratic, cubic, or higher-power terms. Linear or quadratic polynomials are typically used in kernel-weighted regression, but we could also use zero-order (constant) terms, which would make the estimator identical to the traditional Nadaraya-Watson estimator, as in the following sketch.
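
As a quick illustration of the degree argument (a minimal sketch; the bandwidth of 5 and the object names are our own arbitrary choices), a fit with degree = 0 is a local constant fit, that is, the Nadaraya-Watson estimator, while degree = 2 gives a local quadratic fit:

library(KernSmooth)
# Local constant fit (degree = 0): equivalent to the Nadaraya-Watson estimator
nw.fit <- locpoly(age, height, degree = 0, bandwidth = 5)
# Local quadratic fit (degree = 2)
quad.fit <- locpoly(age, height, degree = 2, bandwidth = 5)
plot(age, height, col = 'gray', pch = 16, xlab = 'Age', ylab = 'Height')
lines(nw.fit, col = 'red')
lines(quad.fit, col = 'blue')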

A practical advantage that locpoly provides is that it allows us to assign bandwidths locally along the range of values of the variable that we are examining. In essence, we wish to be sensitive to height differences between years in the first two decades of life, because we think that height differences here represent a real phenomenon, and we wish to be insensitive to differences in height between people of various ages beyond 20 years, because we suspect that these largely reflect random errors. In kernel smoothing, this is achieved by adjusting the bandwidth, so we need a way to assign different bandwidths to different ages, which we can do with the locpoly command (which, like ksmooth, uses fixed window bandwidths). This is shown by the following code:

library(KernSmooth)
plot(age, height, xlab = 'Age', ylab = 'Height', main = 'Height vs Age', col = 'gray', pch = 16)
bandwidth.vals <- c(rep(1, 20), rep(5, 30))
smooth.height <- locpoly(age, height, gridsize = 50, bandwidth = bandwidth.vals)
lines(smooth.height, col = 'red')

The locpoly command takes a few additional arguments that are not used in ksmooth, which we take advantage of here. In particular, we no longer specify a single bandwidth to be used over the entire range of x values but can now declare bandwidths to be as wide or narrow as we would like along that range. The gridsize argument tells R how many equally spaced points the kernel estimation will be performed at; we must supply a vector of bandwidths with the same length as the gridsize argument, in this case 50. We set the bandwidth of the first 20 grid points to one and the bandwidth of the last 30 grid points to five, effectively being very sensitive to changes in height at early ages and relatively insensitive at later ages. The curve that this gives us is still a little jagged, but overall it appears to capture what we would likely regard as the salient features of the dataset: an increasing height in the first 15 to 20 years of life, followed by a rapid plateau in height, and a slight loss of height in later years.

Tip

A wider bandwidth effectively increases the sample size for kernel estimation at any given point. Therefore, it is also worth examining the number of individuals at each point, as can be done with the hexbin command mentioned in an earlier tip; a sketch follows. In this case, the density of individuals on the left-hand side of the plot is relatively high, but if there were relatively few individuals, it is not clear whether this approach would have worked as well.
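
A minimal sketch of such a density check, assuming the hexbin package is installed (install.packages('hexbin') if it is not), might look like the following:

library(hexbin)
# Hexagonally binned counts of observations across the Age-Height plane
plot(hexbin(age, height), xlab = 'Age', ylab = 'Height')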

Optimal bandwidth selection

So far, we have been tinkering with bandwidth values to get a desired curve, which is a perfectly reasonable way to explore relationships in data but hardly seems scientific. While this kind of tinkering is used in science more often than many researchers would care to admit, it is still nice to have a more systematic method to choose an optimal bandwidth.

A number of bandwidth selection methods have been described over the past few decades. One commonly used approach is a plug-in bandwidth estimator, which seeks to minimize the mean squared error of the fit. The KernSmooth package provides such an estimator through the dpill function, which returns a single optimal bandwidth for the region over which the regression is being performed. An alternative approach to selecting the optimal bandwidth is based on cross-validation, which can be done using the npregbw function in the np package.

We can use dpill from the KernSmooth package as follows:

h <- dpill(age, height, gridsize = 80)
plot(age, height, xlab = 'Age', ylab = 'Height', main = 'Height vs Age', col = 'gray', pch = 16)
smooth.height <- locpoly(age, height, bandwidth = h, gridsize = 80, kernel = 'normal')
lines(smooth.height, col = 'red')

The result is shown in the following plot:

(Figure: Height vs Age with the locpoly curve using the dpill plug-in bandwidth)

This does not give a perfectly smooth line, but it appears to adequately capture the trends in the data that we would expect to see, and it took no tinkering on our part to force the fit to match our preconceived notions.

Using cross-validation to select the optimal bandwidth is much more computationally intensive. As such, we will demonstrate it on a smaller subsample of this dataset in a subsequent section.
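
In the meantime, a minimal sketch of what such a call might look like, assuming the np package is installed and using a small random subsample (the subsample size and object names here are illustrative only):

library(np)
set.seed(1)
idx <- sample(length(age), 500)
age.sub <- age[idx]
height.sub <- height[idx]
# Least-squares cross-validation for a local constant (Nadaraya-Watson) fit
bw <- npregbw(height.sub ~ age.sub, regtype = 'lc', bwmethod = 'cv.ls')
cv.fit <- npreg(bws = bw)
summary(cv.fit)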

A practical scientific application of kernel regression

Thus far, we have used kernel smoothing regression only to visually explore and describe data rather than to address any specific scientific questions. Some might argue that this is how kernel smoothing is designed to be used, but nevertheless we will attempt to use kernel regression to answer a particular question: At what age does growth slow down?

Operationally, this question is a matter of the second derivative of the function of height versus age. The first derivative of this function is the velocity of growth (for example, how much a person grows each year), and the second derivative is the acceleration in growth. What we are interested in is the age at which growth is decelerating the most. Since we know that men and women tend to hit puberty at different ages, we should probably separate them out and check if the growth of women slows down at a younger age than the growth of men. Luckily, the locpoly command supports this operation. We specify with the drv argument that we are interested in the second derivative (that is, acceleration):

> smooth.height.2.males <- locpoly(age[gender == 1], height[gender == 1], drv = 2, bandwidth = h, gridsize = 80, kernel = 'normal')
> smooth.height.2.males$x[smooth.height.2.males$y == min(smooth.height.2.males$y)]
[1] 13.8481
> smooth.height.2.females <- locpoly(age[gender == 2], height[gender == 2], drv = 2, bandwidth = h, gridsize = 80, kernel = 'normal')
> smooth.height.2.females$x[smooth.height.2.females$y == min(smooth.height.2.females$y)]
[1] 11.87342

These results suggest that women do, in fact, have their growth slow down at a younger age than men, as we expected.

Like point-wise regression, weighted regression can be used for exploratory analyses, albeit with a weighting function that allows some smoothing of the noise in the data. The addition of this weighting function also allows us to gain some insights into the relationship in the data beyond simple plotting.

Locally weighted polynomial regression and the loess function

In the previous section, we discussed methods to fit nonlinear data but the methods that we discussed are applicable only in cases in which we wish to model the relationship between two variables. In real data analysis, we often wish to model the relationship between a larger number of variables or model an outcome as a function of a number of predictors. For this, the ksmooth() and locpoly() functions simply won't do as they are limited to comparing only two variables. However, as we will see, R has a built-in function, loess, which is able to support kernel regression in multiple dimensions.

The loess command has a number of distinct differences from the kernel smoothing methods described earlier. Chiefly, loess uses a bandwidth based on nearest neighbors rather than a fixed bandwidth.

Tip

While we focus here on the loess command, R also contains a similar function, lowess, which is the predecessor to loess and does not support multiple predictor variables. The loess command can model up to four predictor variables and, through its family argument, can perform either least-squares ("gaussian") or robust ("symmetric") fitting, whereas lowess can only handle a single predictor variable. Both functions use a tricubic weighting kernel.
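
Before moving to multiple predictors, a minimal single-predictor sketch on the height versus age data shows how the span argument (the nearest-neighbor bandwidth) controls the smoothness of the fit; the span value here is simply the default:

loess.fit <- loess(height ~ age, span = 0.75)
plot(age, height, col = 'gray', pch = 16, xlab = 'Age', ylab = 'Height')
# Order by age so the fitted values draw as a single smooth curve
ord <- order(age)
lines(age[ord], predict(loess.fit)[ord], col = 'red')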

We can model weight as a function of height and age. We declare the model here in the same formula style as a linear regression model (in this case, only on males), as follows:

male.weight <- weight[gender == 1]
male.age <- age[gender == 1]
male.height <- height[gender == 1]
weight.fit <- loess(male.weight ~ male.age * male.height, span = 1, family = 'gaussian')

We can then use the weight.fit model we created to estimate the expected weight of a person for a given age and height. However, unlike linear regression, which gives us a simple formula that we can compute by hand knowing only a few coefficients, we have to rely on the software to give us predictions from a potentially highly nonlinear model. To do this, we use the predict command on a grid of new values and then plot the resulting surface. By looking at the object returned by predict (predicted.weight in the following code), we can get the raw estimated values for each combination of age and height:

age.vals <- seq(from = 2, to = 85, by = 1)
height.vals <- seq(from = 80, to = 200, by = 1)
predicted.weight <- predict(weight.fit, newdata = expand.grid(male.age = age.vals, male.height = height.vals))
persp(age.vals, height.vals, predicted.weight, theta = 40, xlab = 'Age', ylab = 'Height', zlab = 'Weight')

The result is shown in the following plot:

(Figure: Perspective plot of predicted Weight as a function of Age and Height)

We can adjust the view on the plot to see the effects of multiple dimensions by changing the theta argument of the persp command. In this case, a value of 40 seems to work well.

Generally, in science, we want to know more than the estimate of a value; we also want to have a confidence interval telling us the potential range of values within a given probability window (often 95 percent). The loess command is not able to do this with multiple predictors but the npreg command from the np package does have this capability.
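
A minimal sketch of how that might look with npreg, assuming the np package is installed (bandwidth selection can be slow on large datasets, and the plotting option shown is one of several described in the package documentation):

library(np)
# Cross-validated bandwidths for a local linear fit with two predictors
bw.multi <- npregbw(male.weight ~ male.age + male.height, regtype = 'll')
np.fit <- npreg(bws = bw.multi)
# Plot the estimated regression with pointwise error bands
plot(np.fit, plot.errors.method = 'asymptotic')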
