Generalized additive models

Generalized additive models (GAMs) are a semi-parametric extension of GLMs in which the linear predictor depends linearly on unknown smooth functions of some of the predictor variables. GAMs are typically used to let the data "speak for themselves", since you don't need to specify the functional form of the relationship between the response and the continuous explanatory variables beforehand. To fit a GAM to your data, you will need the gam() function from the mgcv package. It is similar to the glm() function, except that you wrap s() around each explanatory variable you wish to smooth. For example, to describe the relationship between y and smooths of three continuous explanatory variables x, z, and w, you would enter model <- gam(y ~ s(x) + s(z) + s(w)). Now let's go through a detailed example in R.
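
For instance, assuming a hypothetical data frame my.data containing the columns y, x, z, and w, a call that also spells out the family and data arguments might look like this:

> model <- gam(y ~ s(x) + s(z) + s(w), family = gaussian, data = my.data)

If you omit the family argument, gam() defaults to gaussian with an identity link, just like glm().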

In this example, we will explore the relationship between the total number of pregnancies and a variety of measurements taken from 300 mice.

Let's first simulate a dataset for 300 mice, with pregnancies as the response and glucose, pressure, insulin, and weight as explanatory variables. Since the sample() and rnorm() functions generate pseudorandom numbers, you will get different results when you run the following code in R. Let's simulate our variables with the following lines of code:

> pregnancies <- sample(0:25, 300, replace = TRUE)   # response: total number of pregnancies
> glucose <- sample(65:200, 300, replace = TRUE)
> pressure <- sample(50:120, 300, replace = TRUE)
> insulinD <- abs(rnorm(150, 450, 100))   # 150 draws from a high-insulin group
> insulinN <- abs(rnorm(150, 65, 75))     # 150 draws from a low-insulin group
> insulin <- c(insulinD, insulinN)
> weight <- sample(20:70, 300, replace = TRUE)
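
If you want your simulated values (and therefore the model output shown below) to be reproducible, you can fix the random number generator's seed before running the sampling commands above; the seed value itself is arbitrary:

> set.seed(123)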

Now let's use the gam() function from the mgcv package to explore the relationship between pregnancies and glucose, pressure, insulin, and weight, as follows:

> library("mgcv")
> mouse.data.gam <- gam(pregnancies ~ s(glucose) + s(pressure) + s(insulin) + s(weight))

> summary(mouse.data.gam)

Family: gaussian 
Link function: identity 

Formula:
pregnancies ~ s(glucose) + s(pressure) + s(insulin) + s(weight)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.3467     0.4149   32.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df     F p-value
s(glucose)  3.225  3.995 1.270   0.282
s(pressure) 1.088  1.171 0.216   0.681
s(insulin)  1.000  1.000 1.080   0.300
s(weight)   1.000  1.000 1.562   0.212

R-sq.(adj) =  0.0118   Deviance explained = 3.26%
GCV = 52.936  Scale est. = 51.646    n = 300

As you can see from the summary results, R returns the effective degrees of freedom (edf) along with other information useful in assessing the fit of our data to the GAM. Unlike in ordinary least squares regression, the degrees of freedom used by the model are not equal to the number of predictors, because the smoothing process must also be taken into account. Each smooth term is penalized to some extent during fitting, and this penalty is reflected in its edf value: an edf close to 1 indicates an essentially linear term, while larger values indicate greater curvature.
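
If you want to work with these quantities directly rather than read them off the printed summary, they are stored on the fitted object; the component names below are the ones mgcv uses for its gam objects:

> summary(mouse.data.gam)$edf    # edf of each smooth term
> sum(mouse.data.gam$edf)        # total effective degrees of freedom of the fit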

Another important term to consider when evaluating the fit of a GAM is the generalized cross validation (GCV) score, which is used as an estimate of the mean square prediction error. We can use the GCV as a comparative measure to choose between different models, where the lower the GCV value, the better the model. To read more about the information returned in the gam object summary, you can consult the documentation page as follows:

> ?summary.gam
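
The minimized GCV (or UBRE) score is also stored directly on the fitted object, in the gcv.ubre component, which is convenient when comparing several candidate models programmatically:

> mouse.data.gam$gcv.ubre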

To plot the model and inspect the 95 percent Bayesian confidence intervals, which help us determine whether the estimated curvature is real or not, we can use the following commands:

> par(mfrow=c(2,2))
> plot(mouse.data.gam)

The result is shown in the following plot:

[Figure: the four smooth terms, s(glucose), s(pressure), s(insulin), and s(weight), each plotted with its 95 percent confidence band]
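
The plot.gam() method also accepts a few arguments worth knowing about: pages = 1 arranges all the smooths on a single page (so you can skip the par() call), residuals = TRUE overlays the partial residuals on each smooth, and shade = TRUE draws the confidence regions as shaded bands rather than dashed lines:

> plot(mouse.data.gam, pages = 1, residuals = TRUE, shade = TRUE)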

Alternatively, we can use the vis.gam() function to visualize the fitted surface over two of the main effect terms in a perspective plot, in this case glucose and pressure, as follows:

> par(mfrow=c(1,1))
> vis.gam(mouse.data.gam,theta=-35,color="topo")

The result is shown in the following plot:

[Figure: perspective plot of the fitted surface over glucose and pressure produced by vis.gam()]
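
By default, vis.gam() builds its surface from the first two suitable predictors in the model formula; the view argument lets you pick the pair explicitly, and plot.type = "contour" requests a contour plot instead of a perspective plot:

> vis.gam(mouse.data.gam, view = c("glucose", "pressure"), plot.type = "contour", color = "topo")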

We can also use the gam.check() function to inspect our model as follows:

> gam.check(mouse.data.gam) 

Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 13 iterations.
The RMS GCV score gradient at convergence was 1.813108e-06 .
The Hessian was positive definite.
The estimated model rank was 37 (maximum possible: 37)
Model rank =  37 / 37 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

               k'   edf k-index p-value
s(glucose)  9.000 3.225   1.030    0.70
s(pressure) 9.000 1.088   0.999    0.44
s(insulin)  9.000 1.000   1.106    0.96
s(weight)   9.000 1.000   1.058    0.82
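
Here the k-index values are close to or above 1 and the p-values are large, so the default basis dimensions look adequate. Had a smooth been flagged (a low p-value with an edf close to k'), the usual remedy is to refit with a larger basis dimension via the k argument of s() and run gam.check() again; the model name below is purely illustrative:

> mouse.data.gam.k <- gam(pregnancies ~ s(glucose, k = 20) + s(pressure) + s(insulin) + s(weight))
> gam.check(mouse.data.gam.k)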

Let's see what happens to the fit of the model when we remove all but one of the explanatory variables, keeping only the smooth of insulin, as follows:

> mouse.data.gam2 <- gam(pregnancies ~ s(insulin))
> summary(mouse.data.gam2)

Family: gaussian 
Link function: identity 

Formula:
pregnancies ~ s(insulin)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.3467     0.4174   31.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
           edf Ref.df     F p-value
s(insulin)   1      1 0.918   0.339

R-sq.(adj) =  -0.000276   Deviance explained = 0.307%
GCV = 52.626  Scale est. = 52.275    n = 300

To better compare the two models, we can use the AIC() function, which calculates the Akaike information criterion (AIC) to give us an idea of the trade-off between the goodness of fit and the complexity of each model. Generally speaking, a good GAM has a low AIC and the fewest degrees of freedom. Let's take a look at this in the following lines of code:

>  AIC(mouse.data.gam, mouse.data.gam2)
                      df      AIC
mouse.data.gam  8.313846 2045.909
mouse.data.gam2 3.000000 2044.313
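
Because the second model is nested within the first, another way to compare them is with an approximate F-test via anova(); note that the p-values produced by this comparison are only approximate:

> anova(mouse.data.gam2, mouse.data.gam, test = "F")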

As you can see, both the degrees of freedom and the AIC value are slightly lower for mouse.data.gam2, making the second GAM a better model than the first. That being said, GAMs are best used in the preliminary phase of your analysis to broadly explore the potential relationships between your response and explanatory variables. Some people find it useful to examine the shape of a curve with a GAM and then reconstruct that shape parametrically with a GLM for model building, since a GLM is preferable to the more complex GAM. As a general rule of thumb, it is usually better to rely on a simple, well-understood model to predict future cases than on a complex model that is difficult to interpret and summarize.
