Regression with MARS

In this part of the process, we build a model with the earth package, review it on the training data, and then see how it performs on the test data. We'll let 10-fold cross-validation choose the final set of terms:

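> set.seed(1492)

> earth_fit <-
    earth::earth(
      x = pca_scores[, 1:5],  # the five principal component scores
      y = pca_scores[, 6],    # the response
      pmethod = 'cv',         # choose the number of terms by cross-validation
      nfold = 10,             # 10 folds
      degree = 1,             # additive model, no interaction terms
      minspan = -1            # negative value: at most one knot per predictor
    )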
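To make the idea behind nfold = 10 concrete, here is a minimal base R sketch of 10-fold cross-validation that computes an out-of-fold R-squared per fold. It is only an illustration of the general technique, not earth's internal code: the toy data frame, the fold_id vector, and the use of lm() as a stand-in learner are all assumptions made for the example.

```r
set.seed(1492)

# Hypothetical toy data; not the soldier data used in this chapter
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.5)

# Randomly assign every row to one of k folds of equal size
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, then score the fold that was held out
oof_rsq <- vapply(seq_len(k), function(i) {
  train    <- toy[fold_id != i, ]
  held_out <- toy[fold_id == i, ]
  fit  <- lm(y ~ x1 + x2, data = train)  # stand-in learner, not MARS
  pred <- predict(fit, newdata = held_out)
  rss  <- sum((held_out$y - pred)^2)
  tss  <- sum((held_out$y - mean(held_out$y))^2)
  1 - rss / tss
}, numeric(1))

mean_oof_rsq <- mean(oof_rsq)  # average out-of-fold R-squared
sd_oof_rsq   <- sd(oof_rsq)    # spread across the folds
```

The mean and standard deviation across folds play the same role as the mean.oof.RSq and sd values that earth reports in its summary.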

The summary of the model object shows seven terms in total (an intercept plus six hinge terms), built from three of the five features:

> summary(earth_fit)
Call: earth(x=pca_scores[,1:5], y=pca_scores[,6], pmethod="cv", degree=1, nfold=10,
minspan=-1)

            coefficients
(Intercept)      174.182
h(0.1-PC1)       -26.380
h(PC1-0.1)        33.806
h(0.01-PC2)      -13.181
h(PC2-0.01)       13.842
h(0.02-PC5)        1.333
h(PC5-0.02)       -0.869

Selected 7 of 7 terms, and 3 of 5 predictors using pmethod="cv"
Termination condition: RSq changed by less than 0.001 at 7 terms
Importance: PC1, PC2, PC5, PC3-unused, PC4-unused
Number of terms at each degree of interaction: 1 6 (additive model)
GRSq 0.9518 RSq 0.952 mean.oof.RSq 0.9512 (sd 0.0151)

pmethod="backward" would have selected the same model:
7 terms 3 preds, GRSq 0.9518 RSq 0.952 mean.oof.RSq 0.9512
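Each h() term in the summary is a hinge function: it is zero on one side of its knot and linear on the other. The following sketch rebuilds the fitted equation by hand from the printed coefficients. The helper names h and predict_by_hand are ours, and because the summary prints rounded knots (0.1, 0.01, 0.02), the result only approximates what predict() would return.

```r
# Hinge function: 0 when z is negative, z otherwise
h <- function(z) pmax(0, z)

# Coefficients and rounded knots copied from the summary output above
predict_by_hand <- function(PC1, PC2, PC5) {
  174.182 -
    26.380 * h(0.1 - PC1) + 33.806 * h(PC1 - 0.1) -
    13.181 * h(0.01 - PC2) + 13.842 * h(PC2 - 0.01) +
    1.333 * h(0.02 - PC5) - 0.869 * h(PC5 - 0.02)
}

# With every component sitting on its knot, only the intercept remains
at_knots <- predict_by_hand(PC1 = 0.1, PC2 = 0.01, PC5 = 0.02)

# One unit above the PC1 knot activates a single hinge:
# 174.182 + 33.806 * 1 = 207.988
above_pc1_knot <- predict_by_hand(PC1 = 1.1, PC2 = 0.01, PC5 = 0.02)
```

Note that the two PC1 slopes differ in size (26.380 below the knot, 33.806 above it), which is the kink that plotmo draws for that component.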

The model achieves a training R-squared of 0.952 using only components 1, 2, and 5, and the mean out-of-fold R-squared of 0.9512 is nearly identical. It can be a little easier to see the hinge functions at play with plotmo:

> plotmo::plotmo(earth_fit)

The output of the preceding code is as follows:

Apart from PC1, those subtle hinge functions are hard to discern in the plotmo output. To see how this model really performs, save the predicted values and residuals, then plot the predictions against the actual weights:
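The plotting code refers to earthpred and earthresid columns, but the step that creates them is not shown here. A minimal sketch of one way to do it follows, using a hypothetical toy_scores data frame and toy_fit model in place of pca_scores and earth_fit; the column layout (five components, then weight) is an assumption based on the earlier call.

```r
set.seed(1492)

# Hypothetical stand-in for pca_scores: five components plus weight
n <- 300
toy_scores <- data.frame(
  PC1 = rnorm(n), PC2 = rnorm(n), PC3 = rnorm(n),
  PC4 = rnorm(n), PC5 = rnorm(n)
)
toy_scores$weight <-
  175 + 30 * toy_scores$PC1 + 13 * toy_scores$PC2 + rnorm(n, sd = 7)

toy_fit <- earth::earth(
  x = toy_scores[, 1:5],
  y = toy_scores[, 6],
  degree = 1
)

# predict() and residuals() return one-column matrices for an earth
# model, so flatten them before storing them as data frame columns
toy_scores$earthpred  <- as.numeric(predict(toy_fit))
toy_scores$earthresid <- as.numeric(residuals(toy_fit))
```

Applied to our objects, the same two assignments on pca_scores with earth_fit would give the columns used in the plots.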

> ggplot2::ggplot(pca_scores, ggplot2::aes(x = earthpred, y = weight)) +
    ggplot2::geom_point() +
    ggplot2::stat_smooth(method = "lm", se = FALSE) +
    ggthemes::theme_pander()

The output of the preceding code is as follows:

We see a nice linear relationship, but several outliers make us scratch our heads. In one case, the model predicts a weight of almost 225 pounds for a soldier recorded at less than 125 pounds. Something isn't right with those observations, perhaps a measurement or data entry error; they are interesting nonetheless and worthy of further investigation, time permitting.

How about the residuals?

> ggplot2::ggplot(pca_scores, ggplot2::aes(x = earthpred, y = earthresid)) +
    ggplot2::geom_point() +
    ggplot2::stat_smooth(method = "loess", se = FALSE) +
    ggthemes::theme_few()

The output of the preceding code is as follows:

There is just the slightest curvilinear relationship: the algorithm slightly underestimates soldiers' weight at the extreme values. We already have R-squared, but RMSE and MAE are quickly callable:

> caret::postResample(pred = pca_scores$earthpred,
                      obs = pca_scores$weight)
    RMSE Rsquared      MAE
   7.336    0.952    5.219
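These three metrics are simple enough to reproduce by hand, which helps when interpreting them. The sketch below uses small hypothetical obs and pred vectors rather than our data, and it takes R-squared to be the squared correlation between predictions and observations, which is how we understand caret to compute it.

```r
# Hypothetical observed and predicted weights, in pounds
obs  <- c(150, 180, 200, 165, 175)
pred <- c(155, 176, 190, 170, 175)

errors <- pred - obs

rmse <- sqrt(mean(errors^2))  # root mean squared error
mae  <- mean(abs(errors))     # mean absolute error
rsq  <- cor(pred, obs)^2      # squared correlation

round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 3)
```

Both RMSE and MAE are expressed in the units of the response, so here they read as pounds. RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.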

The mean absolute error is just over 5 pounds. Let's see if this holds on the test data.
