There are many models to which you can fit your data, from classical statistical models to modern machine learning methods, and a thorough exploration of the R packages that support them is well beyond the scope of this book. The main concern when choosing and fitting models is not the syntax, and this book is, after all, a syntax reference. We will look at two packages that aim to provide a tidy interface to models.
broom
When you fit a model, you get an object in return that holds information about the data and the fit. How this information is represented depends on the function used to fit the data. For a linear model, for example, we get this information:
model <- lm(disp ~ hp + wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = disp ~ hp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -82.565 -23.802   2.111  35.731  99.107
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129.9506    29.1890  -4.452 0.000116
## hp             0.6578     0.1649   3.990 0.000411
## wt            82.1125    11.5518   7.108 8.04e-08
##
## (Intercept) ***
## hp ***
## wt ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.35 on 29 degrees of freedom
## Multiple R-squared: 0.8635, Adjusted R-squared: 0.8541
## F-statistic: 91.71 on 2 and 29 DF, p-value: 2.889e-13
The problem with this representation is that it can be difficult to extract the relevant data because the data isn't tidy. The broom package fixes this. It defines three generic functions, tidy(), glance(), and augment(), that all return tibbles. The first gives you the fit, the second a summary of how good the fit is, and the third the original data augmented with fit summaries.
tidy(model) # transform to tidy tibble
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -130. 29.2 -4.45 1.16e-4
## 2 hp 0.658 0.165 3.99 4.11e-4
## 3 wt 82.1 11.6 7.11 8.04e-8
glance(model) # model summaries
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.863 0.854 47.3 91.7 2.89e-13
## # … with 7 more variables: df <dbl>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, nobs <int>
augment(model) # add model info to data
## # A tibble: 32 × 10
## .rownames disp hp wt .fitted .resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 160 110 2.62 158. 2.45
## 2 Mazda RX4 Wag 160 110 2.88 178. -18.5
## 3 Datsun 710 108 93 2.32 122. -13.7
## 4 Hornet 4 Drive 258 110 3.22 206. 51.6
## 5 Hornet Sporta… 360 175 3.44 268. 92.4
## 6 Valiant 225 105 3.46 223. 1.77
## 7 Duster 360 360 245 3.57 324. 35.6
## 8 Merc 240D 147. 62 3.19 173. -26.1
## 9 Merc 230 141. 95 3.15 191. -50.4
## 10 Merc 280 168. 123 3.44 233. -65.8
## # … with 22 more rows, and 4 more variables:
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>
The broom package implements specializations for most models. Not all three functions are meaningful for every model, so some models only support a subset of them. If you want your own model, let us call it mymodel, to work with broom, you have to implement the specializations tidy.mymodel(), glance.mymodel(), and augment.mymodel() that are relevant for the model.
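As a sketch of what such a specialization might look like, here is a hypothetical mymodel class (its constructor and its estimate/se fields are invented for illustration) with a tidy() method; glance.mymodel() and augment.mymodel() would follow the same pattern:

```r
library(tibble)

# A toy constructor for a hypothetical "mymodel" class; the class
# name and its fields are invented for this sketch.
mymodel <- function(estimate, se) {
  structure(list(estimate = estimate, se = se), class = "mymodel")
}

# broom's tidy() is an S3 generic, so providing a tidy.mymodel()
# method is what makes tidy(fit) work for objects of this class.
tidy.mymodel <- function(x, ...) {
  tibble(
    term      = names(x$estimate),
    estimate  = unname(x$estimate),
    std.error = unname(x$se)
  )
}

fit <- mymodel(c(a = 1.5, b = -0.3), se = c(a = 0.2, b = 0.1))
tidy.mymodel(fit)  # with broom loaded, simply: tidy(fit)
```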
modelr
The modelr package also provides functionality for fitting and inspecting models and for extracting information about model fits. We start with the latter. Consider the following example model:
# Build a model where variable x can help us predict response y
dat <- tibble(
  x = runif(50),
  y = 15 * x^2 + 42 + rnorm(50)
)
# Fit a linear model to the data (even though y is quadratic in x)
model <- lm(y ~ x, data = dat)
tidy(model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercep… 38.9 0.522 74.5 2.82e-51
## 2 x 15.1 0.829 18.2 3.67e-23
We have fitted a linear model to data that is sampled from a quadratic model. The details of the data and model do not matter; it is just an example. I have used the broom function tidy() to inspect the model fit.
Two functions, add_predictions() and add_residuals(), extend your data with the predictions the model makes for each data row and with the residuals for each row.
add_predictions(dat, model)
## # A tibble: 50 × 3
## x y pred
## <dbl> <dbl> <dbl>
## 1 0.633 45.7 48.5
## 2 0.385 43.4 44.7
## 3 0.566 46.1 47.4
## 4 0.922 53.5 52.8
## 5 0.976 56.0 53.6
## 6 0.933 54.1 53.0
## 7 0.381 43.4 44.7
## 8 0.256 43.7 42.8
## 9 0.257 42.0 42.8
## 10 0.197 42.2 41.9
## # … with 40 more rows
add_residuals(dat, model)
## # A tibble: 50 × 3
## x y resid
## <dbl> <dbl> <dbl>
## 1 0.633 45.7 -2.76
## 2 0.385 43.4 -1.27
## 3 0.566 46.1 -1.32
## 4 0.922 53.5 0.690
## 5 0.976 56.0 2.37
## 6 0.933 54.1 1.10
## 7 0.381 43.4 -1.24
## 8 0.256 43.7 0.890
## 9 0.257 42.0 -0.797
## 10 0.197 42.2 0.281
## # … with 40 more rows
Predictions need not be for existing data. You can create a data frame of explanatory variables and predict the response variable for the new data.
new_dat <- tibble(x = seq(0, 1, length.out = 5))
add_predictions(new_dat, model)
## # A tibble: 5 × 2
## x pred
## <dbl> <dbl>
## 1 0 38.9
## 2 0.25 42.7
## 3 0.5 46.5
## 4 0.75 50.2
## 5 1 54.0
I know that the x values are in the range from zero to one, but we cannot always know a priori the range that a variable falls within. If you don't know, you can use the seq_range() function to get equidistant points spanning the lowest to the highest value in your data.
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
seq_range(dat$x, n = 5) # over the range of observations
## [1] 0.02272462 0.26437738 0.50603015 0.74768292
## [5] 0.98933569
If you have two models and want to know how they compare with respect to their predictions, you can use gather_predictions() and spread_predictions():
# comparing the line to a (better) model y ~ x^2 + x + 1
model2 <- lm(y ~ I(x^2) + x, data = dat)
gather_predictions(new_dat, model, model2)
## # A tibble: 10 × 3
## model x pred
## <chr> <dbl> <dbl>
## 1 model 0 38.9
## 2 model 0.25 42.7
## 3 model 0.5 46.5
## 4 model 0.75 50.2
## 5 model 1 54.0
## 6 model2 0 43.2
## 7 model2 0.25 42.3
## 8 model2 0.5 44.2
## 9 model2 0.75 49.0
## 10 model2 1 56.8
spread_predictions(new_dat, model, model2)
## # A tibble: 5 × 3
## x model model2
## <dbl> <dbl> <dbl>
## 1 0 38.9 43.2
## 2 0.25 42.7 42.3
## 3 0.5 46.5 44.2
## 4 0.75 50.2 49.0
## 5 1 54.0 56.8
They show the same data, just formatted differently. The names gather and spread resemble the tidyr functions pivot_longer() and pivot_wider(); the names are taken from tidyr's gather() and spread(), the deprecated predecessors of the pivot functions.
Earlier, we made predictions on new data, but you can, of course, also do it on your original data.
gather_predictions(dat, model, model2)
## # A tibble: 100 × 4
## model x y pred
## <chr> <dbl> <dbl> <dbl>
## 1 model 0.633 45.7 48.5
## 2 model 0.385 43.4 44.7
## 3 model 0.566 46.1 47.4
## 4 model 0.922 53.5 52.8
## 5 model 0.976 56.0 53.6
## 6 model 0.933 54.1 53.0
## 7 model 0.381 43.4 44.7
## 8 model 0.256 43.7 42.8
## 9 model 0.257 42.0 42.8
## 10 model 0.197 42.2 41.9
## # … with 90 more rows
spread_predictions(dat, model, model2)
## # A tibble: 50 × 4
## x y model model2
## <dbl> <dbl> <dbl> <dbl>
## 1 0.633 45.7 48.5 46.4
## 2 0.385 43.4 44.7 42.9
## 3 0.566 46.1 47.4 45.2
## 4 0.922 53.5 52.8 54.1
## 5 0.976 56.0 53.6 55.9
## 6 0.933 54.1 53.0 54.4
## 7 0.381 43.4 44.7 42.9
## 8 0.256 43.7 42.8 42.3
## 9 0.257 42.0 42.8 42.3
## 10 0.197 42.2 41.9 42.2
## # … with 40 more rows
If you have the original data, you can also get residuals.
gather_residuals(dat, model, model2)
## # A tibble: 100 × 4
## model x y resid
## <chr> <dbl> <dbl> <dbl>
## 1 model 0.633 45.7 -2.76
## 2 model 0.385 43.4 -1.27
## 3 model 0.566 46.1 -1.32
## 4 model 0.922 53.5 0.690
## 5 model 0.976 56.0 2.37
## 6 model 0.933 54.1 1.10
## 7 model 0.381 43.4 -1.24
## 8 model 0.256 43.7 0.890
## 9 model 0.257 42.0 -0.797
## 10 model 0.197 42.2 0.281
## # … with 90 more rows
spread_residuals(dat, model, model2)
## # A tibble: 50 × 4
## x y model model2
## <dbl> <dbl> <dbl> <dbl>
## 1 0.633 45.7 -2.76 -0.713
## 2 0.385 43.4 -1.27 0.500
## 3 0.566 46.1 -1.32 0.938
## 4 0.922 53.5 0.690 -0.552
## 5 0.976 56.0 2.37 0.0826
## 6 0.933 54.1 1.10 -0.348
## 7 0.381 43.4 -1.24 0.505
## 8 0.256 43.7 0.890 1.39
## 9 0.257 42.0 -0.797 -0.275
## 10 0.197 42.2 0.281 -0.0587
## # … with 40 more rows
Depending on the type of data science you usually do, you might have to sample to get empirical distributions or to split your data into training and test data to avoid overfitting. With modelr, you have functions for this.
You can build n data sets using bootstrapping with the bootstrap() function.
bootstrap(dat, 3)
## # A tibble: 3 × 2
## strap .id
## <list> <chr>
## 1 <resample [50 x 2]> 1
## 2 <resample [50 x 2]> 2
## 3 <resample [50 x 2]> 3
It samples data points with replacement and creates n new data sets this way. The resulting tibble has two columns: the first, strap, holds the data for each sample, and the second, .id, identifies the sample.
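Each entry in the strap column is a resample object rather than a copy of the data; it stores the original data plus the sampled row indices, and as.data.frame() materializes the actual rows when you need them. A small sketch, reusing dat from above:

```r
library(modelr)
library(tibble)

# Same shape as the example data used in this chapter
dat <- tibble(x = runif(50), y = 15 * x^2 + 42 + rnorm(50))

boot <- bootstrap(dat, 3)
first <- boot$strap[[1]]  # a resample object, not a data frame
as.data.frame(first)      # materialize the 50 sampled rows
```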
The crossv_mc() function, for Monte Carlo cross-validation, creates cross-validation data, that is, it splits your data into training and test data. It creates n random splits into test and training data.
crossv_mc(dat, 3)
## # A tibble: 3 × 3
## train test .id
## <list> <list> <chr>
## 1 <resample [39 x 2]> <resample [11 x 2]> 1
## 2 <resample [39 x 2]> <resample [11 x 2]> 2
## 3 <resample [39 x 2]> <resample [11 x 2]> 3
By default, the test data is 20% of the sampled data; you can change this using the test argument.
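For example, an even split between training and test data could look like this (reusing dat from above):

```r
library(modelr)
library(tibble)

# Same shape as the example data used in this chapter
dat <- tibble(x = runif(50), y = 15 * x^2 + 42 + rnorm(50))

# Five Monte Carlo splits, each holding out half the data for testing
cv <- crossv_mc(dat, n = 5, test = 0.5)
dim(cv$test[[1]])  # each test resample now covers half of the 50 rows
```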
The crossv_kfold() and crossv_loo() functions give you k-fold and leave-one-out data, respectively.
crossv_kfold(dat, k = 3)
## # A tibble: 3 × 3
## train test .id
## <named list> <named list> <chr>
## 1 <resample [33 x 2]> <resample [17 x 2]> 1
## 2 <resample [33 x 2]> <resample [17 x 2]> 2
## 3 <resample [34 x 2]> <resample [16 x 2]> 3
crossv_loo(dat)
## # A tibble: 50 × 3
## train test .id
## <named list> <named list> <int>
## 1 <resample [49 x 2]> <resample [1 x 2]> 1
## 2 <resample [49 x 2]> <resample [1 x 2]> 2
## 3 <resample [49 x 2]> <resample [1 x 2]> 3
## 4 <resample [49 x 2]> <resample [1 x 2]> 4
## 5 <resample [49 x 2]> <resample [1 x 2]> 5
## 6 <resample [49 x 2]> <resample [1 x 2]> 6
## 7 <resample [49 x 2]> <resample [1 x 2]> 7
## 8 <resample [49 x 2]> <resample [1 x 2]> 8
## 9 <resample [49 x 2]> <resample [1 x 2]> 9
## 10 <resample [49 x 2]> <resample [1 x 2]> 10
## # … with 40 more rows
As an example, say you have sampled three bootstrap data sets. The samples are in the strap column, so we can map over it and fit a linear model to each sampled data set.
samples <- bootstrap(dat, 3)
fitted_models <- samples |>
mutate(
# Map over all the bootstrap samples and fit each of them
fits = strap |> map(\(dat) lm(y ~ x, data = dat))
)
fitted_models
## # A tibble: 3 × 3
## strap .id fits
## <list> <chr> <list>
## 1 <resample [50 x 2]> 1 <lm>
## 2 <resample [50 x 2]> 2 <lm>
## 3 <resample [50 x 2]> 3 <lm>
Printing one of the fitted models shows the usual lm() output:
fitted_models$fits[[1]]
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Coefficients:
## (Intercept) x
## 38.46 15.54
Then we can map over the three models and inspect them using broom's glance() function:
fitted_models$fits |>
map(glance) |>
bind_rows()
## # A tibble: 3 × 12
## r.squared adj.r.squared sigma statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.871 0.869 1.85 325. 5.07e-23
## 2 0.813 0.809 1.89 209. 4.12e-19
## 3 0.904 0.902 1.58 451. 4.63e-26
## # … with 7 more variables: df <dbl>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, nobs <int>
If we were interested in the empirical distribution of the estimated coefficient for x, we could extract the estimates from the models fitted to the bootstrapped data and go from there.
get_x <- function(m) {
  tidy(m) |>
    filter(term == "x") |>
    pull(estimate)
}
fitted_models$fits |> map_dbl(get_x)
## [1] 15.54085 13.25745 15.50705
If you want to compare models, rather than samples of your data, then modelr has support for that as well. You can make a list of the formulae you want to fit; the formulae() function lets you create such a list.
models <- formulae(~y, linear = ~x, quadratic = ~I(x^2) + x)
The first argument is the response variable (the left-hand side of the formulae), and the remaining, named arguments describe the explanatory variables, the right-hand side of a model formula.
If you call fit_with() with your data, the fitting function to use (here lm()), and the formulae you wish to fit, then you get what you want: a fit for each formula.
fits <- fit_with(dat, lm, models)
fits |> map(glance) |> bind_rows()
## # A tibble: 2 × 12
## r.squared adj.r.squared sigma statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.873 0.871 1.83 330. 3.67e-23
## 2 0.983 0.983 0.674 1379. 1.86e-42
## # … with 7 more variables: df <dbl>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, nobs <int>
You will find many model quality measures in modelr, for example, the root mean square error:
fits |> map_dbl(rmse, data = dat)
## linear quadratic
## 1.7973686 0.6533399
the mean absolute error:
fits |> map_dbl(mae, data = dat)
## linear quadratic
## 1.5265158 0.5116268
and many more.
Since overfitting is always a problem, you might want to use a quality measure that at least attempts to take model complexity into account. You find some in the glance() function from broom.
fits |> map_dbl(~ glance(.x)$AIC)
## linear quadratic
## 206.5262 107.3281
If at all possible, however, you want to use test data to measure how well a model generalizes. For this, you first fit your models to the training data and then make predictions on the test data. In the following example, I have fitted lm(y ~ x) on leave-one-out data and then applied it to the test data, measuring the quality of the generalization using RMSE.
samples <- dat |> crossv_loo()
training_fits <- samples$train |> map(~lm(y ~ x, data = .))
training_fits |> map2_dbl(samples$test, rmse) |> head(10)
## 1 2 3 4 5
## 2.8223916 1.3057640 1.3454227 0.7254241 2.5129973
## 6 7 8 9 10
## 1.1552629 1.2763827 0.9248123 0.8273398 0.2941659