© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. Mailund, R 4 Data Science Quick Reference, https://doi.org/10.1007/978-1-4842-8780-4_12

12. Working with Models: broom and modelr

Thomas Mailund, Aarhus, Denmark

There are many models to which you can fit your data, from classical statistical models to modern machine learning methods, and a thorough exploration of the R packages that support them is well beyond the scope of this book. The main concern when choosing and fitting models is not the syntax, and this book is, after all, a syntax reference. Instead, we will look at two packages that aim to provide a tidy interface to models.

The two packages, broom and modelr, are not loaded as part of the tidyverse, so you must load them individually.
library(broom)
library(modelr)

broom

When you fit a model, you get an object in return that holds information about the data and the fit. How this information is represented depends on the implementation of the function used to fit the model. For a linear model, for example, we get this information:
model <- lm(disp ~ hp + wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = disp ~ hp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -82.565 -23.802   2.111  35.731  99.107
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129.9506    29.1890  -4.452 0.000116
## hp             0.6578     0.1649   3.990 0.000411
## wt            82.1125    11.5518   7.108 8.04e-08
##
## (Intercept) ***
## hp          ***
## wt          ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.35 on 29 degrees of freedom
## Multiple R-squared:  0.8635, Adjusted R-squared:  0.8541
## F-statistic: 91.71 on 2 and 29 DF,  p-value: 2.889e-13
The problem with this representation is that it can be difficult to extract the relevant values because the data isn’t tidy. The broom package fixes this. It defines three generic functions, tidy(), glance(), and augment(), that all return tibbles. The first gives you the estimated coefficients, the second a one-row summary of how good the fit is, and the third the original data augmented with per-observation fit statistics.
tidy(model) # transform to tidy tibble
## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -130.       29.2       -4.45 1.16e-4
## 2 hp             0.658     0.165      3.99 4.11e-4
## 3 wt            82.1      11.6        7.11 8.04e-8
glance(model) # model summaries
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl>
## 1     0.863         0.854  47.3      91.7 2.89e-13
## # … with 7 more variables: df <dbl>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>, nobs <int>
augment(model) # add model info to data
## # A tibble: 32 × 10
##    .rownames        disp    hp    wt .fitted .resid
##    <chr>           <dbl> <dbl> <dbl>   <dbl>  <dbl>
##  1 Mazda RX4        160    110  2.62    158.   2.45
##  2 Mazda RX4 Wag    160    110  2.88    178. -18.5
##  3 Datsun 710       108     93  2.32    122. -13.7
##  4 Hornet 4 Drive   258    110  3.22    206.  51.6
##  5 Hornet Sporta…   360    175  3.44    268.  92.4
##  6 Valiant          225    105  3.46    223.   1.77
##  7 Duster 360       360    245  3.57    324.  35.6
##  8 Merc 240D        147.    62  3.19    173. -26.1
##  9 Merc 230         141.    95  3.15    191. -50.4
## 10 Merc 280         168.   123  3.44    233. -65.8
## # … with 22 more rows, and 4 more variables:
## #    .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## #    .std.resid <dbl>

The broom package implements specializations for most models. Not all three functions are meaningful for every model, so some models only support a subset of them. If you want your own model class, let us call it mymodel, to work with broom, you have to implement the specializations tidy.mymodel(), glance.mymodel(), and augment.mymodel() that are relevant for the model.
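As a sketch, a tidy() specialization is simply an S3 method that returns a tibble with one row per estimated term. Here the class name mymodel and the coefficients field are hypothetical; adapt them to however your model object stores its fit.
# Hypothetical: make objects of class "mymodel" work with broom::tidy()
tidy.mymodel <- function(x, ...) {
  tibble::tibble(
    term     = names(x$coefficients), # assumes the fit stores named coefficients
    estimate = x$coefficients
  )
}
Because tidy() is a generic, calling tidy(fit) on an object of class "mymodel" dispatches to tidy.mymodel() automatically.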

modelr

The modelr package also provides functionality for fitting and inspecting models and for extracting information about model fits. We start with the latter.

Consider the following example model:
# Build a model where variable x can help us predict response y
dat <- tibble(
  x = runif(50),
  y = 15 * x^2 * x + 42 + rnorm(50)
)
# Fit a linear model to the data (even though y is not linear in x)
model <- lm(y ~ x, data = dat)
tidy(model)
## # A tibble: 2 × 5
##   term       estimate std.error statistic p.value
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercep…    38.9     0.522      74.5 2.82e-51
## 2 x              15.1     0.829      18.2 3.67e-23

We have fitted a straight line to data that is not linear in x. The details of the data and model do not matter; it is just an example. I have used the broom function tidy() to inspect the fit.

Two functions, add_predictions() and add_residuals(), extend your data with the predictions the model makes for each data row and the residuals for each row.
add_predictions(dat, model)
## # A tibble: 50 × 3
##        x     y  pred
##    <dbl> <dbl> <dbl>
##  1 0.633  45.7  48.5
##  2 0.385  43.4  44.7
##  3 0.566  46.1  47.4
##  4 0.922  53.5  52.8
##  5 0.976  56.0  53.6
##  6 0.933  54.1  53.0
##  7 0.381  43.4  44.7
##  8 0.256  43.7  42.8
##  9 0.257  42.0  42.8
## 10 0.197  42.2  41.9
## # … with 40 more rows
add_residuals(dat, model)
## # A tibble: 50 × 3
##        x     y  resid
##    <dbl> <dbl>  <dbl>
##  1 0.633  45.7 -2.76
##  2 0.385  43.4 -1.27
##  3 0.566  46.1 -1.32
##  4 0.922  53.5  0.690
##  5 0.976  56.0  2.37
##  6 0.933  54.1  1.10
##  7 0.381  43.4 -1.24
##  8 0.256  43.7  0.890
##  9 0.257  42.0 -0.797
## 10 0.197  42.2  0.281
## # … with 40 more rows
Predictions need not be for existing data. You can create a data frame of explanatory variables and predict the response variable from the new data.
new_dat <- tibble(x = seq(0, 1, length.out = 5))
add_predictions(new_dat, model)
## # A tibble: 5 × 2
##        x  pred
##    <dbl> <dbl>
## 1   0     38.9
## 2   0.25  42.7
## 3   0.5   46.5
## 4   0.75  50.2
## 5   1     54.0
Here I know that the x values are in the range from zero to one, but we cannot always know a priori the range a variable falls within. If you don’t know, you can use the seq_range() function to get equidistant points spanning the lowest to the highest value in your data.
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
seq_range(dat$x, n = 5) # over the range of observations
## [1] 0.02272462 0.26437738 0.50603015 0.74768292
## [5] 0.98933569
If you have two models and want to know how they compare with respect to their predictions, you can use gather_predictions() and spread_predictions():
# Compare the straight line to a (better) model that is quadratic in x
model2 <- lm(y ~ I(x^2) + x, data = dat)
gather_predictions(new_dat, model, model2)
## # A tibble: 10 × 3
##    model      x  pred
##    <chr>  <dbl> <dbl>
##  1 model   0     38.9
##  2 model   0.25  42.7
##  3 model   0.5   46.5
##  4 model   0.75  50.2
##  5 model   1     54.0
##  6 model2  0     43.2
##  7 model2  0.25  42.3
##  8 model2  0.5   44.2
##  9 model2  0.75  49.0
## 10 model2  1     56.8
spread_predictions(new_dat, model, model2)
## # A tibble: 5 × 3
##       x model model2
##   <dbl> <dbl>  <dbl>
## 1  0     38.9   43.2
## 2  0.25  42.7   42.3
## 3  0.5   46.5   44.2
## 4  0.75  50.2   49.0
## 5  1     54.0   56.8

They show the same data, just formatted differently. The names gather and spread resemble the tidyr functions pivot_longer() and pivot_wider(); in fact, they are taken from tidyr’s gather() and spread(), the now-deprecated predecessors of the pivot functions.
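One reason to prefer the long (gathered) format is plotting. As a sketch, with ggplot2 you can draw both models’ predictions over a grid of x values in a single plot, coloured by model:
library(ggplot2)
grid <- tibble(x = seq_range(dat$x, n = 100))
preds <- gather_predictions(grid, model, model2)
ggplot(dat, aes(x, y)) +
  geom_point() +  # the observed data
  geom_line(data = preds, aes(y = pred, colour = model))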

Earlier, we made predictions on new data, but you can, of course, also do it on your original data.
gather_predictions(dat, model, model2)
## # A tibble: 100 × 4
##    model     x     y  pred
##    <chr> <dbl> <dbl> <dbl>
##  1 model 0.633  45.7  48.5
##  2 model 0.385  43.4  44.7
##  3 model 0.566  46.1  47.4
##  4 model 0.922  53.5  52.8
##  5 model 0.976  56.0  53.6
##  6 model 0.933  54.1  53.0
##  7 model 0.381  43.4  44.7
##  8 model 0.256  43.7  42.8
##  9 model 0.257  42.0  42.8
## 10 model 0.197  42.2  41.9
## # … with 90 more rows
spread_predictions(dat, model, model2)
## # A tibble: 50 × 4
##        x     y model model2
##    <dbl> <dbl> <dbl>  <dbl>
##  1 0.633  45.7  48.5   46.4
##  2 0.385  43.4  44.7   42.9
##  3 0.566  46.1  47.4   45.2
##  4 0.922  53.5  52.8   54.1
##  5 0.976  56.0  53.6   55.9
##  6 0.933  54.1  53.0   54.4
##  7 0.381  43.4  44.7   42.9
##  8 0.256  43.7  42.8   42.3
##  9 0.257  42.0  42.8   42.3
## 10 0.197  42.2  41.9   42.2
## # … with 40 more rows
If you have the original data, you can also get residuals.
gather_residuals(dat, model, model2)
## # A tibble: 100 × 4
##    model     x     y  resid
##    <chr> <dbl> <dbl>  <dbl>
##  1 model 0.633  45.7 -2.76
##  2 model 0.385  43.4 -1.27
##  3 model 0.566  46.1 -1.32
##  4 model 0.922  53.5  0.690
##  5 model 0.976  56.0  2.37
##  6 model 0.933  54.1  1.10
##  7 model 0.381  43.4 -1.24
##  8 model 0.256  43.7  0.890
##  9 model 0.257  42.0 -0.797
## 10 model 0.197  42.2  0.281
## # … with 90 more rows
spread_residuals(dat, model, model2)
## # A tibble: 50 × 4
##        x     y  model  model2
##    <dbl> <dbl>  <dbl>   <dbl>
##  1 0.633  45.7 -2.76  -0.713
##  2 0.385  43.4 -1.27   0.500
##  3 0.566  46.1 -1.32   0.938
##  4 0.922  53.5  0.690 -0.552
##  5 0.976  56.0  2.37   0.0826
##  6 0.933  54.1  1.10  -0.348
##  7 0.381  43.4 -1.24   0.505
##  8 0.256  43.7  0.890  1.39
##  9 0.257  42.0 -0.797 -0.275
## 10 0.197  42.2  0.281 -0.0587
## # … with 40 more rows

Depending on the type of data science you usually do, you might have to sample to get empirical distributions or to split your data into training and test data to avoid overfitting. With modelr, you have functions for this.

You can build n data sets using bootstrapping with the bootstrap() function.
bootstrap(dat, n = 3)
## # A tibble: 3 × 2
##   strap               .id
##   <list>              <chr>
## 1 <resample [50 x 2]> 1
## 2 <resample [50 x 2]> 2
## 3 <resample [50 x 2]> 3

It samples data points with replacement and creates n new data sets this way. The resulting tibble has two columns: the first, strap, contains the data for each sample, and the second, .id, identifies the sample.
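Each entry in the strap column is a resample object, which stores a reference to the original data together with the sampled row indices rather than a copy of the rows. If you want to work with a sample directly, you can convert it to a data frame:
boots <- bootstrap(dat, n = 3)
as.data.frame(boots$strap[[1]]) # materialize the first bootstrap sample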

The crossv_mc() function, for Monte Carlo cross-validation, creates cross-validation data; that is, it splits your data into training and test data. It creates n random splits of the data into training and test sets.
crossv_mc(dat, n = 3)
## # A tibble: 3 × 3
##   train               test                .id
##   <list>              <list>              <chr>
## 1 <resample [39 x 2]> <resample [11 x 2]> 1
## 2 <resample [39 x 2]> <resample [11 x 2]> 2
## 3 <resample [39 x 2]> <resample [11 x 2]> 3

By default, the test data is 20% of the sampled data; you can change this using the test argument.
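For example, to put 30% of the rows in each test set instead:
crossv_mc(dat, n = 3, test = 0.3)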

The crossv_kfold() and crossv_loo() functions give you k-fold and leave-one-out data, respectively.
crossv_kfold(dat, k = 3)
## # A tibble: 3 × 3
##   train               test                .id
##   <named list>        <named list>        <chr>
## 1 <resample [33 x 2]> <resample [17 x 2]> 1
## 2 <resample [33 x 2]> <resample [17 x 2]> 2
## 3 <resample [34 x 2]> <resample [16 x 2]> 3
crossv_loo(dat)
## # A tibble: 50 × 3
##    train               test                .id
##    <named list>        <named list>      <int>
##  1 <resample [49 x 2]> <resample [1 x 2]>    1
##  2 <resample [49 x 2]> <resample [1 x 2]>    2
##  3 <resample [49 x 2]> <resample [1 x 2]>    3
##  4 <resample [49 x 2]> <resample [1 x 2]>    4
##  5 <resample [49 x 2]> <resample [1 x 2]>    5
##  6 <resample [49 x 2]> <resample [1 x 2]>    6
##  7 <resample [49 x 2]> <resample [1 x 2]>    7
##  8 <resample [49 x 2]> <resample [1 x 2]>    8
##  9 <resample [49 x 2]> <resample [1 x 2]>    9
## 10 <resample [49 x 2]> <resample [1 x 2]>   10
## # … with 40 more rows
As an example, say you have sampled three bootstrap data sets. The samples are in the strap column, so we can map over it and fit a linear model to each sampled data set.
samples <- bootstrap(dat, 3)
fitted_models <- samples |>
  mutate(
    # Map over all the bootstrap samples and fit each of them
    fits = strap |> map(\(dat) lm(y ~ x, data = dat))
  )
fitted_models
## # A tibble: 3 × 3
##   strap               .id   fits
##   <list>              <chr> <list>
## 1 <resample [50 x 2]> 1     <lm>
## 2 <resample [50 x 2]> 2     <lm>
## 3 <resample [50 x 2]> 3     <lm>
fitted_models$fits[[1]]
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Coefficients:
## (Intercept)            x
##       38.46        15.54
Then we can map over the three models and inspect them using broom’s glance() function:
fitted_models$fits |>
  map(glance) |>
  bind_rows()
## # A tibble: 3 × 12
##   r.squared adj.r.squared sigma statistic  p.value
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl>
## 1     0.871         0.869  1.85      325. 5.07e-23
## 2     0.813         0.809  1.89      209. 4.12e-19
## 3     0.904         0.902  1.58      451. 4.63e-26
## # … with 7 more variables: df <dbl>,
## #   logLik <dbl>,  AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>, nobs <int>
If we were interested in the empirical distribution of the coefficient for x, we could extract the estimate from each of the bootstrapped fits and go from there.
get_x <- function(m) {
  tidy(m) |> filter(term == "x") |> pull(estimate)
}
fitted_models$fits |> map_dbl(get_x)
## [1] 15.54085 13.25745 15.50705
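From here we could go further; for example, a rough 95% percentile confidence interval for the coefficient can be obtained by drawing many more bootstrap samples and taking quantiles of the estimates (a sketch):
boot_fits <- bootstrap(dat, 1000)$strap |>
  map(\(s) lm(y ~ x, data = s))
boot_fits |> map_dbl(\(m) coef(m)[["x"]]) |>
  quantile(c(0.025, 0.975)) # 95% percentile interval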
If you want to compare models, rather than samples of your data, then modelr supports that as well. You can make a list of the formulae you want to fit; the formulae() function lets you create such a list.
models <- formulae(~y, linear = ~x, quadratic = ~I(x^2) + x)

The first argument is the response variable (the left-hand side of the formulae), and the remaining arguments are (named) parameters that describe the explanatory variables, the right-hand part of a model formula.

If you call fit_with() with your data, the fitting function to use (here lm()), and the formulae you wish to fit, then you get what you want—a fit for each formula.
fits <- fit_with(dat, lm, models)
fits |> map(glance) |> bind_rows()
## # A tibble: 2 × 12
##   r.squared adj.r.squared sigma statistic   p.value
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl>
## 1     0.873         0.871 1.83       330.  3.67e-23
## 2     0.983         0.983 0.674      1379. 1.86e-42
## # … with 7 more variables: df <dbl>,
## #   logLik <dbl>,  AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>, nobs <int>
You will find many model quality measures in modelr, for example, the root mean square error:
fits |> map_dbl(rmse, data = dat)
##    linear quadratic
## 1.7973686 0.6533399
the mean absolute error:
fits |> map_dbl(mae, data = dat)
##    linear quadratic
## 1.5265158 0.5116268

and many more.

Since overfitting is always a problem, you might want to use a quality measure that at least attempts to take model complexity into account. The glance() function from broom gives you some, such as AIC and BIC.
fits |> map_dbl(~ glance(.x)$AIC)
##   linear quadratic
## 206.5262  107.3281
If at all possible, however, you want to use test data to measure how well a model generalizes. For this, you first fit your models to the training data and then make predictions on the test data. In the following example, I fit lm(y ~ x) on leave-one-out training data and then apply each fit to the corresponding test data, measuring the quality of the generalization with RMSE.
samples <- dat |> crossv_loo()
training_fits <- samples$train |> map(~lm(y ~ x, data = .))
training_fits |> map2_dbl(samples$test, rmse) |> head(10)
##         1         2         3         4         5
## 2.8223916 1.3057640 1.3454227 0.7254241 2.5129973
##         6         7         8         9        10
## 1.1552629 1.2763827 0.9248123 0.8273398 0.2941659
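Averaging these per-split errors gives a single leave-one-out estimate of the generalization error:
training_fits |> map2_dbl(samples$test, rmse) |> mean()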