
11. Working with Models: broom and modelr


There are many models to which you can fit your data, from classical statistical models to modern machine learning methods, and a thorough exploration of the R packages that support them is well beyond the scope of this book. The main concern when choosing and fitting models is not the syntax, and this book is, after all, a syntax reference. We will look at two packages that aim to provide a tidy interface to models.

The two packages, broom and modelr, are not loaded with tidyverse, so you must load them individually.
library(broom)
library(modelr)

broom

When you fit a model, you get an object in return that holds information about the data and the fit. How this information is represented varies with the function used to fit the model. For a linear model, for example, we get this information:
model <- lm(disp ~ hp + wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = disp ~ hp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -82.565 -23.802   2.111  35.731  99.107
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129.9506    29.1890  -4.452 0.000116 ***
## hp             0.6578     0.1649   3.990 0.000411 ***
## wt            82.1125    11.5518   7.108 8.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.35 on 29 degrees of freedom
## Multiple R-squared: 0.8635, Adjusted R-squared:   0.8541
## F-statistic: 91.71 on 2 and 29 DF, p-value: 2.889e-13
The problem with this representation is that it can be difficult to extract the relevant values because the data isn't tidy. The broom package fixes this. It defines three generic functions, tidy(), glance(), and augment(), all of which return tibbles. The first gives you the fitted parameters, the second a summary of how good the fit is, and the third the original data augmented with per-observation fit statistics.
tidy(model) # transform to tidy tibble
## # A tibble: 3 x 5
##   term      estimate std.error statistic   p.value
##   <chr>        <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Interce...  -130.      29.2     -4.45   1.16e-4
## 2 hp              0.658    0.165    3.99   4.11e-4
## 3 wt             82.1     11.6      7.11   8.04e-8
glance(model) # model summaries
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic    p.value
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl>
## 1     0.863         0.854  47.3      91.7   2.89e-13
## # … with 6 more variables: df <int>,
## #  logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #  deviance <dbl>, df.residual <int>
augment(model) # add model info to data
## # A tibble: 32 x 11
##    .rownames    disp    hp     wt    .fitted  .se.fit
##    <chr>       <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
##  1 Mazda RX4     160     110   2.62     158.     9.96
##  2 Mazda RX…     160     110   2.88     178.     9.53
##  3 Datsun 7…     108      93   2.32     122.    11.6
##  4 Hornet 4…     258     110   3.22     206.    10.3
##  5 Hornet S…     360     175   3.44     268.     9.09
##  6 Valiant       225     105   3.46     223.    12.3
##  7 Duster 3…     360     245   3.57     324.    16.2
##  8 Merc 240D     147.     62   3.19     173.    16.1
##  9 Merc 230      141.     95   3.15     191.    11.6
## 10 Merc 280      168.    123   3.44     233.    10.3
## # … with 22 more rows, and 5 more variables:
## #   .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>

The broom package implements specializations for most models. Not all three functions are meaningful for every model, so some models support only a subset of them. If you want your own model class, let us call it mymodel, to work with broom, you have to implement the specializations tidy.mymodel(), glance.mymodel(), and augment.mymodel() that make sense for the model.
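As a minimal sketch, and not part of broom itself, here is what a tidy() specialization could look like for a hypothetical model class mymodel whose fitted objects store their estimates in a named numeric vector; the field name coefficients is an assumption made purely for illustration.
# Hypothetical tidy() method for objects of class "mymodel".
# Assumes, for illustration only, that the fitted object stores
# its estimates in a named numeric vector `coefficients`.
tidy.mymodel <- function(x, ...) {
  tibble::tibble(
    term     = names(x$coefficients),
    estimate = unname(x$coefficients)
  )
}
Since tidy() is an S3 generic, broom dispatches to this method automatically for any object whose class vector contains "mymodel".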

modelr

The modelr package also provides functionality for fitting and inspecting models and for extracting information about model fits. We start with the latter.

Consider the following example model.
x <- tibble(
  x = runif(5),
  y = 15 * 12 * x + 42 + rnorm(5)
)
model <- lm(y ~ x, data = x)
tidy(model)
## # A tibble: 2 x 5
##   term      estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Interce…     41.8    0.271      154.   5.99e-7
## 2 x            180.     1.15       157.   5.72e-7

We have fitted a linear model to data that is sampled from a linear model. The details of the data and model do not matter; it is just an example. I have used the broom function tidy() to inspect the model.

Two functions, add_predictions() and add_residuals(), extend your data with the predictions the model makes for each row and with each row's residual, respectively.
add_predictions(x, model)
## # A tibble: 5 x 3
##        x     y  pred
##    <dbl> <dbl> <dbl>
## 1 0.445  122.  122.
## 2 0.0594  52.9  52.5
## 3 0.275   91.3  91.4
## 4 0.0311  46.9  47.5
## 5 0.0145  44.7  44.4
add_residuals(x, model)
## # A tibble: 5 x 3
##        x     y    resid
##    <dbl> <dbl>    <dbl>
## 1 0.445  122.    0.0653
## 2 0.0594  52.9   0.399
## 3 0.275   91.3  -0.140
## 4 0.0311  46.9  -0.566
## 5 0.0145  44.7   0.242
Predictions need not be for existing data. You can create a data frame of explanatory variables and predict the response variable from the new data.
xs <- tibble(x = seq(0, 1, length.out = 5))
add_predictions(xs, model)
## # A tibble: 5 x 2
##       x  pred
##   <dbl> <dbl>
## 1  0     41.8
## 2  0.25  86.9
## 3  0.5  132.
## 4  0.75 177.
## 5  1    222.
I know that the x values here lie between zero and one, but you cannot always know a priori the range a variable falls within. If you don't know, you can use the seq_range() function to get equidistant points spanning the range from the lowest to the highest value in your data.
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
seq_range(x$x, n = 5) # over the range of observations
## [1] 0.01448234 0.12204440 0.22960645 0.33716850
## [5] 0.44473056
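Combining the two functions, we can build a prediction grid that spans the observed range of x; this short sketch is my own addition, using only functions introduced above.
# Predict over five equidistant points covering the observed x range.
grid <- tibble(x = seq_range(x$x, n = 5))
add_predictions(grid, model)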
If you have two models and want to know how they compare with respect to their predictions, you can use gather_predictions() and spread_predictions():
# comparing models
model2 <- lm(y ~ I(x^2) + x, data = x)
gather_predictions(xs, model, model2)
## # A tibble: 10 x 3
##    model      x  pred
##    <chr>  <dbl> <dbl>
##  1 model   0     41.8
##  2 model   0.25  86.9
##  3 model   0.5  132.
##  4 model   0.75 177.
##  5 model   1    222.
##  6 model2  0     41.9
##  7 model2  0.25  86.8
##  8 model2  0.5  132.
##  9 model2  0.75 178.
## 10 model2  1    223.
spread_predictions(xs, model, model2)
## # A tibble: 5 x 3
##       x  model model2
##   <dbl>  <dbl>  <dbl>
## 1   0     41.8   41.9
## 2   0.25  86.9   86.8
## 3   0.5  132.   132.
## 4   0.75 177.   178.
## 5   1    222.   223.

They show the same data, just with the tables formatted differently. The names gather and spread echo the tidyr functions gather() and spread().

In the previous example, we made predictions on new data, but you can, of course, also do it on your original data.
gather_predictions(x, model, model2)
## # A tibble: 10 x 4
##    model       x        y    pred
##    <chr>   <dbl>    <dbl>   <dbl>
##  1 model    0.445    122.    122.
##  2 model    0.0594    52.9    52.5
##  3 model    0.275     91.3    91.4
##  4 model    0.0311    46.9    47.5
##  5 model    0.0145    44.7    44.4
##  6 model2   0.445    122.    122.
##  7 model2   0.0594    52.9    52.5
##  8 model2   0.275     91.3    91.3
##  9 model2   0.0311    46.9    47.5
## 10 model2   0.0145    44.7    44.5
spread_predictions(x, model, model2)
## # A tibble: 5 x 4
##        x       y model model2
##    <dbl>   <dbl> <dbl>  <dbl>
## 1 0.445    122.  122.   122.
## 2 0.0594    52.9  52.5   52.5
## 3 0.275     91.3  91.4   91.3
## 4 0.0311    46.9  47.5   47.5
## 5 0.0145    44.7  44.4   44.5
If you have the original data, you can also get residuals.
gather_residuals(x, model, model2)
## # A tibble: 10 x 4
##    model       x      y     resid
##    <chr>   <dbl>  <dbl>     <dbl>
##  1 model  0.445   122.     0.0653
##  2 model  0.0594   52.9    0.399
##  3 model  0.275    91.3   -0.140
##  4 model  0.0311   46.9   -0.566
##  5 model  0.0145   44.7    0.242
##  6 model2 0.445   122.     0.0225
##  7 model2 0.0594   52.9    0.412
##  8 model2 0.275    91.3   -0.0710
##  9 model2 0.0311   46.9   -0.578
## 10 model2 0.0145   44.7    0.215
spread_residuals(x, model, model2)
## # A tibble: 5 x 4
##        x     y      model  model2
##    <dbl> <dbl>      <dbl>    <dbl>
## 1 0.445  122.      0.0653   0.0225
## 2 0.0594  52.9     0.399    0.412
## 3 0.275   91.3    -0.140   -0.0710
## 4 0.0311  46.9    -0.566   -0.578
## 5 0.0145  44.7     0.242    0.215

Depending on the type of data science you usually do, you might have to sample to get empirical distributions or to split your data into training and test data to avoid overfitting. With modelr you have functions for this.

You can build n data sets using bootstrapping with the bootstrap() function.
bootstrap(x, n = 3)
## # A tibble: 3 x 2
##   strap      .id
##   <list>     <chr>
## 1 <resample> 1
## 2 <resample> 2
## 3 <resample> 3

It samples data points with replacement and creates n new data sets this way. The resulting tibble has two columns: the first, strap, contains the data for each sample, and the second, .id, identifies the sample.
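The entries in strap are not copies of the data but lightweight resample objects that store row indices into the original data frame. If you need an ordinary data frame, coerce one explicitly; this is a quick sketch of my own.
boot <- bootstrap(x, n = 3)
# Materialize the first bootstrap sample as a plain data frame.
as.data.frame(boot$strap[[1]])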

The crossv_mc() function, for Monte Carlo cross-validation, creates cross-validation data; that is, it generates n random splits of your data into training and test sets.
crossv_mc(x, n = 3)
## # A tibble: 3 x 3
##   train      test       .id
##   <list>     <list>     <chr>
## 1 <resample> <resample> 1
## 2 <resample> <resample> 2
## 3 <resample> <resample> 3

By default, the test data is 20% of the sampled data; you can change this using the test argument.
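For example, to hold out half of the data in each split (a brief sketch, not from the original text):
crossv_mc(x, n = 3, test = 0.5) # 50% test, 50% training per split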

The crossv_kfold() and crossv_loo() functions give you k-fold and leave-one-out cross-validation data, respectively.
crossv_kfold(x, k = 3)
## # A tibble: 3 x 3
##   train      test       .id
##   <list>     <list>     <chr>
## 1 <resample> <resample> 1
## 2 <resample> <resample> 2
## 3 <resample> <resample> 3
crossv_loo(x)
## # A tibble: 5 x 3
##   train      test         .id
##   <list>     <list>     <int>
## 1 <resample> <resample>    1
## 2 <resample> <resample>    2
## 3 <resample> <resample>    3
## 4 <resample> <resample>    4
## 5 <resample> <resample>    5
As an example, say you have sampled three bootstrap data sets. The samples are in the strap column, so we can map over it and fit a linear model to each sampled data set.
samples <- bootstrap(x, 3)
fitted_models <- samples$strap %>%
  map(~ lm(y ~ x, data = .))
Then we can map over the three models and inspect them using broom's glance() function:
fitted_models %>%
  map(glance) %>%
  bind_rows()
## # A tibble: 3 x 11
##   r.squared adj.r.squared sigma statistic  p.value
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl>
## 1     1.000         1.000 0.171   160710.  3.42e-8
## 2     1.000         1.000 0.513    16835.  1.01e-6
## 3     1.000         1.000 0.485    17065.  9.89e-7
## # … with 6 more variables: df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>
If we were interested in the empirical distribution of the coefficient for x, we could extract the estimates from the bootstrapped fits and go from there.
get_x <- function(m) {
  # Extract the estimated coefficient for the `x` term.
  tidy(m) %>%
    filter(term == "x") %>%
    pull(estimate)
}
fitted_models %>% map_dbl(get_x)
## [1] 179.3946 180.6586 180.0705
If you want to compare models, rather than samples of your data, then modelr has support for that as well. You can make a list of the formulae you want to fit. The formulae() function lets you create such a list.
models <- formulae(~y, linear = ~x, quadratic = ~I(x^2) + x)

The first argument is the response variable (the left-hand side of the formulae), and the remaining arguments are (named) parameters that describe the explanatory variables, the right-hand sides of the model formulae.
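The result is a named list of two-sided formulas, each combining the shared response with one right-hand side; a quick illustration of my own:
models$linear    # y ~ x
models$quadratic # y ~ I(x^2) + x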

If you call fit_with() with your data, the fitting function to use (here lm()), and the formulae you wish to fit, then you get what you want: a fit for each formula.
fits <- fit_with(x, lm, models)
fits %>% map(glance) %>% bind_rows()
## # A tibble: 2 x 11
##   r.squared adj.r.squared sigma statistic    p.value
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl>
## 1     1.000         1.000 0.433    24589.    5.72e-7
## 2     1.000         1.000 0.527     8310.    1.20e-4
## # … with 6 more variables: df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>
You will find many model quality measures in modelr, for example, root-mean-square error:
fits %>% map_dbl(rmse, data = x)
##    linear quadratic
## 0.3354516 0.3331588
mean absolute error:
fits %>% map_dbl(mae, data = x)
##    linear quadratic
## 0.2826265 0.2595208

and many more.
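For instance, modelr also provides rsquare(), which reports the variance explained by a fit; this example is my addition, analogous to the calls above.
fits %>% map_dbl(rsquare, data = x)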

Since overfitting is always a problem, you might want a quality measure that at least attempts to take model complexity into account. You will find some in the output of broom's glance() function.
fits %>% map_dbl(~ glance(.x)$AIC)
##   linear quadratic
## 9.266611 11.198024
If at all possible, however, you want to use test data to measure how well a model generalizes. For this, you first need to fit your models to the training data and then make predictions on the test data. In the following example, I fit lm(y ~ x) to each leave-one-out training set and then apply the fits to the corresponding test sets, measuring the quality of the generalization using RMSE.
samples <- crossv_loo(x)
training_fits <-
  samples$train %>% map(~lm(y ~ x, data = .))
test_measurement <- training_fits %>%
  map2_dbl(samples$test, rmse)
test_measurement
##         1         2         3         4         5
## 0.2618250 0.5535395 0.1963251 0.8401988 0.3776097
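A natural next step, sketched here with only the functions introduced above (my addition, not from the original text), is to collapse the per-fold errors into a single cross-validated score so that competing models can be compared:
# Fit the quadratic model on the same training sets and compare
# the mean leave-one-out RMSE of the two models.
training_fits2 <- samples$train %>%
  map(~ lm(y ~ I(x^2) + x, data = .))
tibble(
  linear    = mean(test_measurement),
  quadratic = mean(map2_dbl(training_fits2, samples$test, rmse))
)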