There are many models to which you can fit your data, from classical statistical models to modern machine learning methods, and a thorough exploration of the R packages that support them is well beyond the scope of this book. The main concern when choosing and fitting models is not the syntax, and this book is, after all, a syntax reference. We will look at two packages that aim to provide a tidy interface to models.
broom
When you fit a model, you get an object in return that holds information about the data and the fit. How this information is represented depends on the function used to fit the data. For a linear model, for example, we get this information:
model <- lm(disp ~ hp + wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = disp ~ hp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -82.565 -23.802   2.111  35.731  99.107
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129.9506    29.1890  -4.452 0.000116
## hp             0.6578     0.1649   3.990 0.000411
## wt            82.1125    11.5518   7.108 8.04e-08
##
## (Intercept) ***
## hp ***
## wt ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.35 on 29 degrees of freedom
## Multiple R-squared: 0.8635, Adjusted R-squared: 0.8541
## F-statistic: 91.71 on 2 and 29 DF, p-value: 2.889e-13
The problem with this representation is that it can be difficult to extract the relevant data because the data isn't tidy. The broom package fixes this. It defines three generic functions, tidy(), glance(), and augment(), that all return tibbles. The first gives you the fit, the second a summary of how good the fit is, and the third the original data augmented with fit summaries.
tidy(model) # transform to tidy tibble
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -130. 29.2 -4.45 1.16e-4
## 2 hp 0.658 0.165 3.99 4.11e-4
## 3 wt 82.1 11.6 7.11 8.04e-8
glance(model) # model summaries
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.863 0.854 47.3 91.7 2.89e-13
## # … with 7 more variables: df <dbl>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, nobs <int>
augment(model) # add model info to data
## # A tibble: 32 × 10
## .rownames disp hp wt .fitted .resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 160 110 2.62 158. 2.45
## 2 Mazda RX4 Wag 160 110 2.88 178. -18.5
## 3 Datsun 710 108 93 2.32 122. -13.7
## 4 Hornet 4 Drive 258 110 3.22 206. 51.6
## 5 Hornet Sporta… 360 175 3.44 268. 92.4
## 6 Valiant 225 105 3.46 223. 1.77
## 7 Duster 360 360 245 3.57 324. 35.6
## 8 Merc 240D 147. 62 3.19 173. -26.1
## 9 Merc 230 141. 95 3.15 191. -50.4
## 10 Merc 280 168. 123 3.44 233. -65.8
## # … with 22 more rows, and 4 more variables:
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>
The broom package implements specializations for most models. Not all three functions are meaningful for every model, so some models only support a subset of them. If you want your own model, let us call it mymodel, to work with broom, you have to implement the specializations tidy.mymodel(), glance.mymodel(), and augment.mymodel() that are relevant for the model.
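As a sketch of what such a specialization might look like, here is a hypothetical mymodel class (its constructor and its estimate/se fields are invented for illustration) with a tidy() method; glance.mymodel() and augment.mymodel() would follow the same pattern:

```r
library(tibble)

# A toy constructor for a hypothetical "mymodel" class; the class
# name and its fields are invented for this sketch.
mymodel <- function(estimate, se) {
  structure(list(estimate = estimate, se = se), class = "mymodel")
}

# broom's tidy() is an S3 generic, so providing a tidy.mymodel()
# method is what makes tidy(fit) work for objects of this class.
tidy.mymodel <- function(x, ...) {
  tibble(
    term      = names(x$estimate),
    estimate  = unname(x$estimate),
    std.error = unname(x$se)
  )
}

fit <- mymodel(c(a = 1.5, b = -0.3), se = c(a = 0.2, b = 0.1))
tidy.mymodel(fit)  # with broom loaded, simply: tidy(fit)
```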
modelr
The modelr package also provides functionality for fitting and inspecting models and for extracting information about model fits. We start with the latter. Consider the following example model:
# Build a model where variable x can help us predict response y
dat <- tibble(
  x = runif(50),
  y = 15 * x^2 + 42 + rnorm(50)
)
# Fit a linear model to the data (even though y is quadratic in x)
model <- lm(y ~ x, data = dat)
tidy(model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercep… 38.9 0.522 74.5 2.82e-51
## 2 x 15.1 0.829 18.2 3.67e-23
We have fitted a linear model to data that is sampled from a quadratic model. The details of the data and model do not matter; it is just an example. I have used the broom function tidy() to inspect the model fit.
Two functions, add_predictions() and add_residuals(), extend your data with the predictions the model makes for each data row and with the residuals for each row.
add_predictions(dat, model)
## # A tibble: 50 × 3
## x y pred
## <dbl> <dbl> <dbl>
## 1 0.633 45.7 48.5
## 2 0.385 43.4 44.7
## 3 0.566 46.1 47.4
## 4 0.922 53.5 52.8
## 5 0.976 56.0 53.6
## 6 0.933 54.1 53.0
## 7 0.381 43.4 44.7
## 8 0.256 43.7 42.8
## 9 0.257 42.0 42.8
## 10 0.197 42.2 41.9
## # … with 40 more rows
add_residuals(dat, model)
## # A tibble: 50 × 3
## x y resid
## <dbl> <dbl> <dbl>
## 1 0.633 45.7 -2.76
## 2 0.385 43.4 -1.27
## 3 0.566 46.1 -1.32
## 4 0.922 53.5 0.690
## 5 0.976 56.0 2.37
## 6 0.933 54.1 1.10
## 7 0.381 43.4 -1.24
## 8 0.256 43.7 0.890
## 9 0.257 42.0 -0.797
## 10 0.197 42.2 0.281
## # … with 40 more rows
Predictions need not be for existing data. You can create a data frame of explanatory variables and predict the response variable for the new data.
new_dat <- tibble(x = seq(0, 1, length.out = 5))
add_predictions(new_dat, model)
## # A tibble: 5 × 2
## x pred
## <dbl> <dbl>
## 1 0 38.9
## 2 0.25 42.7
## 3 0.5 46.5
## 4 0.75 50.2
## 5 1 54.0
I know that the x values are in the range from zero to one, but we cannot always know a priori the range that a variable falls within. If you don't know, you can use the seq_range() function to get equidistant points spanning the lowest to the highest value in your data.
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
seq_range(dat$x, n = 5) # over the range of observations
## [1] 0.02272462 0.26437738 0.50603015 0.74768292
## [5] 0.98933569
If you have two models and want to know how they compare with respect to their predictions, you can use gather_predictions() and spread_predictions():
# comparing the line to a (better) model y ~ x^2 + x + 1
model2 <- lm(y ~ I(x^2) + x, data = dat)
gather_predictions(new_dat, model, model2)
## # A tibble: 10 × 3
## model x pred
## <chr> <dbl> <dbl>
## 1 model 0 38.9
## 2 model 0.25 42.7
## 3 model 0.5 46.5
## 4 model 0.75 50.2
## 5 model 1 54.0
## 6 model2 0 43.2
## 7 model2 0.25 42.3
## 8 model2 0.5 44.2
## 9 model2 0.75 49.0
## 10 model2 1 56.8
spread_predictions(new_dat, model, model2)
## # A tibble: 5 × 3
## x model model2
## <dbl> <dbl> <dbl>
## 1 0 38.9 43.2
## 2 0.25 42.7 42.3
## 3 0.5 46.5 44.2
## 4 0.75 50.2 49.0
## 5 1 54.0 56.8
They show the same data, just formatted differently. The names gather and spread resemble the tidyr functions pivot_longer() and pivot_wider(); the names are taken from tidyr's gather() and spread(), the deprecated predecessors of the pivot functions.
Earlier, we made predictions on new data, but you can, of course, also do it on your original data.
gather_predictions(dat, model, model2)
## # A tibble: 100 × 4
## model x y pred
## <chr> <dbl> <dbl> <dbl>
## 1 model 0.633 45.7 48.5
## 2 model 0.385 43.4 44.7
## 3 model 0.566 46.1 47.4
## 4 model 0.922 53.5 52.8
## 5 model 0.976 56.0 53.6
## 6 model 0.933 54.1 53.0
## 7 model 0.381 43.4 44.7
## 8 model 0.256 43.7 42.8
## 9 model 0.257 42.0 42.8
## 10 model 0.197 42.2 41.9
## # … with 90 more rows
spread_predictions(dat, model, model2)
## # A tibble: 50 × 4
## x y model model2
## <dbl> <dbl> <dbl> <dbl>
## 1 0.633 45.7 48.5 46.4
## 2 0.385 43.4 44.7 42.9
## 3 0.566 46.1 47.4 45.2
## 4 0.922 53.5 52.8 54.1
## 5 0.976 56.0 53.6 55.9
## 6 0.933 54.1 53.0 54.4
## 7 0.381 43.4 44.7 42.9
## 8 0.256 43.7 42.8 42.3
## 9 0.257 42.0 42.8 42.3
## 10 0.197 42.2 41.9 42.2
## # … with 40 more rows
If you have the original data, you can also get residuals.
gather_residuals(dat, model, model2)
## # A tibble: 100 × 4
## model x y resid
## <chr> <dbl> <dbl> <dbl>
## 1 model 0.633 45.7 -2.76
## 2 model 0.385 43.4 -1.27
## 3 model 0.566 46.1 -1.32
## 4 model 0.922 53.5 0.690
## 5 model 0.976 56.0 2.37
## 6 model 0.933 54.1 1.10
## 7 model 0.381 43.4 -1.24
## 8 model 0.256 43.7 0.890
## 9 model 0.257 42.0 -0.797
## 10 model 0.197 42.2 0.281
## # … with 90 more rows
spread_residuals(dat, model, model2)
## # A tibble: 50 × 4
## x y model model2
## <dbl> <dbl> <dbl> <dbl>
## 1 0.633 45.7 -2.76 -0.713
## 2 0.385 43.4 -1.27 0.500
## 3 0.566 46.1 -1.32 0.938
## 4 0.922 53.5 0.690 -0.552
## 5 0.976 56.0 2.37 0.0826
## 6 0.933 54.1 1.10 -0.348
## 7 0.381 43.4 -1.24 0.505
## 8 0.256 43.7 0.890 1.39
## 9 0.257 42.0 -0.797 -0.275
## 10 0.197 42.2 0.281 -0.0587
## # … with 40 more rows
Depending on the type of data science you usually do, you might have to sample to get empirical distributions or to split your data into training and test data to avoid overfitting. With modelr, you have functions for this.
You can build n data sets using bootstrapping with the bootstrap() function.
bootstrap(dat, 3)
## # A tibble: 3 × 2
## strap .id
## <list> <chr>
## 1 <resample [50 x 2]> 1
## 2 <resample [50 x 2]> 2
## 3 <resample [50 x 2]> 3
It samples data points with replacement and creates n new data sets this way. The resulting tibble has two columns: the first, strap, holds the data for each sample, and the second, .id, identifies the sample.
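Each entry in the strap column is a resample object rather than a copy of the data; it stores the original data plus the sampled row indices, and as.data.frame() materializes the actual rows when you need them. A small sketch, reusing dat from above:

```r
library(modelr)
library(tibble)

# Same shape as the example data used in this chapter
dat <- tibble(x = runif(50), y = 15 * x^2 + 42 + rnorm(50))

boot <- bootstrap(dat, 3)
first <- boot$strap[[1]]  # a resample object, not a data frame
as.data.frame(first)      # materialize the 50 sampled rows
```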
The crossv_mc() function, for Monte Carlo cross-validation, creates cross-validation data, that is, it splits your data into training and test data. It creates n random splits into test and training data.
crossv_mc(dat, 3)
## # A tibble: 3 × 3
## train test .id
## <list> <list> <chr>
## 1 <resample [39 x 2]> <resample [11 x 2]> 1
## 2 <resample [39 x 2]> <resample [11 x 2]> 2
## 3 <resample [39 x 2]> <resample [11 x 2]> 3
By default, the test data is 20% of the sampled data; you can change this using the test argument.
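For example, an even split between training and test data could look like this (reusing dat from above):

```r
library(modelr)
library(tibble)

# Same shape as the example data used in this chapter
dat <- tibble(x = runif(50), y = 15 * x^2 + 42 + rnorm(50))

# Five Monte Carlo splits, each holding out half the data for testing
cv <- crossv_mc(dat, n = 5, test = 0.5)
dim(cv$test[[1]])  # each test resample now covers half of the 50 rows
```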
The crossv_kfold() and crossv_loo() functions give you k-fold and leave-one-out data, respectively.
crossv_kfold(dat, k = 3)
## # A tibble: 3 × 3
## train test .id
## <named list> <named list> <chr>
## 1 <resample [33 x 2]> <resample [17 x 2]> 1
## 2 <resample [33 x 2]> <resample [17 x 2]> 2
## 3 <resample [34 x 2]> <resample [16 x 2]> 3
crossv_loo(dat)
## # A tibble: 50 × 3
## train test .id
## <named list> <named list> <int>
## 1 <resample [49 x 2]> <resample [1 x 2]> 1
## 2 <resample [49 x 2]> <resample [1 x 2]> 2
## 3 <resample [49 x 2]> <resample [1 x 2]> 3
## 4 <resample [49 x 2]> <resample [1 x 2]> 4
## 5 <resample [49 x 2]> <resample [1 x 2]> 5
## 6 <resample [49 x 2]> <resample [1 x 2]> 6
## 7 <resample [49 x 2]> <resample [1 x 2]> 7
## 8 <resample [49 x 2]> <resample [1 x 2]> 8
## 9 <resample [49 x 2]> <resample [1 x 2]> 9
## 10 <resample [49 x 2]> <resample [1 x 2]> 10
## # … with 40 more rows
As an example, say you have sampled three bootstrap data sets. The samples are in the strap column, so we can map over it and fit a linear model to each sampled data set.
samples <- bootstrap(dat, 3)
fitted_models <- samples |>
mutate(
# Map over all the bootstrap samples and fit each of them
fits = strap |> map(\(dat) lm(y ~ x, data = dat))
)
fitted_models
## # A tibble: 3 × 3
## strap .id fits
## <list> <chr> <list>
## 1 <resample [50 x 2]> 1 <lm>
## 2 <resample [50 x 2]> 2 <lm>
## 3 <resample [50 x 2]> 3 <lm>
Printing one of the fitted models shows the usual lm() output:
fitted_models$fits[[1]]
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Coefficients:
## (Intercept) x
## 38.46 15.54
Then we can map over the three models and inspect them using broom's glance() function:
fitted_models$fits |>
map(glance) |>
bind_rows()
## # A tibble: 3 × 12
## r.squared adj.r.squared sigma statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.871 0.869 1.85 325. 5.07e-23
## 2 0.813 0.809 1.89 209. 4.12e-19
## 3 0.904 0.902 1.58 451. 4.63e-26
## # … with 7 more variables: df <dbl>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, nobs <int>
If we were interested in the empirical distribution of the estimated coefficient for x, we could extract the estimates from the models fitted to the bootstrapped data and go from there.
get_x <- function(m) {
  tidy(m) |>
    filter(term == "x") |>
    pull(estimate)
}
fitted_models$fits |> map_dbl(get_x)
## [1] 15.54085 13.25745 15.50705
If you want to compare models, rather than samples of your data, then modelr has support for that as well. You can make a list of the formulae you want to fit; the formulae() function lets you create such a list.
models <- formulae(~y, linear = ~x, quadratic = ~I(x^2) + x)
The first argument is the response variable (the left-hand side of the formulae), and the remaining, named arguments describe the explanatory variables, the right-hand side of a model formula.
If you call fit_with() with your data, the fitting function to use (here lm()), and the formulae you wish to fit, then you get what you want: a fit for each formula.
fits <- fit_with(dat, lm, models)
fits |> map(glance) |> bind_rows()
## # A tibble: 2 × 12
## r.squared adj.r.squared sigma statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.873 0.871 1.83 330. 3.67e-23
## 2 0.983 0.983 0.674 1379. 1.86e-42
## # … with 7 more variables: df <dbl>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, nobs <int>
You will find many model quality measures in modelr, for example, the root mean square error:
fits |> map_dbl(rmse, data = dat)
## linear quadratic
## 1.7973686 0.6533399
the mean absolute error:
fits |> map_dbl(mae, data = dat)
## linear quadratic
## 1.5265158 0.5116268
and many more.
Since overfitting is always a problem, you might want to use a quality measure that at least attempts to take model complexity into account. You find some in the glance() function from broom.
fits |> map_dbl(~ glance(.x)$AIC)
## linear quadratic
## 206.5262 107.3281
If at all possible, however, you want to use test data to measure how well a model generalizes. For this, you first fit your models to the training data and then make predictions on the test data. In the following example, I have fitted lm(y ~ x) on leave-one-out data and then applied it to the test data, measuring the quality of the generalization using RMSE.
samples <- dat |> crossv_loo()
training_fits <- samples$train |> map(~lm(y ~ x, data = .))
training_fits |> map2_dbl(samples$test, rmse) |> head(10)
## 1 2 3 4 5
## 2.8223916 1.3057640 1.3454227 0.7254241 2.5129973
## 6 7 8 9 10
## 1.1552629 1.2763827 0.9248123 0.8273398 0.2941659