
11. Working with Models: broom and modelr


There are many models to which you can fit your data, from classical statistical models to modern machine learning methods, and a thorough exploration of the R packages that support them is well beyond the scope of this book. The main concern when choosing and fitting models is not the syntax, and this book is, after all, a syntax reference. We will look at two packages that aim to provide a tidy interface to models.

The two packages, broom and modelr, are not loaded with tidyverse, so you must load them individually.
library(broom)
library(modelr)

broom

When you fit a model, you get an object in return that holds information about the data and the fit. How this information is represented varies with the function used to fit the model. For a linear model, for example, we get this information:
model <- lm(disp ~ hp + wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = disp ~ hp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -82.565 -23.802   2.111  35.731  99.107
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129.9506    29.1890  -4.452 0.000116 ***
## hp             0.6578     0.1649   3.990 0.000411 ***
## wt            82.1125    11.5518   7.108 8.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.35 on 29 degrees of freedom
## Multiple R-squared: 0.8635, Adjusted R-squared:   0.8541
## F-statistic: 91.71 on 2 and 29 DF, p-value: 2.889e-13
The problem with this representation is that it can be difficult to extract the relevant values because the data isn't tidy. The broom package fixes this. It defines three generic functions, tidy(), glance(), and augment(), all of which return tibbles. The first gives you the fitted parameters, the second a summary of how good the fit is, and the third the original data augmented with per-observation fit statistics.
tidy(model) # transform to tidy tibble
## # A tibble: 3 x 5
##   term      estimate std.error statistic   p.value
##   <chr>        <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Interce...  -130.      29.2     -4.45   1.16e-4
## 2 hp              0.658    0.165    3.99   4.11e-4
## 3 wt             82.1     11.6      7.11   8.04e-8
glance(model) # model summaries
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic    p.value
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl>
## 1     0.863         0.854  47.3      91.7   2.89e-13
## # … with 6 more variables: df <int>,
## #  logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #  deviance <dbl>, df.residual <int>
augment(model) # add model info to data
## # A tibble: 32 x 11
##    .rownames    disp    hp     wt    .fitted  .se.fit
##    <chr>       <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
##  1 Mazda RX4     160     110   2.62     158.     9.96
##  2 Mazda RX…     160     110   2.88     178.     9.53
##  3 Datsun 7…     108      93   2.32     122.    11.6
##  4 Hornet 4…     258     110   3.22     206.    10.3
##  5 Hornet S…     360     175   3.44     268.     9.09
##  6 Valiant       225     105   3.46     223.    12.3
##  7 Duster 3…     360     245   3.57     324.    16.2
##  8 Merc 240D     147.     62   3.19     173.    16.1
##  9 Merc 230      141.     95   3.15     191.    11.6
## 10 Merc 280      168.    123   3.44     233.    10.3
## # … with 22 more rows, and 5 more variables:
## #   .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>

The broom package implements specializations for most models. Not all three functions are meaningful for every model, so some models support only a subset of them. If you want your own model class, let us call it mymodel, to work with broom, you have to implement the specializations tidy.mymodel(), glance.mymodel(), and augment.mymodel() that make sense for the model.
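As a minimal sketch, and not part of broom itself, here is what a tidy() specialization could look like for a hypothetical model class mymodel whose fitted objects store their estimates in a named numeric vector; the field name coefficients is an assumption made purely for illustration.
# Hypothetical tidy() method for objects of class "mymodel".
# Assumes, for illustration only, that the fitted object stores
# its estimates in a named numeric vector `coefficients`.
tidy.mymodel <- function(x, ...) {
  tibble::tibble(
    term     = names(x$coefficients),
    estimate = unname(x$coefficients)
  )
}
Since tidy() is an S3 generic, broom dispatches to this method automatically for any object whose class vector contains "mymodel".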

modelr

The modelr package also provides functionality for fitting and inspecting models and for extracting information about model fits. We start with the latter.

Consider the following example model.
x <- tibble(
  x = runif(5),
  y = 15 * 12 * x + 42 + rnorm(5)
)
model <- lm(y ~ x, data = x)
tidy(model)
## # A tibble: 2 x 5
##   term      estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Interce…     41.8    0.271      154.   5.99e-7
## 2 x            180.     1.15       157.   5.72e-7

We have fitted a linear model to data that is sampled from a linear model. The details of the data and model do not matter; it is just an example. I have used the broom function tidy() to inspect the model.

Two functions, add_predictions() and add_residuals(), extend your data with the predictions the model makes for each row and with each row's residual, respectively.
add_predictions(x, model)
## # A tibble: 5 x 3
##        x     y  pred
##    <dbl> <dbl> <dbl>
## 1 0.445  122.  122.
## 2 0.0594  52.9  52.5
## 3 0.275   91.3  91.4
## 4 0.0311  46.9  47.5
## 5 0.0145  44.7  44.4
add_residuals(x, model)
## # A tibble: 5 x 3
##        x     y    resid
##    <dbl> <dbl>    <dbl>
## 1 0.445  122.    0.0653
## 2 0.0594  52.9   0.399
## 3 0.275   91.3  -0.140
## 4 0.0311  46.9  -0.566
## 5 0.0145  44.7   0.242
Predictions need not be for existing data. You can create a data frame of explanatory variables and predict the response variable from the new data.
xs <- tibble(x = seq(0, 1, length.out = 5))
add_predictions(xs, model)
## # A tibble: 5 x 2
##       x  pred
##   <dbl> <dbl>
## 1  0     41.8
## 2  0.25  86.9
## 3  0.5  132.
## 4  0.75 177.
## 5  1    222.
I know that the x values here lie between zero and one, but you cannot always know a priori the range a variable falls within. If you don't know, you can use the seq_range() function to get equidistant points spanning the range from the lowest to the highest value in your data.
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
seq_range(x$x, n = 5) # over the range of observations
## [1] 0.01448234 0.12204440 0.22960645 0.33716850
## [5] 0.44473056
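Combining the two functions, we can build a prediction grid that spans the observed range of x; this short sketch is my own addition, using only functions introduced above.
# Predict over five equidistant points covering the observed x range.
grid <- tibble(x = seq_range(x$x, n = 5))
add_predictions(grid, model)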
If you have two models and want to know how they compare with respect to their predictions, you can use gather_predictions() and spread_predictions():
# comparing models
model2 <- lm(y ~ I(x^2) + x, data = x)
gather_predictions(xs, model, model2)
## # A tibble: 10 x 3
##    model      x  pred
##    <chr>  <dbl> <dbl>
##  1 model   0     41.8
##  2 model   0.25  86.9
##  3 model   0.5  132.
##  4 model   0.75 177.
##  5 model   1    222.
##  6 model2  0     41.9
##  7 model2  0.25  86.8
##  8 model2  0.5  132.
##  9 model2  0.75 178.
## 10 model2  1    223.
spread_predictions(xs, model, model2)
## # A tibble: 5 x 3
##       x  model model2
##   <dbl>  <dbl>  <dbl>
## 1   0     41.8   41.9
## 2   0.25  86.9   86.8
## 3   0.5  132.   132.
## 4   0.75 177.   178.
## 5   1    222.   223.

They show the same data, just with the tables formatted differently. The names gather and spread echo the tidyr functions gather() and spread().

In the previous example, we made predictions on new data, but you can, of course, also do it on your original data.
gather_predictions(x, model, model2)
## # A tibble: 10 x 4
##    model       x        y    pred
##    <chr>   <dbl>    <dbl>   <dbl>
##  1 model    0.445    122.    122.
##  2 model    0.0594    52.9    52.5
##  3 model    0.275     91.3    91.4
##  4 model    0.0311    46.9    47.5
##  5 model    0.0145    44.7    44.4
##  6 model2   0.445    122.    122.
##  7 model2   0.0594    52.9    52.5
##  8 model2   0.275     91.3    91.3
##  9 model2   0.0311    46.9    47.5
## 10 model2   0.0145    44.7    44.5
spread_predictions(x, model, model2)
## # A tibble: 5 x 4
##        x       y model model2
##    <dbl>   <dbl> <dbl>  <dbl>
## 1 0.445    122.  122.   122.
## 2 0.0594    52.9  52.5   52.5
## 3 0.275     91.3  91.4   91.3
## 4 0.0311    46.9  47.5   47.5
## 5 0.0145    44.7  44.4   44.5
If you have the original data, you can also get residuals.
gather_residuals(x, model, model2)
## # A tibble: 10 x 4
##    model       x      y     resid
##    <chr>   <dbl>  <dbl>     <dbl>
##  1 model  0.445   122.     0.0653
##  2 model  0.0594   52.9    0.399
##  3 model  0.275    91.3   -0.140
##  4 model  0.0311   46.9   -0.566
##  5 model  0.0145   44.7    0.242
##  6 model2 0.445   122.     0.0225
##  7 model2 0.0594   52.9    0.412
##  8 model2 0.275    91.3   -0.0710
##  9 model2 0.0311   46.9   -0.578
## 10 model2 0.0145   44.7    0.215
spread_residuals(x, model, model2)
## # A tibble: 5 x 4
##        x     y      model  model2
##    <dbl> <dbl>      <dbl>    <dbl>
## 1 0.445  122.      0.0653   0.0225
## 2 0.0594  52.9     0.399    0.412
## 3 0.275   91.3    -0.140   -0.0710
## 4 0.0311  46.9    -0.566   -0.578
## 5 0.0145  44.7     0.242    0.215

Depending on the type of data science you usually do, you might have to sample to get empirical distributions or to split your data into training and test data to avoid overfitting. With modelr you have functions for this.

You can build n data sets using bootstrapping with the bootstrap() function.
bootstrap(x, n = 3)
## # A tibble: 3 x 2
##   strap      .id
##   <list>     <chr>
## 1 <resample> 1
## 2 <resample> 2
## 3 <resample> 3

It samples data points with replacement and creates n new data sets this way. The resulting tibble has two columns: the first, strap, contains the data for each sample, and the second, .id, identifies the sample.
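The entries in strap are not copies of the data but lightweight resample objects that store row indices into the original data frame. If you need an ordinary data frame, coerce one explicitly; this is a quick sketch of my own.
boot <- bootstrap(x, n = 3)
# Materialize the first bootstrap sample as a plain data frame.
as.data.frame(boot$strap[[1]])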

The crossv_mc() function, for Monte Carlo cross-validation, creates cross-validation data; that is, it generates n random splits of your data into training and test sets.
crossv_mc(x, n = 3)
## # A tibble: 3 x 3
##   train      test       .id
##   <list>     <list>     <chr>
## 1 <resample> <resample> 1
## 2 <resample> <resample> 2
## 3 <resample> <resample> 3

By default, the test data is 20% of the sampled data; you can change this using the test argument.
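For example, to hold out half of the data in each split (a brief sketch, not from the original text):
crossv_mc(x, n = 3, test = 0.5) # 50% test, 50% training per split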

The crossv_kfold() and crossv_loo() functions give you k-fold and leave-one-out cross-validation data, respectively.
crossv_kfold(x, k = 3)
## # A tibble: 3 x 3
##   train      test       .id
##   <list>     <list>     <chr>
## 1 <resample> <resample> 1
## 2 <resample> <resample> 2
## 3 <resample> <resample> 3
crossv_loo(x)
## # A tibble: 5 x 3
##   train      test         .id
##   <list>     <list>     <int>
## 1 <resample> <resample>    1
## 2 <resample> <resample>    2
## 3 <resample> <resample>    3
## 4 <resample> <resample>    4
## 5 <resample> <resample>    5
As an example, say you have sampled three bootstrap data sets. The samples are in the strap column, so we can map over it and fit a linear model to each sampled data set.
samples <- bootstrap(x, 3)
fitted_models <- samples$strap %>%
  map(~ lm(y ~ x, data = .))
Then we can map over the three models and inspect them using broom's glance() function:
fitted_models %>%
  map(glance) %>%
  bind_rows()
## # A tibble: 3 x 11
##   r.squared adj.r.squared sigma statistic  p.value
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl>
## 1     1.000         1.000 0.171   160710.  3.42e-8
## 2     1.000         1.000 0.513    16835.  1.01e-6
## 3     1.000         1.000 0.485    17065.  9.89e-7
## # … with 6 more variables: df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>
If we were interested in the empirical distribution of the coefficient for x, we could extract the estimates from the bootstrapped fits and go from there.
get_x <- function(m) {
  # Extract the estimated coefficient for the `x` term.
  tidy(m) %>%
    filter(term == "x") %>%
    pull(estimate)
}
fitted_models %>% map_dbl(get_x)
## [1] 179.3946 180.6586 180.0705
If you want to compare models, rather than samples of your data, then modelr has support for that as well. You can make a list of the formulae you want to fit. The formulae() function lets you create such a list.
models <- formulae(~y, linear = ~x, quadratic = ~I(x^2) + x)

The first argument is the response variable (the left-hand side of the formulae), and the remaining arguments are (named) parameters that describe the explanatory variables, the right-hand sides of the model formulae.
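The result is a named list of two-sided formulas, each combining the shared response with one right-hand side; a quick illustration of my own:
models$linear    # y ~ x
models$quadratic # y ~ I(x^2) + x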

If you call fit_with() with your data, the fitting function to use (here lm()), and the formulae you wish to fit, then you get what you want: a fit for each formula.
fits <- fit_with(x, lm, models)
fits %>% map(glance) %>% bind_rows()
## # A tibble: 2 x 11
##   r.squared adj.r.squared sigma statistic    p.value
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl>
## 1     1.000         1.000 0.433    24589.    5.72e-7
## 2     1.000         1.000 0.527     8310.    1.20e-4
## # … with 6 more variables: df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>
You will find many model quality measures in modelr, for example, root-mean-square error:
fits %>% map_dbl(rmse, data = x)
##    linear quadratic
## 0.3354516 0.3331588
mean absolute error:
fits %>% map_dbl(mae, data = x)
##    linear quadratic
## 0.2826265 0.2595208

and many more.
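For instance, modelr also provides rsquare(), which reports the variance explained by a fit; this example is my addition, analogous to the calls above.
fits %>% map_dbl(rsquare, data = x)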

Since overfitting is always a problem, you might want a quality measure that at least attempts to take model complexity into account. You will find some in the output of broom's glance() function.
fits %>% map_dbl(~ glance(.x)$AIC)
##   linear quadratic
## 9.266611 11.198024
If at all possible, however, you want to use test data to measure how well a model generalizes. For this, you first need to fit your models to the training data and then make predictions on the test data. In the following example, I fit lm(y ~ x) to each leave-one-out training set and then apply the fits to the corresponding test sets, measuring the quality of the generalization using RMSE.
samples <- crossv_loo(x)
training_fits <-
  samples$train %>% map(~lm(y ~ x, data = .))
test_measurement <- training_fits %>%
  map2_dbl(samples$test, rmse)
test_measurement
##         1         2         3         4         5
## 0.2618250 0.5535395 0.1963251 0.8401988 0.3776097
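A natural next step, sketched here with only the functions introduced above (my addition, not from the original text), is to collapse the per-fold errors into a single cross-validated score so that competing models can be compared:
# Fit the quadratic model on the same training sets and compare
# the mean leave-one-out RMSE of the two models.
training_fits2 <- samples$train %>%
  map(~ lm(y ~ I(x^2) + x, data = .))
tibble(
  linear    = mean(test_measurement),
  quadratic = mean(map2_dbl(training_fits2, samples$test, rmse))
)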