Model selection tools – the best subsets regression

We will use the sleep duration dataset to illustrate the use of best subsets regression as a model reduction tool. Best subsets regression checks all possible linear models and displays the best models for each number of predictors.

When running the regression of sleep duration in the Multiple regression with linear predictors recipe, we observed that the residuals showed unequal variance. We will first transform the data by taking the natural log of the sleep duration response. The transformed response should show roughly constant residual variance (homoscedasticity).

We will set best subsets regression to show only the best model for each number of predictors, from one up to the full set. After identifying a model, we will fit it using Fit Regression Model….

Getting ready

The data is available at the following link from StatSci.org:

http://www.statsci.org/data/general/sleep.html

The data is tab delimited and can be copied directly into the worksheet.

How to do it…

The following steps will transform the sleep duration response by taking the natural log of the results. Then, best subsets regression is used to identify a regression model to use.

  1. Go to the Calc menu and select Calculator….
  2. Enter a column name of Ln(Sleep) in the section for Store result in variable: to create the transformed data.
  3. Enter the expression as shown in the following screenshot:
    [Screenshot: Calculator dialog box with the natural log expression]
  4. Check the Assign as a formula option and click on OK.
  5. Navigate to Stat | Regression | Regression and select Best Subsets Regression….
  6. Enter the Ln(Sleep) column in the Response: section.
  7. Enter the following columns as Free Predictors:, as shown in the following screenshot:
    [Screenshot: Best Subsets Regression dialog box with the Free Predictors]
  8. Click on Options… and change the Models of each size to print: section from 2 to 1.
  9. Click on OK in each dialog box.
  10. Check the results in the session window, as shown in the following screenshot:
    [Screenshot: best subsets results in the session window]
  11. Compare the models, looking for the highest R-Sq(adj), the lowest S (the standard deviation of the residuals), and a Mallows Cp close to the number of predictors plus one for the constant. Find the row that best meets these criteria.
  12. Navigate to Stat | Regression | Regression and then select Fit Regression Model….
  13. Enter the Ln(Sleep) column as the response.
  14. In the Model: section, enter the columns for our chosen predictors as selected from the best subset results, which are Gestation, Predation, and Danger.
  15. To check the assumptions of running a regression, create residual plots by clicking on the Graphs… button and selecting Normal plot of residuals and Residuals versus fits.
  16. Click on OK in each dialog box to run the regression.
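
The statistics compared in step 11 can be computed by hand from sums of squares. The following is a minimal sketch in Python, not Minitab output: the function name `selection_metrics` and all of the numbers are made up for illustration, and Mallows Cp is taken in its usual textbook form, where the parameter count includes the constant.

```python
import math

def selection_metrics(sse, sst, n, k, mse_full):
    """Summary statistics for a model with k predictors plus a constant.

    sse: residual sum of squares of the candidate model
    sst: total sum of squares of the response
    n: number of observations
    mse_full: residual mean square of the model containing all predictors
    """
    p = k + 1  # number of parameters, counting the constant
    r_sq = 1 - sse / sst
    r_sq_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
    s = math.sqrt(sse / (n - p))
    mallows_cp = sse / mse_full - (n - 2 * p)
    return {"R-Sq": r_sq, "R-Sq(adj)": r_sq_adj, "S": s, "Cp": mallows_cp}

# Made-up sums of squares, purely for illustration (not the sleep data).
n, sst = 50, 100.0
sse_full, k_full = 30.0, 5
mse_full = sse_full / (n - (k_full + 1))

full = selection_metrics(sse_full, sst, n, k_full, mse_full)
reduced = selection_metrics(32.0, sst, n, 3, mse_full)
print(full)
print(reduced)
```

With this definition, the full model always has a Cp equal to its own parameter count, so Cp is only informative for the reduced models.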

How it works…

In the previous recipe, the residuals versus fits plot appeared to show funneling of the residuals. This suggests that the variance changes with the predicted values. By taking the natural log of the recorded sleep duration, we aim to keep the variance of the residuals roughly constant across the fitted values. The following screenshot shows the results of the regression coefficients in Minitab:

[Screenshot: regression coefficients for the Ln(Sleep) model]
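
The variance-stabilizing effect of a log transform can be illustrated with simulated data. The following is a minimal sketch in Python, assuming a response whose errors are multiplicative; the data is randomly generated and is not the sleep dataset, and the helper names `fit_line` and `spread_ratio` are made up for this example. It compares the spread of the residuals at high fitted values with the spread at low fitted values, before and after taking the natural log.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def spread_ratio(x, y):
    """SD of residuals at high fitted values divided by SD at low fitted values."""
    b0, b1 = fit_line(x, y)
    # Sort the (fitted value, residual) pairs by fitted value.
    pairs = sorted((b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
    resid = [r for _, r in pairs]
    half = len(resid) // 2
    return statistics.stdev(resid[half:]) / statistics.stdev(resid[:half])

# Simulated response with multiplicative errors (an assumption of this sketch).
random.seed(1)
x = [random.uniform(0, 3) for _ in range(400)]
y = [math.exp(0.5 + xi + random.gauss(0, 0.5)) for xi in x]

raw_ratio = spread_ratio(x, y)
log_ratio = spread_ratio(x, [math.log(yi) for yi in y])
print(f"raw response: {raw_ratio:.2f}, log response: {log_ratio:.2f}")
```

A ratio well above 1 corresponds to the funnel shape in a residuals versus fits plot, while a ratio near 1 indicates a roughly constant spread.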

The following chart compares the residuals from the TotalSleep response with those from the natural log of TotalSleep. Using the natural log of the sleep durations also gives a slightly improved fit, as measured by R-Sq(pred) and R-Sq(adj).

[Chart: residual plots for TotalSleep and for the natural log of TotalSleep]

Fit Regression Model… can apply a Box-Cox transformation to the original response as part of the analysis. This is found inside the Options… section. As best subsets regression does not have this option, we need to use the calculator to transform the response.
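
For reference, the natural log is the lambda = 0 case of the Box-Cox family. The following is a minimal sketch of the textbook form of the transformation in Python; the function name `box_cox` and the data values are made up, and statistical packages may use a differently scaled version (for example, the plain power of the response), so treat this as an illustration rather than as Minitab's exact calculation.

```python
import math

def box_cox(y, lam):
    """Textbook Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(y)  # the natural log used in this recipe
    return (y ** lam - 1) / lam

sleep_hours = [3.3, 8.3, 12.5, 16.5]  # made-up values, not the sleep data
print([round(box_cox(v, 0), 3) for v in sleep_hours])
print([round(box_cox(v, 0.5), 3) for v in sleep_hours])
```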

The best subsets procedure identifies the models with the highest R-squared values and displays them in the session window. The options for this tool let us choose how many models are displayed for each number of predictors. By default, the two best models are displayed with one predictor, then two, then three, and so on until we reach the full model.
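
The idea behind the procedure can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not Minitab's algorithm: it fits every combination of predictors by ordinary least squares and keeps the models with the highest R-squared for each size. The data is simulated, and the names `solve`, `r_squared`, `best_subsets`, and `per_size` are made up for this example.

```python
import itertools
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, y):
    """R-squared of a least-squares fit of y on the given columns plus a constant."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def best_subsets(data, y, per_size=2):
    """Return {size: [(r_sq, names), ...]} with the top per_size models for each size."""
    best = {}
    for size in range(1, len(data) + 1):
        scored = [
            (r_squared([data[name] for name in names], y), names)
            for names in itertools.combinations(sorted(data), size)
        ]
        best[size] = sorted(scored, reverse=True)[:per_size]
    return best

# Simulated data: y depends strongly on x1, weakly on x2, and not at all on x3.
random.seed(0)
n = 60
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 + 4 * a + b + random.gauss(0, 0.1) for a, b in zip(data["x1"], data["x2"])]

for size, models in best_subsets(data, y).items():
    for r_sq, names in models:
        print(size, names, round(r_sq, 3))
```

Setting `per_size=1` mirrors the change made in step 8 of this recipe. The number of candidate models doubles with every added predictor, so an exhaustive search like this one is only practical for a small set of columns.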

Entering columns as Free Predictors allows these variables to be included or excluded from the model. Entering a variable into the Predictors in all models: section will force the best subsets regression to always include this variable.
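
The difference between the two kinds of predictor can be sketched as a list of candidate models. The following Python sketch is an illustration only: the function name `candidate_models` is made up, the choice of Danger as the forced column is arbitrary, and the model containing only the forced columns is left out, which may differ from what the software reports.

```python
import itertools

def candidate_models(free, forced=()):
    """List predictor sets: forced columns always, free columns in every combination."""
    models = []
    for size in range(1, len(free) + 1):
        for chosen in itertools.combinations(free, size):
            models.append(tuple(forced) + chosen)
    return models

# Danger is forced into every model; Gestation and Predation are free to enter or leave.
for model in candidate_models(["Gestation", "Predation"], forced=["Danger"]):
    print(model)
```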

Best subsets will only look for linear model terms. Interactions or quadratics cannot be specified in this dialog box.

See also

  • The Model selection tools – the stepwise regression recipe
  • The Multiple regression with linear predictors recipe