Time for action - model development

Let us continue to the most extensive phase of our data analysis, which consists of developing the optimal regression model for our situation. Ultimately, we want to predict the performance rating of the Shu army under potential fire attack strategies. From our previous exploration of the data, we have reason to believe that successful execution greatly influences the outcome of battle. We can also infer that the duration of a battle has some impact on its outcome. At the same time, it appears that the number of soldiers engaged in battle does not have a large impact on the result. However, since the numbers of Shu and Wei soldiers themselves are highly correlated, there is a potential interaction effect between the two that is worth investigating. We will start by using our insights to create a set of potentially useful models:

  1. Use the glm(formula, data) function to create a series of potential linear models that predict the Rating of battle (dependent variable) using one or more of the independent variables in our dataset. Then, use the summary(object) command to assess the statistical significance of each model:
    > #create a linear regression model using the
    > #glm(formula, data) function
    > #predict the rating of battle using execution
    > lmFireRating_Execution <- glm(Rating ~ SuccessfullyExecuted,
    data = subsetFire)
    > #generate a summary of the model
    > lmFireRating_Execution_Summary <-
    summary(lmFireRating_Execution)
    > #display the model summary
    > lmFireRating_Execution_Summary
    > #keep execution in the model as an independent variable
    
  2. Our first model used only the successful (or unsuccessful) execution of battle plans to predict the performance of the Shu army in a fire attack. Our summary tells us that execution is an important factor to include in the model.

    Note

    For a review of regression model interpretation, refer to the Regression section of Chapter 5.

  3. Now, let us examine the impact that the duration of battle has on our model:
    > #predict the rating of battle using execution and duration
    > lmFireRating_ExecutionDuration <-
    glm(Rating ~ SuccessfullyExecuted + DurationInDays,
    data = subsetFire)
    > #generate a summary of the model
    > lmFireRating_ExecutionDuration_Summary <-
    summary(lmFireRating_ExecutionDuration)
    > #display the model summary
    > lmFireRating_ExecutionDuration_Summary
    > #keep duration in the model as an independent variable
    
  4. This model added the duration of battle to execution as a predictor of the Shu army's rating. Here, we found that duration is also an important predictor that should be included in the model.
  5. Next, we will inspect the prospects of including the number of Shu and Wei soldiers as predictors in our model:
    > #predict the rating of battle using execution, duration,
    > #and the number of Shu and Wei soldiers engaged
    > lmFireRating_ExecutionDurationSoldiers <-
    glm(Rating ~ SuccessfullyExecuted + DurationInDays +
    ShuSoldiersEngaged + WeiSoldiersEngaged, data = subsetFire)
    > #generate a summary of the model
    > lmFireRating_ExecutionDurationSoldiers_Summary <-
    summary(lmFireRating_ExecutionDurationSoldiers)
    > #display the model summary
    > lmFireRating_ExecutionDurationSoldiers_Summary
    > #drop the number of Shu and Wei soldiers from the model
    > #as independent variables
    
  6. This time, we added the number of Shu and Wei soldiers into our model, but determined that they were not significant enough predictors of the Shu army's performance. Therefore, we elected to exclude them from our model.
  7. Lastly, let us investigate the potential interaction effect between the number of Shu and Wei soldiers:
    > #investigate a potential interaction effect between the
    > #number of Shu and Wei soldiers
    > #center each variable by subtracting its mean from each
    > #of its values
    > centeredShuSoldiersFire <- subsetFire$ShuSoldiersEngaged -
    mean(subsetFire$ShuSoldiersEngaged)
    > centeredWeiSoldiersFire <- subsetFire$WeiSoldiersEngaged -
    mean(subsetFire$WeiSoldiersEngaged)
    > #multiply the two centered variables to create the
    > #interaction variable
    > interactionSoldiersFire <- centeredShuSoldiersFire *
    centeredWeiSoldiersFire
    > #predict the rating of battle using execution, duration,
    > #and the interaction between the number of Shu and Wei
    > #soldiers engaged
    > lmFireRating_ExecutionDurationShuWeiInteraction <-
    glm(Rating ~ SuccessfullyExecuted + DurationInDays +
    interactionSoldiersFire, data = subsetFire)
    > #generate a summary of the model
    > lmFireRating_ExecutionDurationShuWeiInteraction_Summary <-
    summary(lmFireRating_ExecutionDurationShuWeiInteraction)
    > #display the model summary
    > lmFireRating_ExecutionDurationShuWeiInteraction_Summary
    > #keep the interaction between the number of Shu and Wei
    > #soldiers engaged in the model as an independent variable
    
  8. We can see that the interaction effect between the number of Shu and Wei soldiers does have a meaningful impact on our model and should be included as an independent variable.
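The centering step used above can be checked on any numeric column. The following is a minimal sketch that uses R's built-in mtcars dataset as a stand-in (it is not part of our battle data); the names centeredWeight, centeredHorsepower, and interactionDemo are illustrative only.

```r
# mtcars ships with R; it stands in for subsetFire in this sketch
centeredWeight <- mtcars$wt - mean(mtcars$wt)
centeredHorsepower <- mtcars$hp - mean(mtcars$hp)

# a centered variable has a mean of (numerically) zero
mean(centeredWeight)

# scale(x, scale = FALSE) performs the same centering in one call;
# as.numeric() drops the matrix attributes that scale() adds
centeredWeightViaScale <- as.numeric(scale(mtcars$wt, scale = FALSE))
all.equal(centeredWeight, centeredWeightViaScale)

# the interaction variable is the element-wise product of the
# two centered variables, one value per row of the dataset
interactionDemo <- centeredWeight * centeredHorsepower
length(interactionDemo) == nrow(mtcars)
```

Subtracting the mean by hand, as we did in the activity, and calling scale() are interchangeable here; the manual form simply makes the arithmetic visible.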

Note

Note that some statisticians may argue that it is inappropriate to include an interaction variable between the Shu and Wei soldiers in this model, without also including the number of Shu and Wei soldiers alone as variables in the model. In this fictitious example, there is no practically significant difference between these two options, and therefore, the interaction term has been included alone for the sake of simplicity and clarity. However, were you to incorporate interaction effects into your own regression models, you are advised to thoroughly investigate the implications of including or excluding certain variables.
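As a sketch of the two options described in this note, R's formula syntax can build an interaction either together with its main effects or on its own. The example uses the built-in mtcars dataset as a stand-in, with wt and hp playing the role of the two soldier counts; in our data, the equivalent term would be ShuSoldiersEngaged * WeiSoldiersEngaged. Note that these operators multiply the variables as they are, so centering (as in our activity) would need to be applied beforehand.

```r
# mtcars ships with R; wt and hp stand in for the two soldier counts

# a * b expands to a + b + a:b, so both main effects are kept
# alongside the interaction term
mainAndInteraction <- glm(mpg ~ wt * hp, data = mtcars)
names(coef(mainAndInteraction))

# a:b on its own adds only the interaction term
interactionOnly <- glm(mpg ~ wt:hp, data = mtcars)
names(coef(interactionOnly))
```

Comparing the summaries of both forms is one way to investigate whether dropping the main effects changes the conclusions.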

We have identified four potential models. To determine which of these is most appropriate for predicting the outcome of our fire attack, we will compare them using a measure known as the Akaike Information Criterion, or AIC:

> #use the AIC(object, ...) function to compare the models
> #and choose the most appropriate one
> #when comparing via AIC, the lowest value indicates the
> #best statistical model
> AIC(lmFireRating_Execution, lmFireRating_ExecutionDuration,
lmFireRating_ExecutionDurationSoldiers,
lmFireRating_ExecutionDurationShuWeiInteraction)
> #according to AIC, our model that includes execution, duration, and the interaction effect is best

The AIC comparison showed that our model containing execution, duration, and the interaction between the number of Shu and Wei soldiers is the best choice for predicting the performance of the Shu army.

What just happened?

We just completed the process of developing potential regression models and comparing them in order to choose the best one for our analysis. Through this process, we determined that successful execution, duration, and the interaction between the number of Shu and Wei soldiers engaged were statistically significant independent variables, whereas the numbers of Shu and Wei soldiers alone were not. By comparing AIC values, we were able to determine that the model containing all three statistically significant variables was best for predicting the Shu army's performance in fire attacks. Therefore, our final regression equation is as follows:

Rating = 37 + 56 * execution - 1.24 * duration - 0.00000013 * soldiers interaction
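To see how the equation is used, it can be wrapped in a small function. This is a sketch only: predictFireRating is a hypothetical helper, the coefficients are the rounded values from the equation above, and the input scenario is invented for illustration.

```r
# Coefficients copied from the final regression equation (rounded values)
predictFireRating <- function(execution, duration, soldiersInteraction) {
  37 + 56 * execution - 1.24 * duration - 0.00000013 * soldiersInteraction
}

# Hypothetical scenario: a successfully executed attack (1), a 10-day
# battle, and an interaction term of 0, which occurs when either army
# is at its mean size (the interaction is a product of centered values)
exampleRating <- predictFireRating(execution = 1, duration = 10,
                                   soldiersInteraction = 0)
exampleRating
```

For this scenario the arithmetic is 37 + 56 - 12.4, giving a predicted rating of 80.6. Because the coefficients are rounded, results from the fitted model object itself may differ slightly.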

Note

For a more detailed discussion of model development, refer to the Regression section of Chapter 5.

glm(...)

Each of the models in this chapter was created using the glm(formula, data) function. For our purposes, this function is identical in structure and very similar in effect to the lm(formula, data) function that we are already familiar with from Chapter 5: when no family argument is supplied, glm() assumes a gaussian family and fits an ordinary linear regression. We used glm(formula, data) here to demonstrate an alternative R function for creating regression models. In your own work, the appropriate function will be determined by the requirements of your analysis.
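The similarity between the two functions can be checked directly. The following minimal sketch fits the same formula with both, using R's built-in mtcars dataset as a stand-in for our battle data; lmFit and glmFit are illustrative names.

```r
# mtcars ships with R; it stands in for subsetFire in this sketch
lmFit <- lm(mpg ~ wt + hp, data = mtcars)

# no family argument is given, so glm() uses its default (gaussian)
glmFit <- glm(mpg ~ wt + hp, data = mtcars)

# compare the estimated coefficients within numerical tolerance
sameCoefficients <- all.equal(coef(lmFit), coef(glmFit))
sameCoefficients
```

The two summaries are laid out differently (for example, glm reports deviance rather than R-squared), but the fitted coefficients agree.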

You may also have noticed that our glm(formula, data) calls listed only the variable names in the formula argument. This is a shorthand method for referring to our dataset's column names, as demonstrated by the following code:

lmFireRating_ExecutionDuration <- glm(Rating ~
SuccessfullyExecuted + DurationInDays, data = subsetFire)

Notice that the subsetFire$ prefix is absent from each variable name and that the data argument has been defined as subsetFire. When the data argument is used, and the independent variables in the formula argument are unique, the dataset$ prefix may be omitted. This technique has the effect of keeping our code more readable, without changing the results of our calculations.
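As a quick check that the shorthand does not change the results, the sketch below fits one model with the dataset$ prefix and one with the data argument. It uses R's built-in mtcars dataset as a stand-in; withPrefix and withData are illustrative names.

```r
# long form: every variable carries the dataset$ prefix
withPrefix <- glm(mtcars$mpg ~ mtcars$wt + mtcars$hp)

# shorthand: the data argument tells glm() where to find the columns
withData <- glm(mpg ~ wt + hp, data = mtcars)

# the estimates are the same; only the coefficient labels differ
sameEstimates <- all.equal(unname(coef(withPrefix)), unname(coef(withData)))
sameEstimates
names(coef(withPrefix))
names(coef(withData))
```

The shorthand also produces cleaner coefficient labels in the model summary, which is one more reason to prefer it.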

AIC(object, ...)

AIC can be used to compare regression models. It yields a series of AIC values, one per model, which indicate how well each model fits our data while penalizing models that use more parameters. AIC values are only meaningful relative to each other: among models fitted to the same data, the one with the lowest AIC value best represents our data.
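For readers curious about where an AIC value comes from, it is defined as 2k - 2 * log-likelihood, where k is the number of estimated parameters. The sketch below reproduces that calculation by hand on a model fitted to R's built-in mtcars dataset (a stand-in for our data); demoFit and manualAIC are illustrative names.

```r
# mtcars ships with R; it stands in for subsetFire in this sketch
demoFit <- glm(mpg ~ wt + hp, data = mtcars)

# log-likelihood of the fitted model
demoLogLik <- logLik(demoFit)

# number of estimated parameters: three coefficients plus the
# residual variance of the gaussian model
k <- attr(demoLogLik, "df")

# AIC = 2k - 2 * log-likelihood
manualAIC <- 2 * k - 2 * as.numeric(demoLogLik)

# should agree with R's own AIC() function
all.equal(manualAIC, AIC(demoFit))
```

The 2k term is what discourages adding variables that barely improve the fit.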

Similar in structure to the anova(object, ...) function, the AIC(object, ...) function accepts a series of objects (regression models in our case) as input. For example, in AIC(A, B, C) we are telling R to compare three objects (A, B, and C) using AIC. Thus, our AIC function compared the four regression models that we created:

> AIC(lmFireRating_Execution, lmFireRating_ExecutionDuration,
lmFireRating_ExecutionDurationSoldiers,
lmFireRating_ExecutionDurationShuWeiInteraction)

As output, AIC(object, ...) returned a series of AIC values used to compare our models.
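When several models are passed in, the result is a small table that can be inspected programmatically. The following sketch uses three illustrative models (modelA, modelB, and modelC, all assumptions made for this example) fitted to R's built-in mtcars dataset, and picks out the row with the lowest AIC.

```r
# three nested models on mtcars, standing in for our four battle models
modelA <- glm(mpg ~ wt, data = mtcars)
modelB <- glm(mpg ~ wt + hp, data = mtcars)
modelC <- glm(mpg ~ wt + hp + qsec, data = mtcars)

# AIC() returns a data frame with one row per model and
# two columns: df (parameters estimated) and AIC
aicTable <- AIC(modelA, modelB, modelC)
aicTable

# name of the model with the lowest AIC value
bestModelName <- rownames(aicTable)[which.min(aicTable$AIC)]
bestModelName
```

Reading the table by eye works just as well for a handful of models; which.min() becomes convenient when comparing many candidates.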

Recall that we compared our regression models in Chapter 5 using anova(object, ...). To demonstrate an alternative R function that can be used to compare models, we used AIC(object, ...) in this activity. The glm(...) function coordinates well with AIC(object, ...), hence our decision to use them together in this example. Again, the appropriate techniques to use in your future analyses should be determined by the specific conditions surrounding your work.

Pop quiz

  1. When can the dataset$ prefix be omitted from the variables in the formula argument of lm(formula, data) and glm(formula, data)?

    a. When the data argument is defined.

    b. When the data argument is defined and all of the variables come from different datasets.

    c. When the data argument is defined and all of the variables have unique names.

    d. When the data argument is defined, all of the variables come from different datasets, and all of the variables have unique names.

  2. Which of the following is not true of the anova(object, ...) and AIC(object, ...) functions?

    a. Both can be used to compare regression models.

    b. Both receive the same arguments.

    c. Both represent different statistical methods.

    d. Both yield identical mathematical results.
