Time for action - modelling with simple linear regression

Simple linear regression is the most basic form of regression analysis. It uses a single independent variable to predict the outcome of a single dependent variable.

To begin experimenting with regression analysis in R, let us create a simple linear model from our head to head combat data:

  1. Use the lm(formula, data) function to create a linear regression model where Rating is the dependent variable and ShuSoldiersEngaged is the independent variable. This is done as follows:
    > #create a linear regression model using the lm(formula, data)
    > #predict the rating of a head to head battle using the number
    > #of Shu soldiers engaged
    > lmHeadToHeadRating_ShuSoldiers <- lm(subsetHeadToHead$Rating ~
    subsetHeadToHead$ShuSoldiersEngaged, subsetHeadToHead)
    
  2. Display the contents of your linear model variable in the R console:
    > #display the contents of the model
    > lmHeadToHeadRating_ShuSoldiers
    
  3. Create a summary of the model, as follows:
    > #create the model summary
    > lmHeadToHeadRating_ShuSoldiers_Summary <-
    summary(lmHeadToHeadRating_ShuSoldiers)
    
  4. Display the contents of your linear model summary in the R console:
    > #display the model summary
    > lmHeadToHeadRating_ShuSoldiers_Summary
    

What just happened?

Your first linear regression model yielded quite a bit of information. Let us look at how to use the lm(formula, data) function as well as how to interpret the information that it provides to us.

lm(formula, data)

The lm(formula, data) function is used to create a linear regression model. The formula argument takes on the following structure:

dVar ~ iVar1 + iVar2 + ... + iVarn

Here, dVar is the dependent variable and iVar1 through iVarn are independent variables. While our initial model used a single independent variable, the linear model function is capable of accepting as many as we need. The data argument contains the dataset from which our variables are taken. Hence, the basic composition of the lm(formula, data) function resembles the following:

lm(dVar ~ iVar1 + iVar2 + ... + iVarn, data)

In our simple linear regression model, Rating acted as the dependent variable and ShuSoldiersEngaged took on the role of the independent variable, as shown:

> lmHeadToHeadRating_ShuSoldiers <- lm(subsetHeadToHead$Rating ~
subsetHeadToHead$ShuSoldiersEngaged, subsetHeadToHead)
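
As a side note, a formula can also refer to columns by their bare names, because lm() looks them up in the data argument. The following is a minimal sketch under stated assumptions: the battles data frame is hypothetical, standing in for subsetHeadToHead, and its values are randomly generated rather than taken from the battle history. It shows that both styles of formula estimate the same coefficients:

```r
# Hypothetical stand-in for subsetHeadToHead; values are simulated, not real battle data
set.seed(42)
battles <- data.frame(ShuSoldiersEngaged = round(runif(30, 1000, 100000)))
battles$Rating <- 30 + 0.0004 * battles$ShuSoldiersEngaged + rnorm(30, sd = 15)

# Style used in this section: columns referenced with $
lmDollar <- lm(battles$Rating ~ battles$ShuSoldiersEngaged, battles)

# Equivalent style: bare column names, resolved inside the data argument
lmBare <- lm(Rating ~ ShuSoldiersEngaged, data = battles)

# Both models estimate the same intercept and slope
coef(lmDollar)
coef(lmBare)
```

The bare-name style tends to be more convenient later on, since predict(model, newdata) can then match the column names in new data against the names used in the formula.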

Linear model output

Together, we formed a linear model that regressed the Shu army's head to head combat performance rating (the dependent or predicted variable) on the number of Shu soldiers engaged in battle (the independent or predictor variable). When we called our linear model variable, we received the following output from the R console:

[Screenshot: linear model output]

This output consists of two sections. In Call:, we see a reiteration of the console line that R used to create the model. In Coefficients:, we see both an intercept and a coefficient for the number of Shu soldiers engaged. The latter two items help us to create a regression equation. Typically, a regression equation takes on the following form:

Y = b0 + b1X1 + b2X2 + ... + bnXn

In this equation, Y is the dependent variable, b0 is the intercept, X1 through Xn are the independent variables, and b1 through bn are their coefficients. Thus, the equation for our model is as follows:

Rating = 31 + 0.00044 * number of Shu soldiers
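
To see how the intercept and coefficient map onto the regression equation, the following sketch pulls them out of a model with coef() and compares a hand-computed prediction against predict(). The battles data frame, its simulated values, and the 50,000-soldier scenario are all assumptions made for illustration; they are not part of the battle history dataset:

```r
# Hypothetical data; simulated values, not the actual battle history
set.seed(1)
battles <- data.frame(ShuSoldiersEngaged = round(runif(30, 1000, 100000)))
battles$Rating <- 30 + 0.0004 * battles$ShuSoldiersEngaged + rnorm(30, sd = 15)

model <- lm(Rating ~ ShuSoldiersEngaged, data = battles)

# coef() returns a named vector: b0 (the intercept) and b1 (the slope)
b0 <- coef(model)[["(Intercept)"]]
b1 <- coef(model)[["ShuSoldiersEngaged"]]

# Apply Y = b0 + b1 * X1 by hand for an assumed force of 50,000 soldiers
soldiers <- 50000
manualRating <- b0 + b1 * soldiers

# predict() evaluates the same equation for us
predictedRating <- predict(model, newdata = data.frame(ShuSoldiersEngaged = soldiers))

manualRating
unname(predictedRating)
```

The two printed values should agree, which confirms that the regression equation is simply the intercept plus the coefficient multiplied by the predictor.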

Linear model summary

After displaying the model output, you also created a more detailed summary using the summary(object) function:

> lmHeadToHeadRating_ShuSoldiers_Summary <-
summary(lmHeadToHeadRating_ShuSoldiers)

Note

For a discussion of the summary(object) function, revisit the Deriving summary statistics section of this chapter.

[Screenshot: linear model summary]

Again, you have witnessed the value and versatility of the summary(object) function, as it adapted itself to generate output relevant to our regression model. In the output, you can see the same intercept and independent variable coefficients (Estimate column) that we derived from the default model output. However, you are also exposed to a wealth of additional information about the model. In fact, nearly everything you would need to know for a data analysis is included. For our interpretations, we will focus on the Coefficients:, Multiple R-squared, and p-value/Pr(>|t|) portions of the output.

In case you need to be refreshed on the meaning of R-squared and p-values, we will briefly review them here:

  • R-squared (Multiple R-squared in the summary output) tells us how well our linear model fits our data, and thus, how much predictive power our model has. Technically, it is the percentage of variance in the dependent variable that is accounted for by a regression model. For example, the R-squared of our linear model tells us how much of the variance in the performance rating of a head to head conflict can be accounted for by the number of Shu soldiers engaged in that battle.
  • A p-value (Pr(>|t|) and p-value in the summary output) is an indicator of statistical significance. In common practice, a cutoff of 0.05 is used: a p-value below 0.05 is considered statistically significant. Both individual coefficients and the overall linear model have p-values. In general, it is better to have significant coefficients and models, because a small p-value indicates that a result at least as strong as ours would be unlikely to arise by random chance alone if no real relationship existed. Yet, statistical significance is not the be-all and end-all of data analysis. Since data neither think nor act, one must always remember to consider the practical implications of statistical findings. We will remain diligent in assessing the practical significance of our work throughout this book.
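
Both quantities can also be read directly from the summary object rather than from the printed output. The sketch below uses simulated data (the battles data frame and its values are hypothetical) and relies on the r.squared, coefficients, and fstatistic components that summary() returns for a linear model:

```r
# Hypothetical data; simulated values, not the actual battle history
set.seed(7)
battles <- data.frame(ShuSoldiersEngaged = round(runif(30, 1000, 100000)))
battles$Rating <- 30 + 0.0004 * battles$ShuSoldiersEngaged + rnorm(30, sd = 15)

modelSummary <- summary(lm(Rating ~ ShuSoldiersEngaged, data = battles))

# Multiple R-squared: proportion of variance in Rating explained by the model
rSquared <- modelSummary$r.squared

# p-value of the slope, taken from the Pr(>|t|) column of the coefficient table
slopeP <- modelSummary$coefficients["ShuSoldiersEngaged", "Pr(>|t|)"]

# Overall model p-value, computed from the F statistic and its degrees of freedom
f <- modelSummary$fstatistic
modelP <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

rSquared
slopeP
modelP
```

With a single independent variable, the coefficient's p-value and the overall model's p-value coincide, because the model's F test and the coefficient's t test are assessing the same relationship.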

Interpreting a linear regression model

Sound interpretation is essential to understanding the practical ramifications of our data analyses. Recall that our linear regression analysis yielded the following equation:

Y = 31 + 0.00044 * X1

Or in words:

Rating = 31 + 0.00044 * number of Shu soldiers

Look back at the Rating column of our original battle history dataset. Rating can take on a value between 0 and 100. Since we are interested in predicting the Shu army's performance, the closer our predicted rating comes to 100, the more confident we will be that our battle plans will lead to victory. Conversely, the lower our predicted rating, the less confident we can be that our strategy is going to lead to beneficial outcomes.

In fact, it is clear from our data that Zhuge Liang rated the army's performance at or above 80 in victorious battles, whereas he rated the army lower in conflicts that resulted in defeat. Therefore, 80 is a good rating threshold to keep in mind when predicting future battle performance. In general, we want to devise strategies that will predict a performance of 80 or higher.

A model's intercept is interpreted as the value of the dependent variable when all independent variables are equal to zero. The intercept of a linear regression model often does not have an intuitive meaning. In our case, the intercept of 31 suggests that our performance will somehow be greater than zero even if we do not send soldiers into battle. Nevertheless, the intercept impacts our overall model and is important for making predictions.

Our coefficient for the number of Shu soldiers engaged is 0.00044. As you can imagine, it would take quite a large force to predict a sufficient performance rating for victory using our model. This notion is demonstrated by the following calculation, which solves for the number of soldiers necessary to predict a rating of 80:

80 = 31 + 0.00044 * X1
49 = 0.00044 * X1
X1 = 111,364 soldiers needed to predict victory!
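
The same arithmetic can be checked in the R console. This sketch simply plugs in the rounded intercept (31), the rounded coefficient (0.00044), and the rating threshold of 80 quoted above; the variable names are made up for illustration:

```r
# Rounded values quoted in the text
intercept <- 31
slope <- 0.00044
targetRating <- 80

# Rearrange targetRating = intercept + slope * X1 to solve for X1
soldiersNeeded <- (targetRating - intercept) / slope

# Round up, since a fraction of a soldier cannot be sent into battle
ceiling(soldiersNeeded)
```

Because the inputs are rounded, the result is only as precise as the two-significant-figure coefficient allows.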

This suggests that over half of the entire Shu army of 200,000 would need to participate in a single battle just to reach our minimum rating threshold. Yet, recall that our current model only deals with head to head combat performance and only uses the number of Shu soldiers engaged to predict it.

While both our coefficient and overall model are statistically significant with p-values of 0.02, there is much that is left unexplained. This is evident when considering our R-squared value of 0.18. This value means that only 18% of the variance in performance rating can be explained by our model. In a practical sense, differences in the number of Shu soldiers engaged account for only 18% of the battle-to-battle variation in head to head ratings; the remaining 82% is left to factors outside the model.

All in all, our interpretations indicate that the current model is not effective enough at predicting the Shu army's performance. Clearly, there are many other factors that account for performance besides the number of soldiers that we send into battle. Thankfully, we have a dataset that contains rich battle history information and the ability to form more complex multiple linear regression models. Thus, the analysis of our battle data has just begun.

Pop quiz

  1. Which of the following represents proper syntax for use in the formula argument of the lm(formula, data) function?

    a. Y ~ X1 - X2

    b. Y ~ X1 + X2

    c. X ~ Y1 + Y2

    d. X ~ Y1 - Y2

  2. In the following linear regression equation, identify the dependent variable, independent variable, intercept, and coefficient: Y = 0.5 + 3 * X

    a. Y is the dependent variable, X is the independent variable, 0.5 is the intercept, and 3 is the coefficient.

    b. X is the dependent variable, Y is the independent variable, 0.5 is the intercept, and 3 is the coefficient.

    c. Y is the dependent variable, X is the independent variable, 3 is the intercept, and 0.5 is the coefficient.

    d. Y is the dependent variable, 3 is the independent variable, 0.5 is the intercept, and X is the coefficient.

  3. Interpret the following linear regression equation: Y = 5 - 10 * X

    a. The predicted value of Y is equal to 5 plus 10 times X.

    b. The value of Y is equal to 5 plus 10 times X.

    c. The predicted value of Y is equal to 5 minus 10 times X.

    d. The value of Y is equal to 5 minus 10 times X.
