Time for action - data exploration

To begin our analysis, we will examine the summary statistics and correlations of our data. These will give us an overview of the data and inform our subsequent analyses:

  1. Generate a summary of the fire attack subset using summary(object):
    > #generate a summary of the fire subset
    > summaryFire <- summary(subsetFire)
    > #display the summary
    > summaryFire
    
  2. Before calculating correlations, we will have to convert our nonnumeric data from the Method, SuccessfullyExecuted, and Result columns into numeric form.

    Note

    For a discussion on converting nonnumeric data, refer to the Quantifying Categorical Variables section of Chapter 4.

  3. Recode the Method column using as.numeric(data):
    > #represent categorical data numerically using as.numeric(data)
    > #recode the Method column into Fire = 1
    > numericMethodFire <- as.numeric(subsetFire$Method) - 1
    
  4. Recode the SuccessfullyExecuted column using as.numeric(data):
    > #recode the SuccessfullyExecuted column into N = 0 and Y = 1
    > numericExecutionFire <- as.numeric(subsetFire$SuccessfullyExecuted) - 1
    
  5. Recode the Result column using as.numeric(data):
    > #recode the Result column into Defeat = 0 and Victory = 1
    > numericResultFire <- as.numeric(subsetFire$Result) - 1
    
  6. With the Method, SuccessfullyExecuted, and Result columns coded into numeric form, let us now add them back into our fire dataset.
  7. Save the data in our recoded variables back into the original dataset:
    > #save the data in the numeric Method, SuccessfullyExecuted, and Result columns back into the fire attack dataset
    > subsetFire$Method <- numericMethodFire
    > subsetFire$SuccessfullyExecuted <- numericExecutionFire
    > subsetFire$Result <- numericResultFire
    
  8. Display the numeric version of the fire attack subset. Notice that all of the columns now contain numeric data.
  9. Having replaced our original text values in the Method, SuccessfullyExecuted, and Result columns with numeric data, we can now calculate all of the correlations in the dataset using the cor(data) function:
    > #use cor(data) to calculate all of the correlations in the fire attack dataset
    > cor(subsetFire)
    

    Note

    Note that the warning message and NA values in our correlation output result from the fact that our Method column contains only a single value, so its standard deviation is zero and no correlation can be computed for it. This is irrelevant to our analysis and can be ignored.
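The recoding and correlation steps above can be tried end to end on a small stand-in dataset. The following is only a sketch: toyFire and its values are made up for illustration and are not the book's battle data, and the column DurationInDays is an assumed name. It shows why subtracting 1 from as.numeric(data) yields 0 and 1 for a two-level factor, and why a column holding a single value produces NA correlations.

```r
# Hypothetical toy data; the values are invented for illustration only
toyFire <- data.frame(
  Method = factor(rep("Fire", 6), levels = c("Ambush", "Fire")),
  SuccessfullyExecuted = factor(c("Y", "N", "N", "Y", "N", "Y")),
  Result = factor(c("Victory", "Defeat", "Defeat", "Victory", "Defeat", "Defeat")),
  DurationInDays = c(3, 10, 12, 2, 9, 5)
)

# A factor is stored as integer codes starting at 1, one per level.
# Levels here are N/Y and Defeat/Victory, so subtracting 1 gives 0/1.
# Fire is the second level of Method, so it becomes 2 - 1 = 1.
toyFire$Method <- as.numeric(toyFire$Method) - 1
toyFire$SuccessfullyExecuted <- as.numeric(toyFire$SuccessfullyExecuted) - 1
toyFire$Result <- as.numeric(toyFire$Result) - 1

# Method is constant, so its correlations with the other columns are NA.
# Remove suppressWarnings() to see R's "standard deviation is zero" warning.
toyCor <- suppressWarnings(cor(toyFire))
toyCor
```

One caveat worth keeping in mind: this approach relies on the columns being factors. If they are plain character vectors, which is the default when creating or reading data frames in R 4.0 and later, as.numeric(data) returns NA, and the column would first need to be wrapped in factor().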

What just happened?

Initially, we calculated summary statistics for our fire attack dataset using the summary(object) function. From this information, we can derive the following useful insights about our past battles:

  • The rating of the Shu army's performance in fire attacks has ranged from 10 to 100, with a mean of 45
  • Fire attack plans have been successfully executed 10 out of 30 times (33%)
  • Fire attacks have resulted in victory 8 out of 30 times (27%)
  • Successfully executed fire attacks have resulted in victory 8 out of 10 times (80%), while unsuccessful attacks have never resulted in victory
  • The number of Shu soldiers engaged in fire attacks has ranged from 100 to 10,000 with a mean of 2,052
  • The number of Wei soldiers engaged in fire attacks has ranged from 1,500 to 50,000 with a mean of 12,333
  • The duration of fire attacks has ranged from 1 to 14 days with a mean of 7
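The percentages in the list above follow directly from the counts. As a rough sketch, the vectors below are built to match those counts (30 attacks, 10 executed, 8 victories, all among the executed attacks); the row order is an assumption made purely for illustration and does not reproduce the actual dataset.

```r
# Counts taken from the summary above; the ordering of rows is invented
executed <- c(rep("Y", 10), rep("N", 20))
result <- c(rep("Victory", 8), rep("Defeat", 22))

# The mean of a logical vector is the proportion of TRUE values
pctExecuted <- round(100 * mean(executed == "Y"))
pctVictory <- round(100 * mean(result == "Victory"))

# Restrict to executed attacks before computing the victory rate
pctVictoryIfExecuted <- round(100 * mean(result[executed == "Y"] == "Victory"))

# A cross-tabulation shows all four combinations at once
crossTab <- table(executed, result)
crossTab
```

The cross-tabulation makes the fourth insight easy to read off: the cell for unexecuted attacks that ended in victory is zero.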

Next, we recoded the text values in our dataset's Method, SuccessfullyExecuted, and Result columns into numeric form. After adding the data from these variables back into our original dataset, we were able to calculate all of its correlations. This allowed us to learn even more about our past battle data:

  • The performance rating of a fire attack has been highly correlated with successful execution of the battle plans (0.92) and the battle's result (0.90), but not strongly correlated with the other variables.
  • The execution of a fire attack has been moderately negatively correlated with the duration of the attack (-0.46), such that longer attacks have been associated with a lower chance of successful execution.
  • The numbers of Shu and Wei soldiers engaged are highly correlated with each other (0.74), but not strongly correlated with the other variables.

The insights gleaned from our summary statistics and correlations put us in a prime position to begin developing our regression model.

Pop quiz

  1. Which of the following is a benefit of adding a text variable back into its original dataset after it has been recoded into numeric form?

    a. Calculation functions can be executed on the recoded variable.

    b. Calculation functions can be executed on the other variables in the dataset.

    c. Calculation functions can be executed on the entire dataset.

    d. There is no benefit.
