Predicting complex skill learning

In this section, we'll have a chance to explore data from an innovative and recent project known as SkillCraft. The interested reader can find out more about this project on the Web by going to http://skillcraft.ca/. The key premise behind the project is that by studying the performance of players in a real-time strategy (RTS) game that involves complex resource management and strategic decisions, we can study how humans learn complex skills and develop speed and competence in dynamic resource allocation scenarios. To achieve this, data has been collected from players playing the popular real-time strategy game, Starcraft 2, developed by Blizzard.

In this game, players compete against each other on one of many fixed maps, each with predetermined starting locations. Each player chooses one of three fictional races and starts with six worker units, which are used to collect one of the two game resources. These resources are needed in order to build military and production buildings, train military units unique to each race, research technologies, and build more worker units. The game involves a mix of economic advancement, military growth, and military strategy in real-time engagements.

Players are pitted against each other via an online matchmaking algorithm that groups players into leagues according to their perceived level of skill. The algorithm's estimate of a player's skill is updated over time on the basis of that player's performance in the games they play. There are eight leagues in total, and their populations are uneven: the lower leagues tend to have more players and the upper leagues fewer.

Having a basic understanding of the game, we can download the SkillCraft1 Master Table data set from the UCI Machine Learning Repository by going to https://archive.ics.uci.edu/ml/datasets/SkillCraft1+Master+Table+Dataset. The rows of this data set are individual games, and the features are metrics of a player's playing speed, competence, and decision-making. The authors of the data set have used standard performance metrics familiar to players of the game, as well as other metrics such as Perception Action Cycles (PACs), which attempt to quantify a player's actions during the window of time in which the screen remains fixed on a particular map location.

The task at hand is to predict which of the eight leagues a player is currently assigned to on the basis of these performance metrics. Our output variable is an ordered categorical variable: we have eight distinct leagues ordered from 1 to 8, where league 8 contains the players of the highest skill.

One possible way to deal with ordinal outputs is to treat them as numeric, model the problem as a regression task, and build a regression tree. The following table describes the features and output variable in our data set:

| Feature name | Type | Description |
| --- | --- | --- |
| Age | Numeric | Player's age |
| HoursPerWeek | Numeric | Reported hours spent playing per week |
| TotalHours | Numeric | Reported total hours ever spent playing |
| APM | Numeric | Game actions per minute |
| SelectByHotkeys | Numeric | Number of unit or building selections made using hotkeys per timestamp |
| AssignToHotkeys | Numeric | Number of units or buildings assigned to hotkeys per timestamp |
| UniqueHotkeys | Numeric | Number of unique hotkeys used per timestamp |
| MinimapAttacks | Numeric | Number of attack actions on the minimap per timestamp |
| MinimapRightClicks | Numeric | Number of right-clicks on the minimap per timestamp |
| NumberOfPACs | Numeric | Number of PACs per timestamp |
| GapBetweenPACs | Numeric | Mean duration in milliseconds between PACs |
| ActionLatency | Numeric | Mean latency, in milliseconds, from the onset of a PAC to the first action within it |
| ActionsInPAC | Numeric | Mean number of actions within each PAC |
| TotalMapExplored | Numeric | Number of 24x24 game coordinate grids viewed by the player per timestamp |
| WorkersMade | Numeric | Number of worker units trained per timestamp |
| UniqueUnitsMade | Numeric | Number of unique units made per timestamp |
| ComplexUnitsMade | Numeric | Number of complex units trained per timestamp |
| ComplexAbilitiesUsed | Numeric | Number of abilities requiring specific targeting instructions used per timestamp |
| LeagueIndex | Numeric | Bronze, Silver, Gold, Platinum, Diamond, Master, GrandMaster, and Professional leagues, coded 1-8 (output) |

Tip

If the reader has never played a real-time strategy game like Starcraft 2, many of the features in this data set may sound arcane. It suffices to take on board that they measure various aspects of a player's in-game performance; the discussion surrounding the training and testing of our regression tree can then be followed without difficulty.

To start with, we load this data set into a data frame called skillcraft. Before beginning to work with the data, we will have to do some preprocessing. Firstly, we'll drop the first column, which holds a unique game identifier that we don't need and won't use. Secondly, a quick inspection of the imported data frame shows that three columns have been interpreted as factors, because the input data set uses a question mark to denote missing values. To deal with this, we first need to convert these columns to numeric columns, a process that will introduce missing values into our data set.

Next, although we've seen that trees are quite capable of handling these missing values, we are going to remove the few rows that have them. We will do this because we want to be able to compare the performance of several different models in this chapter and in the next, not all of which support missing values. Here is the code for the preprocessing steps just described:

> skillcraft <- read.csv("SkillCraft1_Dataset.csv")
> skillcraft <- skillcraft[-1]
> skillcraft$TotalHours <- as.numeric(
  levels(skillcraft$TotalHours))[skillcraft$TotalHours]
Warning message:
NAs introduced by coercion 
> skillcraft$HoursPerWeek <- as.numeric(
  levels(skillcraft$HoursPerWeek))[skillcraft$HoursPerWeek]
Warning message:
NAs introduced by coercion 
> skillcraft$Age <- as.numeric(
  levels(skillcraft$Age))[skillcraft$Age]
Warning message:
NAs introduced by coercion 
> skillcraft <- skillcraft[complete.cases(skillcraft),]
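
As an aside, the same cleanup can be done in a single step by telling read.csv() which string denotes a missing value, via its na.strings argument. The following is a minimal alternative sketch; note also that from R 4.0 onwards, read.csv() no longer converts strings to factors by default, so the factor-conversion idiom above would not be needed in that case:

> # Alternative: treat "?" as NA at import time
> skillcraft <- read.csv("SkillCraft1_Dataset.csv", na.strings = "?")
> skillcraft <- skillcraft[-1]
> skillcraft <- skillcraft[complete.cases(skillcraft),]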

As usual, the next step will be to split our data into training and test sets:

> library(caret)
> set.seed(133)
> skillcraft_sampling_vector <- createDataPartition( 
  skillcraft$LeagueIndex, p = 0.80, list = FALSE)
> skillcraft_train <- skillcraft[skillcraft_sampling_vector,]
> skillcraft_test <- skillcraft[-skillcraft_sampling_vector,]
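
Since createDataPartition() attempts a stratified split with respect to the outcome, the proportions of each league should be roughly preserved in both sets. The following quick check (our own addition, not in the original analysis) confirms this:

> # Compare league proportions across training and test sets
> round(prop.table(table(skillcraft_train$LeagueIndex)), 2)
> round(prop.table(table(skillcraft_test$LeagueIndex)), 2)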

This time, we will use the rpart package to build our decision tree (this and the tree package are the two most commonly used packages for building tree-based models in R). The package provides the rpart() function for building trees. Just as with the tree() function, we can build a regression tree using the default behavior by simply providing a formula and our data frame:

> library(rpart)
> regtree <- rpart(LeagueIndex ~ ., data = skillcraft_train)

We can plot our regression tree to see what it looks like:

> plot(regtree, uniform = TRUE)
> text(regtree, use.n = FALSE, all = TRUE, cex = .8)

This is the plot that is produced:

[Figure: the regression tree trained with rpart's default settings]

To get a sense of the accuracy of our regression tree, we will compute predictions on the test data and then measure the SSE. This can be done with the help of a simple function that we will define, compute_SSE(), which calculates the sum of squared errors given a vector of target values and a vector of predicted values (the order of the two arguments does not matter, as squared error is symmetric):

> compute_SSE <- function(correct, predictions) {
    return(sum((correct - predictions) ^ 2))
  }
 
> regtree_predictions <- predict(regtree, skillcraft_test)
> (regtree_SSE <- compute_SSE(regtree_predictions, skillcraft_test$LeagueIndex))
[1] 740.0874
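
An SSE of roughly 740 is hard to interpret on its own. One way to put it on a per-game scale (our own addition, not part of the original analysis) is to convert it into a root mean squared error, which expresses the typical prediction error in units of league indices:

> # Express the test error as a typical per-game error (RMSE)
> sqrt(regtree_SSE / nrow(skillcraft_test))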

Tuning model parameters in CART trees

So far, all we have done is use default values for all the parameters of the recursive partitioning algorithm for building the tree. The rpart() function has a special control parameter to which we can provide an object containing the values of any parameters we wish to override. To build this object, we must use the special rpart.control() function. There are a number of different parameters that we could tweak, and it is worth studying the help file for this function to learn more about them.

Here we will focus on three important parameters that affect the size and complexity of our tree. The minsplit parameter holds the minimum number of data points that must be present at a node for the algorithm to attempt a split; below this threshold, the node becomes a leaf. The default value is 20. The cp parameter is the complexity parameter we have seen before, and its default value is 0.01. Finally, the maxdepth parameter limits the depth of the tree, that is, the maximum number of splits on the path between the root node and any leaf node. The default value of 30 is quite liberal here, allowing fairly large trees to be built. We can try out a different regression tree by specifying values for these parameters that differ from their defaults. We'll do this and see whether it affects the SSE performance on our test set:

> regtree.random <- rpart(LeagueIndex ~ ., data = skillcraft_train, 
  control = rpart.control(minsplit = 20, cp = 0.001, maxdepth = 10))
> regtree.random_predictions <- predict(regtree.random, 
  skillcraft_test)
> (regtree.random_SSE <- compute_SSE(regtree.random_predictions, 
   skillcraft_test$LeagueIndex))
[1] 748.6157

Using these values, we are limiting the tree to a depth of 10 and reducing the degree of regularization by lowering the complexity parameter to 0.001, while leaving minsplit at its default value of 20. This is an arbitrary choice that happens to give us a slightly worse SSE on our test set. In practice, what is needed is a systematic way to find appropriate values of these parameters by trying out a number of different combinations and using cross-validation to estimate their performance on unseen data.

Essentially, we would like to tune our regression tree training. In Chapter 5, Support Vector Machines, we met the tune() function from the e1071 package, which can help us do just that. We will use this function with rpart() and provide it with ranges for the three parameters we just discussed:

> library(e1071)
> rpart.ranges <- list(minsplit = seq(5, 50, by = 5),
    cp = c(0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5),
    maxdepth = 1:10)
> (regtree.tune <- tune(rpart, LeagueIndex ~ .,
    data = skillcraft_train, ranges = rpart.ranges))

Parameter tuning of 'rpart':

- sampling method: 10-fold cross validation 

- best parameters:
 minsplit    cp maxdepth
       35 0.002        6

- best performance: 1.046638

Running the preceding command will likely take several minutes to complete, as there are 10 × 10 × 10 = 1,000 parameter combinations to evaluate, each via 10-fold cross-validation. Once the procedure completes, we can train a tree with the suggested values:

> regtree.tuned <- rpart(LeagueIndex ~ ., data = skillcraft_train,  
  control = rpart.control(minsplit = 35, cp = 0.002, maxdepth = 6))
> regtree.tuned_predictions <- predict(regtree.tuned, 
  skillcraft_test)
> (regtree.tuned_SSE <- compute_SSE(regtree.tuned_predictions, 
   skillcraft_test$LeagueIndex))
[1] 701.3386

Indeed, we have a lower SSE value with these settings on our test set. If we type in the name of our new regression tree model, regtree.tuned, we'll see that it has many more nodes and is now substantially more complex.
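
One quick way to compare the sizes of the two trees without printing them in full is to count their nodes via the frame component that rpart models expose (a small addition of our own):

> # Each row of an rpart model's frame corresponds to one tree node
> nrow(regtree$frame)
> nrow(regtree.tuned$frame)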

Variable importance in tree models

For large trees such as this, plotting is less useful as it is very hard to make the plot readable. One interesting plot that we can obtain is a plot of variable importance. For every input feature, we keep track of the reduction in the optimization criterion (for example, deviance or SSE) that occurs every time it is used anywhere in the tree. We can then tally up this quantity for all the splits in the tree and thus obtain relative amounts of variable importance.

Intuitively, features that are highly important will tend to have been used early to split the data (and hence appear higher up in the tree, closer to the root node) as well as more often. If a feature is never used, it is not important; in this way, tree models give us a built-in form of feature selection.

Note that this approach is sensitive to correlation among the features. When deciding which feature to split on, the algorithm may pick somewhat arbitrarily between two highly correlated features; as a result, the model may use more features than necessary, and the importance of each correlated feature will be lower than if either had been chosen on its own. It turns out that variable importance is automatically computed by rpart() and stored in the variable.importance attribute of the tree model that is returned.
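
As a minimal sketch, we can plot this attribute with barplot() (the graphical parameters las and cex.names are our own choices, to keep the feature labels readable):

> # Plot relative variable importance for the tuned tree
> barplot(regtree.tuned$variable.importance, las = 2,
    cex.names = 0.7, main = "Variable Importance")

The resulting plot is shown below: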

[Figure: barplot of variable importance for the tuned regression tree]

To an experienced player of the RTS genre, this graph looks quite reasonable and intuitive. The biggest separator of skill according to this graph is the average number of game actions that a player makes in a minute (APM). Experienced and effective players are capable of making many actions whereas less experienced players will make fewer.

At first glance, this may seem to be simply a matter of acquiring so-called muscle memory and developing faster reflexes, but in actuality, knowing which actions to carry out and playing with strategy and planning during the game (a characteristic of better players) also significantly increase this metric.

Another speed-related attribute is the ActionLatency feature, which essentially measures the time between choosing to focus the map on a particular location on the battlefield and executing the first action at that location. Better players will spend less time looking at a map location and will be faster at selecting units, giving orders, and deciding what to do given an image of a situation in the game.

Regression model trees in action

We'll wrap up the experiments in this chapter with a very short demonstration of how to run a regression model tree in R. We can do this very easily using the RWeka package, which contains the M5P() function. This follows the typical convention of requiring a formula and a data frame with the training data:

> library("RWeka")
> m5treee <- M5P(LeagueIndex ~ ., data = skillcraft_train)
> m5tree_predictions <- predict(m5tree, skillcraft_test)
> m5tree_SSE <- compute_SSE(m5tree_predictions, 
                            skillcraft_test$LeagueIndex)
> m5tree_SSE
[1] 714.8785

Note that, even with default settings, we get performance comparable to our tuned CART tree (an SSE of 714.88 versus 701.34). We'll leave readers to explore this function further, but we will revisit this data set in the next chapter, on ensemble methods.

Note

A good reference on regression model trees, containing several case studies, is the original paper by Quinlan, titled Learning with Continuous Classes, from the proceedings of the Australian Joint Conference on Artificial Intelligence (1992).
