Chapter 4. Predicting Forest Cover with Decision Trees

Classification and regression are the oldest and most well-studied types of predictive analytics. Most algorithms you will likely encounter in analytics packages and libraries are classification or regression techniques, like support vector machines, logistic regression, neural networks, and deep learning. The common thread linking regression and classification is that both involve predicting one (or more) values given one (or more) other values. To do so, both require a body of inputs and outputs to learn from. They need to be fed both questions and known answers. For this reason, they are known as types of supervised learning.

PySpark MLlib offers implementations of a number of classification and regression algorithms. These include decision trees, naïve bayes, logistic regression, and linear regression. The exciting thing about these algorithms is that, with all due respect to Mr. Bohr, they can help predict the future—or at least, predict the things we don’t yet know for sure, like the likelihood you will buy a car based on your online behavior, whether an email is spam given its words, or which acres of land are likely to grow the most crops given their location and soil chemistry.

In this chapter, we will focus on a popular and flexible type of algorithm for both classification and regression: decision trees, and the algorithm’s extension, random decision forests. First, we will will understand the basics of Decision Trees and Forests, and introduce the former’s PySpark implementation. This will be followed by preparation of our dataset, and creating our first decision tree. Then we will tune our decision tree model. We will finish off by training a random forest model on our processed dataset and making predictions.

Although PySpark’s decision tree implementation is easy to get started with, it is helpful to understand the fundamentals of Decision Tree and Random Forest algorithms. That is what we will cover in the next section.

Decision Trees and Forests

Decision trees are a family of algorithms that can naturally handle both categorical and numeric features. Building a single tree can be done in parallel, and many trees can be built in parallel at once. They are robust to outliers in the data, meaning that a few extreme and possibly erroneous data points might not affect predictions at all. They can consume data of different types and on different scales without the need for preprocessing or normalization, which is an issue that will reappear in chapter 7.

The PySpark implementation of decision trees supports binary and multiclass classification, and regression. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances. Decision tree–based algorithms have the further advantage of being comparatively intuitive to understand and reason about. In fact, we all probably use the same reasoning embodied in decision trees, implicitly, in everyday life. For example, I sit down to have morning coffee with milk. Before I commit to that milk and add it to my brew, I want to predict: is the milk spoiled? I don’t know for sure. I might check if the use-by date has passed. If not, I predict no, it’s not spoiled. If the date has passed, but that was three or fewer days ago, I take my chances and predict no, it’s not spoiled. Otherwise, I sniff the milk. If it smells funny, I predict yes, and otherwise no.

This series of yes/no decisions that lead to a prediction are what decision trees embody. Each decision leads to one of two results, which is either a prediction or another decision, as shown in Figure 4-1. In this sense, it is natural to think of the process as a tree of decisions, where each internal node in the tree is a decision, and each leaf node is a final answer.

aas2 0401
Figure 4-1. Decision tree: is it spoiled?

That is a simplistic decision tree, and was not built with any rigor. To elaborate, consider another example. A robot has taken a job in an exotic pet store. It wants to learn, before the shop opens, which animals in the shop would make a good pet for a child. The owner lists nine pets that would and wouldn’t be suitable before hurrying off. The robot compiles the information found in Table 4-1 from examining the animals

Table 4-1. Exotic pet store “feature vectors”
Name Weight (kg) # Legs Color Good pet?

Fido

20.5

4

Brown

Yes

Mr. Slither

3.1

0

Green

No

Nemo

0.2

0

Tan

Yes

Dumbo

1390.8

4

Gray

No

Kitty

12.1

4

Gray

Yes

Jim

150.9

2

Tan

No

Millie

0.1

100

Brown

No

McPigeon

1.0

2

Gray

No

Spot

10.0

4

Brown

Yes

Although a name is given, it will not be included as a feature. There is little reason to believe the name alone is predictive; “Felix” could name a cat or a poisonous tarantula, for all the robot knows. So, there are two numeric features (weight, number of legs) and one categorical feature (color) predicting a categorical target (is/is not a good pet for a child.

The robot might try to fit a simple decision tree to this training data to start, consisting of a single decision based on weight, as shown in Figure 4-2.

aas2 0402
Figure 4-2. Robot’s first decision tree

The logic of the decision tree is easy to read and make some sense of: 500kg animals certainly sound unsuitable as pets. This rule predicts the correct value in five of nine cases. A quick glance suggests that we could improve the rule by lowering the weight threshold to 100kg. This gets six of nine examples correct. The heavy animals are now predicted correctly; the lighter animals are only partly correct.

So, a second decision can be constructed to further refine the prediction for examples with weights less than 100kg. It would be good to pick a feature that changes some of the incorrect Yes predictions to No. For example, there is one small green animal, sounding suspiciously like a snake, that the robot could predict correctly by deciding on color, as shown in Figure 4-3.

aas2 0403
Figure 4-3. Robot’s next decision tree

Now, seven of nine examples are correct. Of course, decision rules could be added until all nine were correctly predicted. The logic embodied in the resulting decision tree would probably sound implausible when translated into common speech: “If the animal’s weight is less than 100kg, and its color is brown instead of green, and it has fewer than 10 legs, then yes it is a suitable pet.” While perfectly fitting the given examples, a decision tree like this would fail to predict that a small, brown, four-legged wolverine is not a suitable pet. Some balance is needed to avoid this phenomenon, known as overfitting.

Decision trees generalize into a more powerful algorithm, called random forests. Random forests combine many decision trees in order to reduce the risk of overfitting and train the decision trees separately. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions reduces the variance of the predictions, makes the resulting model more generalizable and improves performance on test data.

This is enough of an introduction to decision trees and random forests for us to begin using them with PySpark. In the next section, we will introduce the dataset that we’ll work with and prepare it for use in PySpark.

Preparing the Data

The data set used in this chapter is the well-known Covtype data set, available online as a compressed CSV-format data file, covtype.data.gz, and accompanying info file, covtype.info.

The data set records the types of forest-covering parcels of land in Colorado, USA. It’s only a coincidence that the data set concerns real-world forests! Each example contains several features describing each parcel of land—like its elevation, slope, distance to water, shade, and soil type—along with the known forest type covering the land. The forest cover type is to be predicted from the rest of the features, of which there are 54 in total.

This data set has been used in research and even a Kaggle competition. It is an interesting data set to explore in this chapter because it contains both categorical and numeric features. There are 581,012 examples in the data set, which does not exactly qualify as big data but is large enough to be manageable as an example and still highlight some issues of scale.

Thankfully, the data is already in a simple CSV format and does not require much cleansing or other preparation to be used with PySpark MLlib. Later, it will be of interest to explore some transformations of the data, but it can be used as is to start.

The covtype.data file should be extracted and copied into HDFS. Start pyspark-shell. You may find it helpful to give the shell a healthy amount of memory to work with, as building decision forests can be resource-intensive. If you have the memory, specify --driver-memory 8g or similar.

CSV files contain fundamentally tabular data, organized into rows of columns. Sometimes these columns are given names in a header line, although that’s not the case here. The column names are given in the companion file, covtype.info. Conceptually, each column of a CSV file has a type as well—a number, a string—but a CSV file doesn’t specify this.

It’s natural to parse this data as a data frame because this is PySpark’s abstraction for tabular data, with a defined column schema, including column names and types. PySpark has built-in support for reading CSV data, in fact:

dataWithoutHeader = spark.read.option("inferSchema", True).option("header", False).csv("covtype.data")
dataWithoutHeader.printSchema()
...
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
 |-- _c3: integer (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: integer (nullable = true)
 ...

This code reads the input as CSV and does not attempt to parse the first line as a header of column names. It also requests that the type of each column be inferred by examining the data. It correctly infers that all of the columns are numbers, and more specifically, integers. Unfortunately it can only name the columns “_c0” and so on.

Looking at the column names, it’s clear that some features are indeed numeric. “Elevation” is an elevation in meters; “Slope” is measured in degrees. However, “Wilderness_Area” is something different, because it is said to span four columns, each of which is a 0 or 1. In reality, “Wilderness_Area” is a categorical value, not a numeric one.

These four columns are actually a one-hot or 1-of-n encoding, in which one categorical feature that takes on N distinct values becomes N numeric features, each taking on the value 0 or 1. Exactly one of the N values has value 1, and the others are 0. For example, a categorical feature for weather that can be cloudy, rainy, or clear would become three numeric features, where cloudy is represented by 1,0,0; rainy by 0,1,0; and so on. These three numeric features might be thought of as is_cloudy, is_rainy, and is_clear features. Likewise, 40 of the columns are really one Soil_Type categorical feature.

This isn’t the only possible way to encode a categorical feature as a number. Another possible encoding simply assigns a distinct numeric value to each possible value of the categorical feature. For example, cloudy may become 1.0, rainy 2.0, and so on. The target itself, “Cover_Type”, is a categorical value encoded as a value 1 to 7.

Be careful when encoding a categorical feature as a single numeric feature. The original categorical values have no ordering, but when encoded as a number, they appear to. Treating the encoded feature as numeric leads to meaningless results because the algorithm is effectively pretending that rainy is somehow greater than, and two times larger than, cloudy. It’s OK as long as the encoding’s numeric value is not used as a number.

So we see both types of encodings of categorical features. It would have, perhaps, been simpler and more straightforward to not encode such features (and in two ways, no less), and instead simply include their values directly like “Rawah Wilderness Area.” This may be an artifact of history; the data set was released in 1998. For performance reasons or to match the format expected by libraries of the day, which were built more for regression problems, data sets often contain data encoded in these ways.

In any event, before proceeding, it is useful to add column names to this DataFrame in order to make it easier to work with:

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col

colNames = ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points"] +  1
[f"Wilderness_Area_{i}" for i in range(4)] + 
[f"Soil_Type_{i}" for i in range(40)] + 
["Cover_Type"]

data = dataWithoutHeader.toDF(*colNames).withColumn("Cover_Type", col("Cover_Type").cast(DoubleType()))

data.head()
...
Row(Elevation=2596, Aspect=51, Slope=3, Horizontal_Distance_To_Hydrology=258, Vertical_Distance_To_Hydrology=0, Horizontal_Distance_To_Roadways=510, Hillshade_9am=221, Hillshade_Noon=232, Hillshade_3pm=148, ...)
1

++ concatenates collections

The wilderness- and soil-related columns are named “Wilderness_Area_0”, “Soil_Type_0”, and a bit of Python can generate these 44 names without having to type them all out. Finally, the target “Cover_Type” column is cast to a double value upfront, because it will actually be necessary to consume it as a double rather than int in all PySpark MLlib APIs. This will become apparent later.

You can call data.show() to see some rows of the data set, but the display is so wide that it will be difficult to read at all. data.head() displays it as a raw Row object, which will be more readable in this case.

Now that we’re familiar with our dataset and have processed it, we can train a decision tree model in the next section.

Our First Decision Tree

In chapter 3, we built a recommender model right away on all of the available data. This created a recommender that could be sense-checked by anyone with some knowledge of music: looking at a user’s listening habits and recommendations, we got some sense that it was producing good results. Here, that is not possible. We would have no idea how to make up a new 54-feature description of a new parcel of land in Colorado or what kind of forest cover to expect from such a parcel.

Instead, we must jump straight to holding out some data for purposes of evaluating the resulting model. Before, the AUC metric was used to assess the agreement between held-out listening data and predictions from recommendations. The principle is the same here, although the evaluation metric will be different: accuracy. The majority—90%—of the data will again be used for training, and later, we’ll see that a subset of this training set will be held out for cross-validation (the CV set). The other 10% held out here is actually a third subset, a proper test set.

(trainData, testData) = data.randomSplit([0.9, 0.1])
trainData.cache()
testData.cache()

The data needs a little more preparation to be used with a classifier in MLlib. The input DataFrame contains many columns, each holding one feature that could be used to predict the target column. MLlib requires all of the inputs to be collected into one column, whose value is a vector. This class is an abstraction for vectors in the linear algebra sense, and contains only numbers. For most intents and purposes, they work like a simple array of double values (floating-point numbers). Of course, some of the input features are conceptually categorical, even if they’re all represented with numbers in the input.

Fortunately, the VectorAssembler class can do this work:

from pyspark.ml.feature import VectorAssembler

inputCols = colNames[:-1] 1
vector_assembler = VectorAssembler(inputCols=inputCols,outputCol="featureVector")

assembledTrainData = vector_assembler.transform(trainData)

assembledTrainData.select("featureVector").show(truncate = False)
...
+------------------------------------------------------------------- ...
|featureVector                                                       ...
+------------------------------------------------------------------- ...
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1863.0,37.0,17.0,120.0,18.0,90.0,2 ...
|(54,[0,1,2,5,6,7,8,9,13,18],[1874.0,18.0,14.0,90.0,208.0,209.0,135. ...
|(54,[0,1,2,3,4,5,6,7,8,9,13,18],[1879.0,28.0,19.0,30.0,12.0,95.0,20 ...
...
1

Excludes the label, Cover_Type

Its key parameters are the columns to combine into the feature vector, and the name of the new column containing the feature vector. Here, all columns—except—the target, of course—are included as input features. The resulting DataFrame has a new “featureVector” column, as shown.

The output doesn’t look exactly like a sequence of numbers, but that’s because this shows a raw representation of the vector, represented as a SparseVector instance to save storage. Because most of the 54 values are 0, it only stores nonzero values and their indices. This detail won’t matter in classification.

VectorAssembler is an example of Transformer within the current MLlib “Pipelines” API. It transforms another DataFrame into a DataFrame, and is composable with other transformations into a pipeline. Later in this chapter, these transformations will be connected into an actual Pipeline. Here, the transformation is just invoked directly, which is sufficient to build a first decision tree classifier model.

from pyspark.ml.classification import DecisionTreeClassifier

classifier = DecisionTreeClassifier(seed = 1234, labelCol="Cover_Type", featuresCol="featureVector", predictionCol="prediction")

model = classifier.fit(assembledTrainData)
print(model.toDebugString)
...
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_d44ee1f6587c, depth=5, numNodes=41, numClasses=8, numFeatures=54
  If (feature 0 <= 3036.5)
   If (feature 0 <= 2545.5)
    If (feature 10 <= 0.5)
     If (feature 0 <= 2414.5)
      If (feature 3 <= 15.0)
       Predict: 4.0
      Else (feature 3 > 15.0)
       Predict: 3.0
     Else (feature 0 > 2414.5)
      Predict: 3.0
       ...

Again, the essential configuration for the classifier consists of column names: the column containing the input feature vectors and the column containing the target value to predict. Because the model will later be used to predict new values of the target, it is given the name of a column to store predictions.

Printing a representation of the model shows some of its tree structure. It consists of a series of nested decisions about features, comparing feature values to thresholds. (Here, for historical reasons, the features are only referred to by number, not name, unfortunately.)

Decision trees are able to assess the importance of input features as part of their building process. That is, they can estimate how much each input feature contributes to making correct predictions. This information is simple to access from the model.

import pandas as pd

pd.DataFrame(model.featureImportances.toArray(), index=inputCols, columns=['importance']).sort_values(by="importance", ascending=False)
...
                                    importance
Elevation                             0.826946
Hillshade_Noon                        0.029365
Soil_Type_1                           0.028276
Soil_Type_3                           0.026778
Horizontal_Distance_To_Hydrology      0.024833
Wilderness_Area_0                     0.024618
Soil_Type_31                          0.018371
Wilderness_Area_2                     0.012516
Horizontal_Distance_To_Roadways       0.003654
Hillshade_9am                         0.002906
Horizontal_Distance_To_Fire_Points    0.001737
Soil_Type_28                          0.000000
...

This pairs importance values (higher is better) with column names and prints them in order from most to least important. Elevation seems to dominate as the most important feature; most features are estimated to have virtually no importance when predicting the cover type!

The resulting DecisionTreeClassificationModel is itself a transformer because it can transform a data frame containing feature vectors into a data frame also containing predictions.

For example, it might be interesting to see what the model predicts on the training data, and compare its prediction with the known correct cover type.

predictions = model.transform(assembledTrainData)
predictions.select("Cover_Type", "prediction", "probability").show(10, truncate = False)

...
+----------+----------+------------------------------------------------ ...
|Cover_Type|prediction|probability                                      ...
+----------+----------+------------------------------------------------ ...
|6.0       |3.0       |[0.0,0.0,0.02477728285077951,0.6444877505567929, ...
|6.0       |4.0       |[0.0,0.0,0.03204807210816224,0.2989484226339509, ...
|6.0       |3.0       |[0.0,0.0,0.02477728285077951,0.6444877505567929, ...
|6.0       |3.0       |[0.0,0.0,0.02477728285077951,0.6444877505567929, ...
...

Interestingly, the output also contains a “probability” column that gives the model’s estimate of how likely it is that each possible outcome is correct. This shows that in these instances, it’s fairly sure the answer is 3 in several cases and quite sure the answer isn’t 1.

Eagle-eyed readers might note that the probability vectors actually have eight values even though there are only seven possible outcomes. The vector’s values at indices 1 to 7 do contain the probability of outcomes 1 to 7. However, there is also a value at index 0, which always shows as probability 0.0. This can be ignored, as 0 isn’t even a valid outcome, as this says. It’s a quirk of representing this information as a vector that’s worth being aware of.

Based on this snippet, it looks like the model could use some work. Its predictions look like they are often wrong. As with the ALS implementation in chapter 3, the DecisionTreeClassifier implementation has several hyperparameters for which a value must be chosen, and they’ve all been left to defaults here. Here, the test set can be used to produce an unbiased evaluation of the expected accuracy of a model built with these default hyperparameters.

MulticlassClassificationEvaluator can compute accuracy and other metrics that evaluate the quality of the model’s predictions. It’s an example of an evaluator in MLlib, which is responsible for assessing the quality of an output DataFrame in some way.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="Cover_Type", predictionCol="prediction")

evaluator.setMetricName("accuracy").evaluate(predictions)
evaluator.setMetricName("f1").evaluate(predictions)

...
0.699417706840366
0.6825807291644509

After being given the column containing the “label” (target, or known correct output value) and the name of the column containing the prediction, it finds that the two match about 70% of the time. This is the accuracy of this classifier. It can compute other related measures, like the F1 score. For purposes here, accuracy will be used to evaluate classifiers.

This single number gives a good summary of the quality of the classifier’s output. Sometimes, however, it can be useful to look at the confusion matrix. This is a table with a row and a column for every possible value of the target. Because there are seven target category values, this is a 7×7 matrix, where each row corresponds to an actual correct value, and each column to a predicted value, in order. The entry at row i and column j counts the number of times an example with true category i was predicted as category j. So, the correct predictions are the counts along the diagonal and the predictions are everything else.

It’s possible to calculate something like a confusion matrix directly with the DataFrame API, using its more general operators.

confusionMatrix = predictions.groupBy("Cover_Type").pivot("prediction", range(1,8)).count().
  na.fill(0.0). 1
  orderBy("Cover_Type")

confusionMatrix.show()

...

+----------+------+------+-----+---+---+---+-----+
|Cover_Type|     1|     2|    3|  4|  5|  6|    7|
+----------+------+------+-----+---+---+---+-----+
|       1.0|133529| 51669|  111|  0|  0|  0| 5119|
|       2.0| 56784|192506| 4859| 64|  0|  0|  735|
|       3.0|     0|  3357|28225|597|  0|  0|    0|
|       4.0|     0|     0| 1487|948|  0|  0|    0|
|       5.0|     0|  8270|  271|  0|  0|  0|    0|
|       6.0|     0|  3407|11829|388|  0|  0|    0|
|       7.0|  8057|    77|    0|  0|  0|  0|10300|
+----------+------+------+-----+---+---+---+-----+
1

Replace null with 0

Spreadsheet users may have recognized the problem as just like that of computing a pivot table. A pivot table groups values by two dimensions whose values become rows and columns of the output, and compute some aggregation within those groupings, like a count here. This is also available as a PIVOT function in several databases, and is supported by Spark SQL. It’s arguably more elegant and powerful to compute it this way.

Although 70% accuracy sounds decent, it’s not immediately clear whether it is outstanding or poor. How well would a simplistic approach do to establish a baseline? Just as a broken clock is correct twice a day, randomly guessing a classification for each example would also occasionally produce the correct answer.

We could construct such a random “classifier” by picking a class at random in proportion to its prevalence in the training set. For example, if 30% of the training set were cover type 1, then the random classifier would guess “1” 30% of the time. Each classification would be correct in proportion to its prevalence in the test set. If 40% of the test set were cover type 1, then guessing “1” would be correct 40% of the time. Cover type 1 would then be guessed correctly 30% x 40% = 12% of the time and contribute 12% to overall accuracy. Therefore, we can evaluate the accuracy by summing these products of probabilities:

from pyspark.sql import DataFrame

total = data.count()
def classProbabilities(data):
  return data.groupBy("Cover_Type").count(). 1
  orderBy("Cover_Type"). 2
  select(col("count").cast(DoubleType())).
  withColumn("count_proportion", col("count")/total). 3
  select("count_proportion").collect()


trainPriorProbabilities = classProbabilities(trainData)
testPriorProbabilities = classProbabilities(testData)

trainPriorProbabilities
...

[Row(count_proportion=0.36457304055823236), Row(count_proportion=0.4877308528103696), Row(count_proportion=0.06148812265515851), Row(count_proportion=0.004766610030522358), Row(count_proportion=0.01626266429419196), Row(count_proportion=0.02986871754490915), Row(count_proportion=0.03530999210661611)]

...

trainPriorProbabilities = [p[0] for p in trainPriorProbabilities]
testPriorProbabilities = [p[0] for p in testPriorProbabilities]

sum([train_p * cv_p for train_p, cv_p in zip(trainPriorProbabilities, testPriorProbabilities)]) 4
...

0.3765186220733993
1

Count by category

2

Order counts by category

3

To data set

4

Sum products of pairs in training, test sets

Random guessing achieves 37% accuracy then, which makes 70% seem like a good result after all. But this result was achieved with default hyperparameters. We can do even better by exploring what the hyperparameters actually mean for the tree-building process. That is what we will do in the next section.

Decision Tree Hyperparameters

In chapter 3, the ALS algorithm exposed several hyperparameters whose values we had to choose by building models with various combinations of values and then assessing the quality of each result using some metric. The process is the same here, although the metric is now multiclass accuracy instead of AUC. The hyperparameters controlling how the tree’s decisions are chosen will be quite different as well: maximum depth, maximum bins, impurity measure, and minimum information gain.

Maximum depth simply limits the number of levels in the decision tree. It is the maximum number of chained decisions that the classifier will make to classify an example. It is useful to limit this to avoid overfitting the training data, as illustrated previously in the pet store example.

The decision tree algorithm is responsible for coming up with potential decision rules to try at each level, like the weight >= 100 or weight >= 500 decisions in the pet store example. Decisions are always of the same form: for numeric features, decisions are of the form feature >= value; and for categorical features, they are of the form feature in (value1, value2, …). So, the set of decision rules to try is really a set of values to plug in to the decision rule. These are referred to as “bins” in the PySpark MLlib implementation. A larger number of bins requires more processing time but might lead to finding a more optimal decision rule.

What makes a decision rule good? Intuitively, a good rule would meaningfully distinguish examples by target category value. For example, a rule that divides the Covtype data set into examples with only categories 1–3 on the one hand and 4–7 on the other would be excellent because it clearly separates some categories from others. A rule that resulted in about the same mix of all categories as are found in the whole data set doesn’t seem helpful. Following either branch of such a decision leads to about the same distribution of possible target values, and so doesn’t really make progress toward a confident classification.

Put another way, good rules divide the training data’s target values into relatively homogeneous, or “pure,” subsets. Picking a best rule means minimizing the impurity of the two subsets it induces. There are two commonly used measures of impurity: Gini impurity and entropy.

Gini impurity is directly related to the accuracy of the random-guess classifier. Within a subset, it is the probability that a randomly chosen classification of a randomly chosen example (both according to the distribution of classes in the subset) is incorrect. This is the sum of products of proportions of classes, but with themselves and subtracted from 1. If a subset has N classes and pi is the proportion of examples of class i, then its Gini impurity is given in the Gini impurity equation:

upper I Subscript upper G Baseline left-parenthesis p right-parenthesis equals 1 minus sigma-summation Underscript i equals 1 Overscript upper N Endscripts p Subscript i Superscript 2

If the subset contains only one class, this value is 0 because it is completely “pure.” When there are N classes in the subset, this value is larger than 0 and is largest when the classes occur the same number of times—maximally impure.

Entropy is another measure of impurity, borrowed from information theory. Its nature is more difficult to explain, but it captures how much uncertainty the collection of target values in the subset implies about predictions for data that falls in that subset. A subset containing one class suggests that the outcome for the subset is completely certain and has 0 entropy—no uncertainty. A subset containing one of each possible class, on the other hand, suggests a lot of uncertainty about predictions for that subset because data have been observed with all kinds of target values. This has high entropy. Hence, low entropy, like low Gini impurity, is a good thing. Entropy is defined by the entropy equation:

upper I Subscript upper E Baseline left-parenthesis p right-parenthesis equals sigma-summation Underscript i equals 1 Overscript upper N Endscripts p Subscript i Baseline log left-parenthesis StartFraction 1 Over p Subscript i Baseline EndFraction right-parenthesis equals minus sigma-summation Underscript i equals 1 Overscript upper N Endscripts p Subscript i Baseline log left-parenthesis p Subscript i Baseline right-parenthesis

Interestingly, uncertainty has units. Because the logarithm is the natural log (base e), the units are nats, the base-e counterpart to more familiar bits (which we can obtain by using log base 2 instead). It really is measuring information, so it’s also common to talk about the information gain of a decision rule when using entropy with decision trees.

One or the other measure may be a better metric for picking decision rules in a given data set. They are, in a way, similar. Both involve a weighted average: a sum over values weighted by pi. The default in PySpark’s implementation is Gini impurity.

Finally, minimum information gain is a hyperparameter that imposes a minimum information gain, or decrease in impurity, for candidate decision rules. Rules that do not improve the subsets impurity enough are rejected. Like a lower maximum depth, this can help the model resist overfitting because decisions that barely help divide the training input may in fact not helpfully divide future data at all.

Now that we understand the relevant hyperparemeters of a decision tree algorithm, we will tune our model in the next section to improve its performance.

Tuning Decision Trees

It’s not obvious from looking at the data which impurity measure leads to better accuracy, or what maximum depth or number of bins is enough without being excessive. Fortunately, as in chapter 3, it’s simple to let PySpark try a number of combinations of these values and report the results.

First, it’s necessary to set up a pipeline encapsulating the same two steps above. Creating the VectorAssembler and DecisionTreeClassifier and chaining these two Transformers together results in a single Pipeline object that represents these two operations together as one operation:

from pyspark.ml import Pipeline

assembler = VectorAssembler(inputCols=inputCols, outputCol="featureVector")
classifier = DecisionTreeClassifier(seed=1234, labelCol="Cover_Type", featuresCol="featureVector", predictionCol="prediction")

pipeline = Pipeline(stages=[assembler, classifier])

Naturally, pipelines can be much longer and more complex. This is about as simple as it gets. Now we can also define the combinations of hyperparameters that should be tested using the PySpark ML API’s built-in support, ParamGridBuilder. It’s also time to define the evaluation metric that will be used to pick the “best” hyperparameters, and that is again MulticlassClassificationEvaluator here.

from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder(). 
  addGrid(classifier.impurity, ["gini", "entropy"]). 
  addGrid(classifier.maxDepth, [1, 20]). 
  addGrid(classifier.maxBins, [40, 300]). 
  addGrid(classifier.minInfoGain, [0.0, 0.05]). 
  build()

multiclassEval = MulticlassClassificationEvaluator(). 
  setLabelCol("Cover_Type"). 
  setPredictionCol("prediction"). 
  setMetricName("accuracy")

This means that a model will be built and evaluated for two values of four hyperparameters. That’s 16 models. They’ll be evaluated by multiclass accuracy. Finally, TrainValidationSplit brings these components together—the pipeline that makes models, model evaluation metrics, and hyperparameters to try—and can run the evaluation on the training data. It’s worth noting that CrossValidator could be used here as well to perform full k-fold cross-validation, but it is k times more expensive and doesn’t add as much value in the presence of big data. So, TrainValidationSplit is used here.

from pyspark.ml.tuning import TrainValidationSplit

validator = TrainValidationSplit(seed=1234,
  estimator=pipeline,
  evaluator=multiclassEval,
  estimatorParamMaps=paramGrid,
  trainRatio=0.9)

validatorModel = validator.fit(trainData)

This will take minutes or more, depending on your hardware, because it’s building and evaluating many models. Note the train ratio parameter is set to 0.9. This means that the training data is actually further subdivided by TrainValidationSplit into 90%/10% subsets. The former is used for training each model. The remaining 10% of the input is held out as a cross-validation set to evaluate the model. If it’s already holding out some data for evaluation, then why did we hold out 10% of the original data as a test set?

If the purpose of the CV set was to evaluate parameters that fit to the training set, then the purpose of the test set is to evaluate hyperparameters that were “fit” to the CV set. That is, the test set ensures an unbiased estimate of the accuracy of the final, chosen model and its hyperparameters.

Say that the best model chosen by this process exhibits 90% accuracy on the CV set. It seems reasonable to expect it will exhibit 90% accuracy on future data. However, there’s an element of randomness in how these models are built. By chance, this model and evaluation could have turned out unusually well. The top model and evaluation result could have benefited from a bit of luck, so its accuracy estimate is likely to be slightly optimistic. Put another way, hyperparameters can overfit too.

To really assess how well this best model is likely to perform on future examples, we need to evaluate it on examples that were not used to train it. But we also need to avoid examples in the CV set that were used to evaluate it. That is why a third subset, the test set, was held out.

The result of the validator contains the best model it found. This itself is a representation of the best overall pipeline it found, because we provided an instance of a pipeline to run. In order to query the parameters chosen by DecisionTreeClassifier, it’s necessary to manually extract DecisionTreeClassificationModel from the resulting PipelineModel, which is the final stage in the pipeline.

from pprint import pprint

bestModel = validatorModel.bestModel
pprint(bestModel.stages[1].extractParamMap())

...
{Param(parent='DecisionTreeClassifier_2b5c68a71772', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 1,
 Param(parent='DecisionTreeClassifier_2b5c68a71772', name='minWeightFractionPerNode', doc='Minimum fraction of the weighted sample count that each child must have after split. If a split causes the fraction of the total weight in the left or right child to be less than minWeightFractionPerNode, the split will be discarded as invalid. Should be in interval [0.0, 0.5).'): 0.0,
 Param(parent='DecisionTreeClassifier_2b5c68a71772', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'): False,
 ...
}

This contains a lot of information about the fitted model, but it also tells us that “entropy” apparently worked best as the impurity measure and that a max depth of 20 was not surprisingly better than 1. It might be surprising that the best model was fit with just 40 bins, but this is probably a sign that 40 was “plenty” rather than “better” than 300. Lastly, no minimum information gain was better than a small minimum, which could imply that the model is more prone to underfit than overfit.

You may wonder if it is possible to see the accuracy that each of the models achieved for each combination of hyperparameters. The hyperparameters as well as the evaluations are exposed by getEstimatorParamMaps and validationMetrics, respectively. They can be combined to display all of the parameter combinations sorted by metric value:

validatorModel = validator.fit(trainData)

paramsAndMetrics = validatorModel.validationMetrics.
  zip(validatorModel.getEstimatorParamMaps).sortBy(-_._1)

paramsAndMetrics.foreach { case (metric, params) =>
    println(metric)
    println(params)
    println()
}

...
0.9138483377774368
{
	dtc_3e3b8bb692d1-impurity: entropy,
	dtc_3e3b8bb692d1-maxBins: 40,
	dtc_3e3b8bb692d1-maxDepth: 20,
	dtc_3e3b8bb692d1-minInfoGain: 0.0
}

0.9122369506416774
{
	dtc_3e3b8bb692d1-impurity: entropy,
	dtc_3e3b8bb692d1-maxBins: 300,
	dtc_3e3b8bb692d1-maxDepth: 20,
	dtc_3e3b8bb692d1-minInfoGain: 0.0
}
...

What was the accuracy that this model achieved on the CV set? And finally, what accuracy does the model achieve on the test set?

validatorModel.validationMetrics.max
multiclassEval.evaluate(bestModel.transform(testData)) 1

...
0.9138483377774368
0.9139978718291971
1

bestModel is a complete pipeline.

The results are both about 91%. It happens that the estimate from the CV set was pretty fine to begin with. In fact, it is not usual for the test set to show a very different result.

This is an interesting point at which to revisit the issue of overfitting. As discussed previously, it’s possible to build a decision tree so deep and elaborate that it fits the given training examples very well or perfectly but fails to generalize to other examples because it has fit the idiosyncrasies and noise of the training data too closely. This is a problem common to most machine learning algorithms, not just decision trees.

When a decision tree has overfit, it will exhibit high accuracy when run on the same training data that it fit the model to, but low accuracy on other examples. Here, the final model’s accuracy was about 91% on other, new examples. Accuracy can just as easily be evaluated over the same data that the model was trained on, trainData. This gives an accuracy of about 95%. The difference is not large but suggests that the decision tree has overfit the training data to some extent. A lower maximum depth might be a better choice.

So far, we’ve implicitly treated all input features, including categoricals, as if they’re numeric. Can we improve our model’s performance further by treating categorical features as exactly that? We will explore this next.

Categorical Features Revisited

The categorical features in our dataset are one-hot encoded as several binary 0/1 values. Treating these individual features as numeric turns out to be fine, because any decision rule on the “numeric” features will choose thresholds between 0 and 1, and all are equivalent since all values are 0 or 1.

Of course, this encoding forces the decision tree algorithm to consider the values of the underlying categorical features individually. Because features like soil type are broken down into many features, and because decision trees treat features individually, it is harder to relate information about related soil types.

For example, nine different soil types are actually part of the Leighcan family, and they may be related in ways that the decision tree can exploit. If soil type were encoded as a single categorical feature with 40 soil values, then the tree could express rules like “if the soil type is one of the nine Leighton family types” directly. However, when encoded as 40 features, the tree would have to learn a sequence of nine decisions on soil type to do the same, this expressiveness may lead to better decisions and more efficient trees.

However, having 40 numeric features represent one 40-valued categorical feature increases memory usage and slows things down.

What about undoing the one-hot encoding? This would replace, for example, the four columns encoding wilderness type with one column that encodes the wilderness type as a number between 0 and 3, like “Cover_Type”.

import org.apache.spark.sql.functions._

def unencodeOneHot(data: DataFrame): DataFrame = {
  val wildernessCols = (0 until 4).map(i => s"Wilderness_Area_$i").toArray

  val wildernessAssembler = new VectorAssembler().
    setInputCols(wildernessCols).
    setOutputCol("wilderness")

  val unhotUDF = udf((vec: Vector) => vec.toArray.indexOf(1.0).toDouble) 1

  val withWilderness = wildernessAssembler.transform(data).
    drop(wildernessCols:_*). 2
    withColumn("wilderness", unhotUDF($"wilderness")) 3

  val soilCols = (0 until 40).map(i => s"Soil_Type_$i").toArray

  val soilAssembler = new VectorAssembler().
    setInputCols(soilCols).
    setOutputCol("soil")

  soilAssembler.transform(withWilderness).
    drop(soilCols:_*).
    withColumn("soil", unhotUDF($"soil"))
}
1

Note UDF definition

2

Drop one-hot columns; no longer needed

3

Overwrite column with numeric one of same name

Here VectorAssembler is deployed to combine the 4 and 40 wilderness and soil type columns into two Vector columns. The values in these Vectors are all 0, except for one location that has a 1. There’s no simple DataFrame function for this, so we have to define our own UDF that can be used to operate on columns. This turns these two new columns into numbers of just the type we need.

From here, nearly the same process as above can be used to tune the hyperparameters of a decision tree model built on this data and to choose and evaluate a best model. There’s one important difference, however. The two new numeric columns have nothing about them that indicates they’re actually an encoding of categorical values. To treat them as numbers would be wrong, as their ordering is meaningless. However, it would silently succeed; the information in these features would be all but lost though.

Internally MLlib can store additional metadata about each column. The details of this data are generally hidden from the caller, but includes information such as whether the column encodes a categorical value and how many distinct values it takes on. In order to add this metadata, it’s necessary to put the data through VectorIndexer. Its job is to turn input into properly labeled categorical feature columns. Although we did much of the work already to turn the categorical features into 0-indexed values, VectorIndexer will take care of the metadata.

We need to add this stage to the Pipeline:

import org.apache.spark.ml.feature.VectorIndexer

val unencTrainData = unencodeOneHot(trainData)

val inputCols = unencTrainData.columns.filter(_ != "Cover_Type")
val assembler = new VectorAssembler().
  setInputCols(inputCols).
  setOutputCol("featureVector")

val indexer = new VectorIndexer().
  setMaxCategories(40). 1
  setInputCol("featureVector").
  setOutputCol("indexedVector")

val classifier = new DecisionTreeClassifier().
  setSeed(Random.nextLong()).
  setLabelCol("Cover_Type").
  setFeaturesCol("indexedVector").
  setPredictionCol("prediction")

val pipeline = new Pipeline().setStages(Array(assembler, indexer, classifier))
1

>= 40 because soil has 40 values

The approach assumes that the training set contains all possible values of each of the categorical features at least once. That is, it works correctly only if all 4 soil values and all 40 wilderness values appear in the training set so that all possible values get a mapping. Here, that happens to be true, but may not be for small training sets of data in which some labels appear very infrequently. In those cases, it could be necessary to manually create and add a VectorIndexerModel with the complete value mapping supplied manually.

Aside from that, the process is the same as before. You should find that it chose a similar best model but that accuracy on the test set is about 93%. By treating categorical features as actual categorical features, the classifier improved its accuracy by almost 2%.

We have trained and tuned a decision tree. Now, we will move on to random forests, a more powerful algorithm. As we will see in the next section, implementing them using PySpark will be surprisingly straightforwad at this point.

Random Forests

If you have been following along with the code examples, you may have noticed that your results differ slightly from those presented in the code listings in the book. That is because there is an element of randomness in building decision trees, and the randomness comes into play when you’re deciding what data to use and what decision rules to explore.

The algorithm does not consider every possible decision rule at every level. To do so would take an incredible amount of time. For a categorical feature over N values, there are 2N–2 possible decision rules (every subset except the empty set and entire set). For even moderately large N, this would create billions of candidate decision rules.

Instead, decision trees use several heuristics to determine which few rules to actually consider. The process of picking rules also involves some randomness; only a few features picked at random are looked at each time, and only values from a random subset of the training data. This trades a bit of accuracy for a lot of speed, but it also means that the decision tree algorithm won’t build the same tree every time. This is a good thing.

It’s good for the same reason that the “wisdom of the crowds” usually beats individual predictions. To illustrate, take this quick quiz: How many black taxis operate in London?

Don’t peek at the answer; guess first.

I guessed 10,000, which is well off the correct answer of about 19,000. Because I guessed low, you’re a bit more likely to have guessed higher than I did, and so the average of our answers will tend to be more accurate. There’s that regression to the mean again. The average guess from an informal poll of 13 people in the office was indeed closer: 11,170.

A key to this effect is that the guesses were independent and didn’t influence one another. (You didn’t peek, did you?) The exercise would be useless if we had all agreed on and used the same methodology to make a guess, because the guesses would have been the same answer—the same potentially quite wrong answer. It would even have been different and worse if I’d merely influenced you by stating my guess upfront.

It would be great to have not one tree, but many trees, each producing reasonable but different and independent estimations of the right target value. Their collective average prediction should fall close to the true answer, more than any individual tree’s does. It’s the randomness in the process of building that helps create this independence. This is the key to random forests.

Randomness is injected by building many trees, each of which sees a different random subset of data—and even of features. This makes the forest as a whole less prone to overfitting. If a particular feature contains noisy data or is deceptively predictive only in the training set, then most trees will not consider this problem feature most of the time. Most trees will not fit the noise and will tend to “outvote” the trees that have fit the noise in the forest.

The prediction of a random forest is simply a weighted average of the trees’ predictions. For a categorical target, this can be a majority vote or the most probable value based on the average of probabilities produced by the trees. Random forests, like decision trees, also support regression, and the forest’s prediction in this case is the average of the number predicted by each tree.

While random forests are a more powerful and complex classification technique, the good news is that it’s virtually no different to use it in the pipeline that has been developed in this chapter. Simply drop in a RandomForestClassifier in place of DecisionTreeClassifier and proceed as before. There’s really no more code or API to understand in order to use it.

from pyspark.ml.classification import RandomForestClassifier

classifier = RandomForestClassifier(seed=1234, labelCol="Cover_Type", featuresCol="indexedVector", predictionCol="prediction")

Note that this classifier has another hyperparameter: the number of trees to build. Like the max bins hyperparameter, higher values should give better results up to a point. The cost, however, is that building many trees of course takes many times longer than building one.

The accuracy of the best random forest model produced from a similar tuning process is 95% off the bat—about 2% better already, although viewed another way, that’s a 28% reduction in the error rate over the best decision tree built previously, from 7% down to 5%. You may do better with further tuning.

Incidentally, at this point we have a more reliable picture of feature importance:

from pyspark.ml.classificatiom import RandomForestClassificationModel

forestModel = bestModel.stages[1]

forestModel.featureImportances.toArray.zip(inputCols).
  sorted.reverse.foreach(println)

...
(0.28877055118903183,Elevation)
(0.17288279582959612,soil)
(0.12105056811661499,Horizontal_Distance_To_Roadways)
(0.1121550648692802,Horizontal_Distance_To_Fire_Points)
(0.08805270405239551,wilderness)
(0.04467393191338021,Vertical_Distance_To_Hydrology)
(0.04293099150373547,Horizontal_Distance_To_Hydrology)
(0.03149644050848614,Hillshade_Noon)
(0.028408483578137605,Hillshade_9am)
(0.027185325937200706,Aspect)
(0.027075578474331806,Hillshade_3pm)
(0.015317564027809389,Slope)

Random forests are appealing in the context of big data because trees are supposed to be built independently, and big data technologies like Spark and MapReduce inherently need data-parallel problems, where parts of the overall solution can be computed independently on parts of the data. The fact that trees can, and should, train on only a subset of features or input data makes it trivial to parallelize building the trees.

Making Predictions

Building a classifier, while an interesting and nuanced process, is not the end goal. The goal is to make predictions. This is the payoff, and it is comparatively quite easy.

The resulting “best model” is actually a whole pipeline of operations, which encapsulate how input is transformed for use with the model and includes the model itself, which can make predictions. It can operate on a data frame of new input. The only difference from the data DataFrame we started with is that it lacks the “Cover_Type” column. When we’re making predictions—especially about the future, says Mr. Bohr—the output is of course not known.

To prove it, try dropping the “Cover_Type” from the test data input and obtaining a prediction:

bestModel.transform(unencTestData.drop("Cover_Type")).select("prediction").show()

...
+----------+
|prediction|
+----------+
|       6.0|
+----------+

The result should be 6.0, which corresponds to class 7 (the original feature was 1-indexed) in the original Covtype data set. The predicted cover type for the land described in this example is Krummholz.

Where to Go from Here

This chapter introduced two related and important types of machine learning, classification and regression, along with some foundational concepts in building and tuning models: features, vectors, training, and cross-validation. It demonstrated how to predict a type of forest cover from things like location and soil type using the Covtype data set, with decision trees and forests implemented in PySpark.

As with recommenders in chapter 3, it could be useful to continue exploring the effect of hyperparameters on accuracy. Most decision tree hyperparameters trade time for accuracy: more bins and trees generally produce better accuracy but hit a point of diminishing returns.

The classifier here turned out to be very accurate. It’s unusual to achieve more than 95% accuracy. In general, you will achieve further improvements in accuracy by including more features or transforming existing features into a more predictive form. This is a common, repeated step in iteratively improving a classifier model. For example, for this data set, the two features encoding horizontal and vertical distance-to-surface-water features could produce a third feature: straight-line distance-to-surface-water features. This might turn out to be more useful than either original feature. Or, if it were possible to collect more data, we might try adding new information like soil moisture in order to improve classification.

Of course, not all prediction problems in the real world are exactly like the Covtype data set. For example, some problems require predicting a continuous numeric value, not a categorical value. Much of the same analysis and code applies to this type of regression problem; the RandomForestRegressor class will be of use in this case.

Furthermore, decision trees and forests are not the only classification or regression algorithms, and not the only ones implemented in PySpark, as we’d pointed out at the beginning of the chapter. Each of these algorithms operates quite differently from decision trees and forests. However, many elements are the same: they plug into a Pipeline and operate on columns in a data frame, and have hyperparameters that you must select using training, cross-validation, and test subsets of the input data. The same general principles, with these other algorithms, can also be deployed to model classification and regression problems.

These have been examples of supervised learning. What happens when some, or all, of the target values are unknown? The following chapter will explore what can be done in this situation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset