We have seen classification trees in the previous chapter. One can build a recursive divide-and-conquer structure for a regression problem as well, where each split is chosen to minimize the remaining variance. Regression trees are less popular than classification trees or classical ANOVA; nevertheless, let's walk through an example of a regression tree as part of MLlib:
akozlov@Alexanders-MacBook-Pro$ bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1-SNAPSHOT
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.DecisionTree

scala> import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.tree.model.DecisionTreeModel

scala> import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.util.MLUtils

scala> // Load and parse the data file.

scala> val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[6] at map at MLUtils.scala:112

scala> // Split the data into training and test sets (30% held out for testing)

scala> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
trainingData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[7] at randomSplit at <console>:26
testData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[8] at randomSplit at <console>:26

scala> val categoricalFeaturesInfo = Map[Int, Int]()
categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int] = Map()

scala> val impurity = "variance"
impurity: String = variance

scala> val maxDepth = 5
maxDepth: Int = 5

scala> val maxBins = 32
maxBins: Int = 32

scala> val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
model: org.apache.spark.mllib.tree.model.DecisionTreeModel = DecisionTreeModel regressor of depth 2 with 5 nodes

scala> val labelsAndPredictions = testData.map { point =>
     |   val prediction = model.predict(point.features)
     |   (point.label, prediction)
     | }
labelsAndPredictions: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[20] at map at <console>:36

scala> val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
testMSE: Double = 0.07407407407407407

scala> println(s"Test Mean Squared Error = $testMSE")
Test Mean Squared Error = 0.07407407407407407

scala> println("Learned regression tree model: " + model.toDebugString)
Learned regression tree model: DecisionTreeModel regressor of depth 2 with 5 nodes
  If (feature 378 <= 71.0)
   If (feature 100 <= 165.0)
    Predict: 0.0
   Else (feature 100 > 165.0)
    Predict: 1.0
  Else (feature 378 > 71.0)
   Predict: 1.0
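Once trained, the model can be persisted and reloaded without retraining, using MLlib's save/load support for DecisionTreeModel. Here is a minimal sketch, assuming the sc and model values from the session above; the output path is just an example:

// Persist the trained regressor to disk (the path is hypothetical)
model.save(sc, "target/tmp/regressionTreeModel")

// Reload it later, for instance in a separate scoring job
val sameModel = DecisionTreeModel.load(sc, "target/tmp/regressionTreeModel")
println(sameModel.toDebugString)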
The split at each level is chosen to minimize the weighted variance of the resulting nodes:

$$\sum_{L \in \text{leaves}} \frac{|L|}{N}\,\mathrm{Var}_L(y) \;=\; \frac{1}{N} \sum_{L \in \text{leaves}} \sum_{i \in L} \left(y_i - \bar{y}_L\right)^2$$

which is equivalent to minimizing the squared distances between the label values and their mean within each leaf, summed over all the leaves of the node.
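To make the criterion concrete, here is a small self-contained Scala sketch, independent of MLlib; the helper names sse and splitScore are ours. It scores a candidate split by exactly this quantity, the summed squared deviations from the leaf means:

object VarianceSplit {
  // Sum of squared deviations of the labels from their mean:
  // the quantity each leaf contributes to the split criterion.
  def sse(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val mean = labels.sum / labels.size
      labels.map(y => math.pow(y - mean, 2)).sum
    }

  // Score a candidate split: total SSE of the two resulting leaves.
  // The best split is the one with the smallest score.
  def splitScore(left: Seq[Double], right: Seq[Double]): Double =
    sse(left) + sse(right)

  def main(args: Array[String]): Unit = {
    val labels = Seq(0.0, 0.1, 0.0, 1.0, 0.9, 1.1)
    // Separating the low labels from the high ones leaves almost no variance.
    println(splitScore(labels.take(3), labels.drop(3)))  // ~0.027
    // A split that mixes the two groups retains most of the variance.
    println(splitScore(labels.take(1), labels.drop(1)))  // ~1.108
  }
}

The tree builder evaluates candidate thresholds in this fashion for each feature and keeps the split with the lowest score, then recurses on the two children.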