The larger the attribute space, or the number of dimensions, the harder it usually is to predict the label for a given combination of attribute values. This is mostly because the total number of possible distinct combinations of attributes grows exponentially with the dimensionality of the attribute space, at least for discrete variables: with 20 binary attributes there are already 2^20, or roughly a million, distinct combinations. (For continuous variables the situation is more complex and depends on the metric used.) As a result, it becomes harder to generalize.
The effective dimensionality of the problem might be different from the dimensionality of the input space. For example, if the label depends only on a linear combination of the (continuous) input attributes, the problem is called linearly separable and its internal dimensionality is one; we still have to find the coefficients of this linear combination, as in logistic regression.
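As a minimal sketch (not part of the spark-shell session below, and assuming the same data RDD of LabeledPoint with binary labels that the session uses), logistic regression recovers exactly the coefficients of that single linear combination:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Fit a binary logistic regression model; `data` is assumed to be the same
// RDD[LabeledPoint] used in the PCA session below.
val lrModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(data)

// One weight per input attribute plus an intercept: the coefficients of the
// single linear combination the label depends on.
println(s"weights = ${lrModel.weights}, intercept = ${lrModel.intercept}")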
This idea is also sometimes referred to as the Vapnik–Chervonenkis (VC) dimension of a problem, model, or algorithm: the expressive power of a model is determined by how complex the dependencies it can solve, or shatter, might be. More complex problems require algorithms with higher VC dimensions and larger training sets. However, using an algorithm with a higher VC dimension on a simple problem can lead to overfitting and worse generalization to new data.
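To illustrate shattering, the following standalone sketch in plain Scala (the three points and the coarse search grid are arbitrary choices made here for illustration) brute-forces linear separators to confirm that a line in the plane can realize all 2^3 = 8 labelings of three points in general position; no four points can be labeled arbitrarily this way, so the VC dimension of linear classifiers in the plane is three:

// Brute-force check that a line (the sign of w.x - t) can shatter three points.
val points = Seq((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))

// Coarse grid over line orientations and offsets; enough for this toy example.
val candidates = for {
  deg <- 0 until 360                   // orientation of the normal vector w
  t   <- (-40 to 40).map(_ * 0.05)     // offset of the separating line
} yield (math.cos(math.toRadians(deg)), math.sin(math.toRadians(deg)), t)

// A labeling is realizable if some candidate line puts every point on its labeled side.
def separable(labels: Seq[Int]): Boolean = candidates.exists { case (wx, wy, t) =>
  points.zip(labels).forall { case ((x, y), l) => l * (wx * x + wy * y - t) > 0 }
}

val labelings = for (a <- Seq(-1, 1); b <- Seq(-1, 1); c <- Seq(-1, 1)) yield Seq(a, b, c)
println(s"all 8 labelings realizable: ${labelings.forall(separable)}")   // expected: true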
If the units of the input attributes are comparable, say, all of them are meters or units of time, then PCA or, more generally, kernel methods can be used to reduce the dimensionality of the input space:
$ bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LabeledPoint

scala> import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.feature.PCA

scala> import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.util.MLUtils

scala> val pca = new PCA(2).fit(data.map(_.features))
pca: org.apache.spark.mllib.feature.PCAModel = org.apache.spark.mllib.feature.PCAModel@4eee0b1a

scala> val reduced = data.map(p => p.copy(features = pca.transform(p.features)))
reduced: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[311] at map at <console>:39

scala> reduced.collect().take(10)
res4: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,[-2.827135972679021,-5.641331045573367]), (0.0,[-2.7959524821488393,-5.145166883252959]), (0.0,[-2.621523558165053,-5.177378121203953]), (0.0,[-2.764905900474235,-5.0035994150569865]), (0.0,[-2.7827501159516546,-5.6486482943774305]), (0.0,[-3.231445736773371,-6.062506444034109]), (0.0,[-2.6904524156023393,-5.232619219784292]), (0.0,[-2.8848611044591506,-5.485129079769268]), (0.0,[-2.6233845324473357,-4.743925704477387]), (0.0,[-2.8374984110638493,-5.208032027056245]))

scala> import scala.language.postfixOps
import scala.language.postfixOps

scala> pca pc
res24: org.apache.spark.mllib.linalg.DenseMatrix =
-0.36158967738145065  -0.6565398832858496
0.08226888989221656   -0.7297123713264776
-0.856572105290527    0.17576740342866465
-0.35884392624821626  0.07470647013502865

scala> import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}

scala> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

scala> val splits = reduced.randomSplit(Array(0.6, 0.4), seed = 1L)
splits: Array[org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]] = Array(MapPartitionsRDD[312] at randomSplit at <console>:44, MapPartitionsRDD[313] at randomSplit at <console>:44)

scala> val training = splits(0).cache()
training: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[312] at randomSplit at <console>:44

scala> val test = splits(1)
test: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[313] at randomSplit at <console>:44

scala> val numIterations = 100
numIterations: Int = 100

scala> val model = SVMWithSGD.train(training, numIterations)
model: org.apache.spark.mllib.classification.SVMModel = org.apache.spark.mllib.classification.SVMModel: intercept = 0.0, numFeatures = 2, numClasses = 2, threshold = 0.0

scala> model.clearThreshold()
res30: model.type = org.apache.spark.mllib.classification.SVMModel: intercept = 0.0, numFeatures = 2, numClasses = 2, threshold = None

scala> val scoreAndLabels = test.map { point =>
     |   val score = model.predict(point.features)
     |   (score, point.label)
     | }
scoreAndLabels: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[517] at map at <console>:54

scala> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@27f49b8c

scala> val auROC = metrics.areaUnderROC()
auROC: Double = 1.0

scala> println("Area under ROC = " + auROC)
Area under ROC = 1.0
Here, we reduced the original four-dimensional problem to a two-dimensional one. Like averaging, computing linear combinations of the input attributes and keeping only those that describe most of the variance helps to reduce noise.
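As a rough check (a sketch that assumes the data and reduced RDDs from the session above are still in scope), the variance captured by the two retained components can be compared with the per-attribute variance of the original data using MLlib's column statistics:

import org.apache.spark.mllib.stat.Statistics

// Variance of each of the two retained principal components versus the
// per-attribute variance of the original four-dimensional data.
val originalVar = Statistics.colStats(data.map(_.features)).variance
val reducedVar  = Statistics.colStats(reduced.map(_.features)).variance

println(s"original per-attribute variance: $originalVar")
println(s"variance of the two components:  $reducedVar")

If the two component variances account for most of the total, little information was lost in the projection.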