In the previous section, we described the business use case and prepared our Spark computing platform and datasets. In this section, we select the analytical methods or predictive models (equations) for this churn prediction project; that is, we map our business use case to machine learning methods.
Years of research by customer satisfaction professionals suggest that product and service features affect service quality, which affects customer satisfaction, which in turn drives customer churn. We should therefore incorporate this knowledge into our model design or equation specification.
From an analytical perspective, there are many suitable models for modeling and predicting customer churn; among them, the most commonly used are logistic regression and decision trees. For this exercise, we will use both, and then use evaluation to determine which one performs best.
As always, once we finalize our choice of analytical methods or models, we will need to prepare the target variable and prepare for coding, in this case with Spark's machine learning libraries.
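Before fitting any model, the churn outcome must be encoded as a numeric label that binary classifiers expect. The following is a minimal pure-Scala sketch of that encoding step; the `CustomerRecord` fields and values are assumptions for illustration, and in practice each label would be paired with a feature vector (for example, in an MLlib `LabeledPoint`):

```scala
// Hypothetical sketch: encoding the churn target variable as a numeric label.
// The field names (accountId, churned) are made up for illustration.
case class CustomerRecord(accountId: String, churned: Boolean)

// Map the Boolean churn flag to the 1.0/0.0 label expected by binary classifiers.
def toLabel(record: CustomerRecord): Double =
  if (record.churned) 1.0 else 0.0

val customers = Seq(
  CustomerRecord("A-100", churned = true),
  CustomerRecord("A-101", churned = false)
)
val labels = customers.map(toLabel) // Seq(1.0, 0.0)
```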
Regression is one of the most commonly used methods for prediction, and has been used to model customer churn by many machine learning professionals.
In MLlib, we can train a linear regression model as follows:

val numIterations = 95
val model = LinearRegressionWithSGD.train(trainingData, numIterations)
For logistic regression, we will reuse the code used earlier, as follows:
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
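Conceptually, logistic regression turns a linear score into a churn probability by passing it through the logistic (sigmoid) function. The following pure-Scala sketch illustrates this mapping; the weights, feature values, and intercept are made up for illustration only:

```scala
// The logistic (sigmoid) function maps any real-valued score to (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Linear score: dot product of weights and features, plus an intercept.
// All numbers here are hypothetical, purely to show the mechanics.
def churnProbability(weights: Array[Double],
                     features: Array[Double],
                     intercept: Double): Double = {
  val score = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  sigmoid(score)
}

val p = churnProbability(Array(0.8, -1.2), Array(1.0, 0.5), intercept = -0.1)
// A probability above 0.5 would classify the customer as likely to churn.
```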
Both decision trees and random forests are classification methods; for our use case, they classify each customer as departed or not departed, with results that can be illustrated as trees.
Specifically, decision tree modeling uses tree branching based on value comparisons to illustrate the impact of predictive features. Compared with logistic regression, it is easy to use and also robust to missing data. Robustness to missing data is a big advantage for this use case, as we have a significant amount of incomplete data here.
A random forest combines a set of trees, often hundreds of them, and offers ready-to-use functions for producing risk scores (churn probabilities) and for ranking predictive variables by their impact on the target variable, which helps us identify the most effective interventions for reducing customer churn.
However, averaging the results of hundreds of trees obscures some of the detail, so a single decision tree's explanation can still be very intuitive and valuable.
As done earlier, within MLlib, we can use the following code:
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 6
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)
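The `impurity = "gini"` setting above tells the tree how to measure class mixing at each node: Gini impurity is 1 minus the sum of squared class proportions, so a pure node (all churned or all retained) scores 0 and a 50/50 split scores 0.5. A minimal pure-Scala sketch of the calculation:

```scala
// Gini impurity for a node's labels: 1 - sum over classes of p_i^2,
// where p_i is the proportion of labels in class i.
def giniImpurity(labels: Seq[Double]): Double = {
  val n = labels.size.toDouble
  val proportions = labels.groupBy(identity).values.map(_.size / n)
  1.0 - proportions.map(p => p * p).sum
}

val pureNode  = Seq(1.0, 1.0, 1.0)       // all churned: impurity 0.0
val mixedNode = Seq(1.0, 0.0, 1.0, 0.0)  // 50/50 split: impurity 0.5
```

The tree chooses splits that reduce this impurity the most, which is how the branching in the trained model reflects the most informative features.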
We may also extend our work to random forests; in MLlib, we can use the following code:
// Train a RandomForest model.
val treeStrategy = Strategy.defaultStrategy("Classification")
val numTrees = 300
val featureSubsetStrategy = "auto" // Let the algorithm choose.

val model = RandomForest.trainClassifier(trainingData, treeStrategy,
  numTrees, featureSubsetStrategy, seed = 12345)
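As noted earlier, a forest produces a risk score by combining the votes of its individual trees. The following pure-Scala sketch shows the idea: each tree casts a 0.0/1.0 vote and the forest averages them into a churn probability. The vote counts below are made up for illustration only:

```scala
// Average the per-tree 0.0/1.0 votes into a churn risk score in [0, 1].
def riskScore(treeVotes: Seq[Double]): Double =
  treeVotes.sum / treeVotes.size

// Suppose 300 trees voted and 210 of them predicted "churn" (1.0).
val votes = Seq.fill(210)(1.0) ++ Seq.fill(90)(0.0)
val churnRisk = riskScore(votes) // 0.7: a fairly high churn risk
```

Customers can then be ranked by this score, so retention efforts can be targeted at the accounts most likely to depart.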
More guidance about coding for decision trees can be found at:
http://spark.apache.org/docs/latest/mllib-decision-tree.html