Methods for churn prediction

In the previous section, we described the business use case and prepared our Spark computing platform and datasets. In this section, we select the analytical methods or predictive models (equations) for this churn prediction project; that is, we map our business use case to machine learning methods.

Research accumulated over many years has led customer satisfaction professionals to believe that product and service features affect service quality, which affects customer satisfaction, which in turn drives customer churn. Therefore, we should incorporate this knowledge into our model design or equation specification.

From an analytical perspective, many models are suitable for modelling and predicting customer churn; the most commonly used among them are logistic regression and decision trees. For this exercise, we will use both, and then use evaluation to determine which one performs best.

As always, once we have chosen our analytical methods or models, we need to prepare the target variable and get ready for coding, in this case with the Spark machine learning libraries.

Regression models

Regression is one of the most commonly used methods for prediction, and has been used to model customer churn by many machine learning professionals.

  • Types of regression models: There are two main kinds of regression model suitable for churn prediction: linear regression and logistic regression. For this project, logistic regression is the better fit, because our target variable, whether the customer departed, takes discrete values. In the real-life project, we also used linear regression to model customer satisfaction, since many predictors affect customer satisfaction and, through it, customer churn. But in this example, our focus is on logistic regression. To further improve model performance, we may also try LassoModel and RidgeRegressionModel, which are available in MLlib.
  • Preparing coding: In MLlib, for linear regression, we can reuse the code used earlier, as follows:
    val numIterations = 95
    val model = LinearRegressionWithSGD.train(trainingData, numIterations)

    For logistic regression, we will use the code used earlier, as follows (note that setNumClasses is provided by LogisticRegressionWithLBFGS, not by LogisticRegressionWithSGD):

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)
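    Since we will later evaluate both models to choose the best one, it is worth seeing how a fitted logistic regression model can be scored. The following is only a sketch, assuming a held-out RDD[LabeledPoint] named test was split off from the data earlier; clearThreshold() makes predict() return the raw churn probability instead of a 0/1 label:

    ```scala
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // Return raw churn probabilities rather than hard 0/1 labels.
    model.clearThreshold()

    // Score each held-out customer, pairing the probability with the true label.
    val scoreAndLabels = test.map { point =>
      (model.predict(point.features), point.label)
    }

    // Area under the ROC curve summarizes ranking quality across all thresholds.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(s"Area under ROC = ${metrics.areaUnderROC()}")
    ```

    The same scoreAndLabels construction works for any model that produces a numeric score, which makes AUC a convenient common yardstick for the model comparison ahead.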

Decision trees and Random forest

Both decision trees and random forests are classification models; for our use case, they classify customers as departed or not departed, with the results illustrated as trees.

  • Introduction to Decision trees and Random forest.

    Specifically, decision tree modeling uses tree branching based on value comparisons to illustrate the impact of predictive features. Compared with logistic regression, it is easy to use and also robust to missing data. This robustness is a big advantage for our use case, as our data has a significant amount of incompleteness.

    A random forest combines a set of trees, often hundreds of them, and comes with ready-to-use functions for producing risk scores (churn probabilities) and for ranking predictive variables by their impact on the target variable, which helps us identify the most effective interventions for reducing customer churn.

    However, averaging the results of hundreds of trees somewhat obscures the details, so a single decision tree's explanation can still be very intuitive and valuable.

  • Preparing coding.

    As done earlier, within MLlib, we can use the following code:

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 6
    val maxBins = 32
    
    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins) 
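    To compare this tree against the regression models later, we can measure its error on held-out data. The following is a minimal sketch, assuming a testData RDD[LabeledPoint] was split off earlier:

    ```scala
    // Predict the departed/not-departed label for each held-out customer.
    val labelAndPreds = testData.map { point =>
      (point.label, model.predict(point.features))
    }

    // Fraction of customers the tree misclassifies.
    val testErr = labelAndPreds.filter { case (label, pred) =>
      label != pred
    }.count().toDouble / testData.count()
    println(s"Decision tree test error = $testErr")
    ```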

    We may also expand our work to random forests; in MLlib, we can use the following code for a random forest:

    // To train a RandomForest model.
    val treeStrategy = Strategy.defaultStrategy("Classification")
    val numTrees = 300 
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val model = RandomForest.trainClassifier(trainingData,
      treeStrategy, numTrees, featureSubsetStrategy, seed = 12345)
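    The risk scores mentioned earlier can be approximated from the RDD-based forest by averaging the individual trees' votes; model.trees exposes the fitted DecisionTreeModels. This is only a sketch, assuming the same held-out testData as above:

    ```scala
    // The mean of the trees' 0/1 votes serves as a churn probability estimate.
    val riskScores = testData.map { point =>
      val votes = model.trees.map(tree => tree.predict(point.features))
      (votes.sum / model.numTrees, point.label)
    }

    // Inspect a few scored customers.
    riskScores.take(5).foreach { case (score, label) =>
      println(f"churn risk = $score%.3f, actual = $label%.0f")
    }
    ```

    Customers with the highest estimated risk are natural candidates for retention interventions.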

    Note

    More guidance about coding for decision trees can be found at:

    http://spark.apache.org/docs/latest/mllib-decision-tree.html

    and for Random Forest at:

    http://spark.apache.org/docs/latest/mllib-ensembles.html
