Methods of attrition prediction

In the previous section, we described our use case of predicting student attrition and prepared our Spark computing platform. In this section, we map the use case to machine learning methods; that is, we select the analytical methods or predictive models (equations) for this attrition prediction project.

To model and predict student attrition, the most suitable models include logistic regression and decision trees, as both yield good results. Some researchers use neural networks and SVMs, but the results are generally no better than those of logistic regression. Therefore, for this exercise, we will focus on logistic regression and decision trees, along with random forests as an extension of decision trees, and then use model evaluation to determine which performs best.

As always, once we finalize our decision regarding analytical methods or models, we need to prepare for coding.

Regression models

We used regression in previous chapters; in particular, in Chapter 6, Churn Prediction on Spark, we applied logistic regression with good results. As predicting student attrition has a lot in common with predicting customer churn, we will reuse much of the work presented in Chapter 6, Churn Prediction on Spark.

About regression

There are two kinds of regression modeling suitable for attrition prediction, as with churn prediction: linear regression and logistic regression. For this project, logistic regression is more suitable because our primary target variable is dichotomous (whether or not the student left); we also have student performance available as an alternative target. Logistic regression models such a discrete choice using maximum likelihood estimation based on the logistic function, as opposed to ordinary least squares (linear probability models). A major advantage of logistic regression for dichotomous dependent variables is that it overcomes the inherent heteroskedasticity (that is, nonconstant variance) associated with linear probability models, which is often a special concern for student data like ours.
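
The logistic function at the core of this model maps any real-valued linear score to a probability in (0, 1), which is what keeps predicted attrition probabilities well-behaved where a linear probability model can stray outside that range. A minimal Scala sketch of this idea:

```scala
object LogisticSketch {
  // Logistic (sigmoid) function: maps a real-valued linear score to (0, 1).
  def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    // A linear probability model could predict values outside [0, 1];
    // the logistic transform cannot.
    val scores = Seq(-4.0, 0.0, 4.0)
    scores.foreach { z =>
      println(f"score = $z%5.1f -> probability = ${logistic(z)}%.3f")
    }
  }
}
```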

Preparing for coding

As before, in MLlib, for logistic regression, we will use the following:

// Note: LogisticRegressionWithSGD supports binary labels only and does not
// expose setNumClasses; the LBFGS variant does, and usually converges faster:
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingData)
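
Once trained, the model can score held-out students; clearing the default 0.5 decision threshold makes predict return raw attrition probabilities rather than 0/1 labels, which is handy for ranking students by risk. A sketch, assuming trainingData and testData are RDD[LabeledPoint] splits prepared as in Chapter 6, and using the LBFGS variant of logistic regression, which exposes setNumClasses:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train on labeled student records (label 1.0 = left, 0.0 = stayed),
// then score the held-out students with raw probabilities.
def scoreStudents(trainingData: RDD[LabeledPoint],
                  testData: RDD[LabeledPoint]): RDD[(Double, Double)] = {
  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(trainingData)
  model.clearThreshold() // predict() now returns probabilities in (0, 1)
  testData.map(p => (model.predict(p.features), p.label))
}
```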

Decision trees

As discussed briefly in Chapter 6, Churn Prediction on Spark, compared to regression, decision trees are easy to use, robust to missing data, and easy to interpret. Here, our main reason for using decision trees is their robustness to missing data, which is a big issue in this real-world use case. Decision tree models also produce clear charts showing how various features drive a student to leave, so they are very useful for result interpretation and intervention design.

A random forest is an ensemble of decision trees, often hundreds of them, with good facilities for producing scores and for ranking independent variables by their impact on the target variable. For these two reasons, we will also use random forests for this case.

Preparing for coding

As before, within MLlib, we can use the following code:

import org.apache.spark.mllib.tree.DecisionTree

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() // treat all features as continuous
val impurity = "gini"
val maxDepth = 6
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)
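
Since interpretability is one of our reasons for choosing decision trees, it is worth noting that the trained model can print its full rule set, which feeds directly into the charts and intervention design mentioned above. A small sketch, assuming the model variable trained in the snippet above:

```scala
// Inspect the learned tree: size, depth, and the full if/else split rules.
println(s"Learned a tree with ${model.numNodes} nodes, depth ${model.depth}")
println(model.toDebugString) // human-readable split rules, one branch per line
```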

We need to expand our work to random forest, so with MLlib, we will use the following code for this:

// Train a RandomForest model.
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy

val treeStrategy = Strategy.defaultStrategy("Classification")
val numTrees = 300
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val model = RandomForest.trainClassifier(trainingData,
  treeStrategy, numTrees, featureSubsetStrategy, seed = 12345)
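
With the models trained, the question posed at the start of this section, which model is best, can be answered on held-out data; area under the ROC curve is a common metric for a binary target such as attrition. A sketch, assuming a testData RDD[LabeledPoint] held out from training and the random forest model above:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Score the held-out students and compute AUC; repeat per model to compare.
def evaluateAUC(model: RandomForestModel,
                testData: RDD[LabeledPoint]): Double = {
  val scoreAndLabel = testData.map(p => (model.predict(p.features), p.label))
  new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()
}
```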