As discussed before, MLlib supports model export to Predictive Model Markup Language (PMML). For this project, we exported some of the developed models to PMML because other departments of the university are interested in our analytical results and use other systems, such as SPSS.
However, for practical purposes, the users of this project are more interested in rule-based decision making, to apply some of our insights, and in score-based decision making, to reduce student attrition.
Specifically, the client wants to apply our results to decide, first, which interventions (a combination of course adjustments and counseling services) to use with a particular student segment, and second, when the university should start an intervention, based on the student attrition score.
Therefore, we need to turn some of our results into rules and also produce a student attrition risk score for the university.
The decision tree algorithms in both MLlib and R produce trees directly, so users can derive rules from these trees. Also, as discussed before, for R results there are several tools that help extract rules from the developed predictive models.
For the decision tree model developed, we can use the rpart.utils
R package, which extracts rules and exports them in various formats, for example through RODBC.
Calling rpart.rules.table(model1)
returns an unpivoted table of the variable values (factor levels) associated with each branch, that is, the sub-rules to be used.
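To illustrate the idea of flattening tree branches into rules (this is a conceptual sketch, not rpart.utils itself; the tree structure, feature names, and thresholds below are invented):

```python
# Hypothetical sketch: flatten a small decision tree into human-readable rules.
# The tree, feature names, and thresholds are illustrative, not from our models.

def extract_rules(node, conditions=()):
    """Recursively collect one rule per leaf as (condition string, label)."""
    if "label" in node:  # leaf node: emit the accumulated branch conditions
        return [(" AND ".join(conditions), node["label"])]
    rules = []
    feat, thr = node["feature"], node["threshold"]
    rules += extract_rules(node["left"], conditions + (f"{feat} <= {thr}",))
    rules += extract_rules(node["right"], conditions + (f"{feat} > {thr}",))
    return rules

tree = {
    "feature": "gpa", "threshold": 2.0,
    "left": {"label": "ATTRITION"},
    "right": {
        "feature": "credits_attempted", "threshold": 12,
        "left": {"label": "ATTRITION"},
        "right": {"label": "NOT"},
    },
}

for cond, label in extract_rules(tree):
    print(f"IF {cond} THEN {label}")
```

Each leaf of the tree becomes one rule whose conditions are the tests on the path from the root, which is essentially what the R tooling automates.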
However, for this project, partly due to the issue of data incompleteness, it is better to derive some rules directly from the insights discussed in the last section.
From an analytical perspective, one of the main issues here is to minimize false positives while still catching enough of the students likely to leave.
The university had a high false positive ratio with its past rules; as a result, too many alerts were sent out, creating a heavy manual inspection burden. Therefore, taking advantage of Spark's fast computation, we carefully produced rules and supplied a false positive ratio for each rule, which helped the university apply the rules and also provide useful feedback.
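The per-rule bookkeeping can be sketched in plain Python (the rule, field names, and student records below are made up for illustration; in practice this aggregation ran on Spark over the full history):

```python
# Hypothetical sketch: evaluate a candidate rule against labeled history and
# report its false positive ratio. Field names and records are illustrative.

def rule_low_gpa(student):
    """Flag students whose GPA fell below 2.0 as attrition risks."""
    return student["gpa"] < 2.0

def false_positive_ratio(rule, students):
    """Among students the rule flags, the fraction that did NOT leave."""
    flagged = [s for s in students if rule(s)]
    if not flagged:
        return 0.0
    false_alarms = sum(1 for s in flagged if not s["attrited"])
    return false_alarms / len(flagged)

history = [
    {"gpa": 1.5, "attrited": True},
    {"gpa": 1.8, "attrited": False},
    {"gpa": 3.2, "attrited": False},
    {"gpa": 1.9, "attrited": True},
]

print(false_positive_ratio(rule_low_gpa, history))  # 1 false alarm among 3 flagged
```

Attaching this number to each shipped rule lets the university see up front how many alerts from that rule are likely to be false alarms.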
From the coefficients of our predictive models, we can derive a probability score for attrition, although this takes some work.
Using the following MLlib code, we can quickly obtain prediction and label pairs:
// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}
The preceding code returns predicted labels. For binary classification, however, you can call the LogisticRegressionModel.clearThreshold
method; after it is called, predict
will return raw scores. Unlike the labels mentioned before, these scores are in the [0, 1] range and can be interpreted as probabilities.
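Since the score ultimately comes from the model coefficients, a minimal pure-Python sketch of how logistic regression coefficients map to a [0, 1] score may help (the weights, intercept, and feature values below are invented for illustration):

```python
import math

# Hypothetical sketch: how logistic regression coefficients turn into a
# probability score. Coefficients and feature values are illustrative only.

def attrition_score(weights, intercept, features):
    """Apply the logistic (sigmoid) function to the linear combination."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

weights = [-1.2, 0.8]   # e.g., coefficients for (gpa, missed_classes)
intercept = 1.5
print(attrition_score(weights, intercept, [3.5, 2.0]))
```

This is the same computation MLlib performs internally once the threshold is cleared: the raw score is the sigmoid of the weighted feature sum, hence always in [0, 1].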
Using R, model$predicted
will return the predicted class, ATTRITION
or NOT
. However, prob=predict(model,x,type="prob")
will produce a probability value, which can be used directly as a score.
However, in order to use the score, we need to select a cutoff. For example, we can choose to take action when the attrition probability score is over 0.8 (that is, 80%).
Different cutoff points will produce different false positive ratios and different rates of catching possible attrition, so the users need to decide how to balance the two.
Taking advantage of Spark's fast computation, these results can be recalculated quickly, which allows the university to select a cutoff interactively and change it whenever needed.
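The trade-off can be sketched by sweeping candidate cutoffs over scored students (the scores and outcomes below are invented for illustration, not real project data):

```python
# Hypothetical sketch: sweep candidate cutoffs over scored students to see the
# trade-off between the false positive ratio and the attrition capture rate.

scored = [  # (attrition probability score, actually attrited?)
    (0.95, True), (0.90, True), (0.85, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]

def cutoff_stats(cutoff, data):
    """Return (false positive ratio, capture rate) at the given cutoff."""
    flagged = [(s, y) for s, y in data if s >= cutoff]
    attrited = sum(1 for _, y in data if y)
    caught = sum(1 for _, y in flagged if y)
    fp_ratio = (len(flagged) - caught) / len(flagged) if flagged else 0.0
    capture = caught / attrited if attrited else 0.0
    return fp_ratio, capture

for cutoff in (0.5, 0.7, 0.8):
    fp, cap = cutoff_stats(cutoff, scored)
    print(f"cutoff {cutoff:.1f}: false positive ratio {fp:.2f}, capture {cap:.2f}")
```

A higher cutoff lowers the false positive ratio but also catches fewer of the students who actually leave; presenting a small table like this for each candidate cutoff is what lets the users make the balancing decision themselves.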
Another way to deal with this issue is to use the OptimalCutpoints
R package, which computes optimal cutoff points for diagnostic scores.