In the previous section, we described our business use case and prepared our Spark computing platform and our datasets. In this section, we select the analytical methods, or predictive models (equations), for this fraud detection project; that is, we map our business use case to machine learning methods.
Both supervised and unsupervised machine learning are commonly used for fraud detection. For this case, however, we will use supervised machine learning, because we have good data for our target variable (fraud) and because our practical goal is to reduce fraud while keeping legitimate business transactions flowing.
Many models are suitable for modeling and predicting fraud, including logistic regression and decision trees. Selecting one among them can be difficult, as the best choice depends on the data at hand. One solution is to run all the candidate models first and then select the best ones using model evaluation metrics. In many situations, after applying evaluation methods, there is no single best model but several good ones. In that case, we can ensemble them to improve our model's performance.
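To sketch the ensembling idea, a simple majority vote over the class predictions of several trained models can be written as follows. Note that `majorityVote` is a hypothetical helper of our own, not part of Spark MLlib:

```scala
// Simple majority vote over the 0.0/1.0 class predictions of several models.
// `majorityVote` is a hypothetical helper, not part of Spark MLlib.
def majorityVote(predictions: Seq[Double]): Double =
  predictions.groupBy(identity).maxBy(_._2.size)._1

// Three classifiers vote 1.0, 0.0, 1.0, so the ensemble predicts 1.0
val ensemblePrediction = majorityVote(Seq(1.0, 0.0, 1.0))
```

More sophisticated ensembles weight each model's vote by its evaluation score, but a plain majority vote is often a reasonable baseline.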
To adopt the preceding strategy, we would need to develop several models, such as neural networks, logistic regression, SVMs, and decision trees. For this exercise, however, we will focus our effort on Random forest and decision trees, both to demonstrate machine learning on Apache Spark and to show how these methods meet the special needs of this use case.
As always, once we finalize our choice of analytical methods or models, we need to prepare the related dependent variable and prepare for coding.
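Concretely, for the MLlib models used below, this preparation means assembling the fraud flag and the features into `LabeledPoint` records and splitting them into training and test sets. The sketch below assumes an existing `SparkContext` named `sc` and a hypothetical CSV file whose first column is the fraud label; the file name and column layout are illustrative only:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical input: a CSV where column 0 is the fraud flag
// (1.0 = fraud, 0.0 = legitimate) and the remaining columns are
// numeric features.
val data = sc.textFile("fraud_data.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

// Hold out 30% of the records for model evaluation
val Array(trainingData, testData) =
  data.randomSplit(Array(0.7, 0.3), seed = 12345)
```

The fixed seed makes the split reproducible across runs, which helps when comparing models against each other.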
Random forest is a popular machine learning method because its interpretation is intuitive and it usually produces good results. There are implementations in R, Java, and other languages, so the preparation is relatively easy.
// Train a RandomForest model
val treeStrategy = Strategy.defaultStrategy("Classification")
val numTrees = 300
val featureSubsetStrategy = "auto" // Let the algorithm choose
val model = RandomForest.trainClassifier(trainingData, treeStrategy,
  numTrees, featureSubsetStrategy, seed = 12345)
For a good example of using MLlib for Random forest, go to https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html.
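Once trained, the model should be checked against held-out data. The snippet below assumes the `model` from the training code above and a hypothetical `testData` RDD of `LabeledPoint` records set aside before training:

```scala
// Compare predicted labels with the true labels on the test set
val labelsAndPredictions = testData.map { point =>
  (point.label, model.predict(point.features))
}

// Fraction of test records the model misclassified
val testError = labelsAndPredictions
  .filter { case (label, pred) => label != pred }
  .count.toDouble / testData.count
println(s"Test error = $testError")
```

For a fraud use case, where fraud cases are typically rare, it is worth supplementing raw error with precision and recall on the fraud class.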
In R, we need to use the randomForest package.
For a good example of running random forest on Spark, go to https://spark-summit.org/2014/wp-content/uploads/2014/07/Sequoia-Forest-Random-Forest-of-Humongous-Trees-Sung-Chung.pdf.
A Random forest is built from a set of trees, with good functions to produce scores and to rank independent variables by their impact on the target variable.
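With the newer DataFrame-based spark.ml API (rather than the RDD-based MLlib API used above), a trained model exposes these variable rankings directly through `featureImportances`. A minimal sketch, assuming a hypothetical DataFrame `trainingDF` with `label` and `features` columns:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// trainingDF is a hypothetical DataFrame with "label" and "features" columns
val rf = new RandomForestClassifier()
  .setNumTrees(300)
  .setFeatureSubsetStrategy("auto")
val rfModel = rf.fit(trainingDF)

// A vector with one importance score per feature; the scores sum to 1.0,
// so larger values indicate features with more impact on the fraud label
println(rfModel.featureImportances)
```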
However, averaging the results of hundreds of trees obscures the details, so the explanation given by a single decision tree can still be very intuitive and valuable, as follows:
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 6
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)
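One way to obtain that intuitive explanation is to print the learned split rules: a trained MLlib `DecisionTreeModel` can render itself as nested if/else statements via `toDebugString`:

```scala
// Print the tree as human-readable if/else split rules, for example
// "If (feature 3 <= 500.0) ... Predict: 1.0"
println(s"Learned classification tree:\n${model.toDebugString}")
```

Reading these rules with the business team is a quick sanity check that the splits the model relies on make sense for fraud.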
As for the R notebook on Spark, we will continue to use the rpart package and the rpart function for all the calculations. For rpart, we need to specify the classifier and also all the features to be used.