Methods for fraud detection

In the previous section, we described our business use case and prepared our Spark computing platform and datasets. In this section, we select the analytical methods, or predictive models, for this fraud detection project; in other words, we map our business use case to machine learning methods.

For fraud detection, both supervised and unsupervised machine learning are commonly used. For this case, we will use supervised machine learning because we have good data for our target variable (fraud) and because our practical goal is to reduce fraud while keeping legitimate business transactions flowing.

To model and predict fraud, there are many suitable models, including logistic regression and decision trees. Selecting one among them can be difficult because the best choice depends on the data. One solution is to train all the candidate models and then select the best ones using model evaluation metrics. As is often the case, evaluation may reveal not a single best model but several comparable ones; when that happens, we can ensemble them to improve overall performance.
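
To make this comparison step concrete, here is a minimal sketch of comparing two candidate classifiers by their area under the ROC curve with MLlib's BinaryClassificationMetrics. The variables rfModel, dtModel, and testData (a held-out RDD[LabeledPoint]) are assumptions for illustration, not objects defined in this chapter:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // rfModel and dtModel are assumed to be already-trained classifiers;
    // testData is assumed to be a held-out RDD[LabeledPoint].
    val rfScoresAndLabels = testData.map(p => (rfModel.predict(p.features), p.label))
    val dtScoresAndLabels = testData.map(p => (dtModel.predict(p.features), p.label))

    val rfAUC = new BinaryClassificationMetrics(rfScoresAndLabels).areaUnderROC()
    val dtAUC = new BinaryClassificationMetrics(dtScoresAndLabels).areaUnderROC()

    // Keep the stronger model, or ensemble the two if their scores are close.
    println(s"Random forest AUC = $rfAUC, decision tree AUC = $dtAUC")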

To adopt the strategy just described, we would need to develop several models, such as neural networks, logistic regression, SVM, and decision trees. However, for this exercise, we will focus our effort on Random forest and decision trees, both to demonstrate machine learning on Apache Spark and to show how well these methods meet the special needs of this use case.

As always, once we finalize our choice of analytical methods or models, we need to prepare the dependent variable and prepare for coding.

Random forest

Random forest is a popular machine learning method because its interpretation is intuitive and it usually produces good results. Implementations are available in R, Java, and many other environments, so the preparation is relatively easy.

  • Random forest: Random forest is an ensemble learning method for classification and regression that builds hundreds or more decision trees at training time and then combines their outputs for the final prediction.
  • Preparing the dependent variable: To use Random forest, we need to recode the target variable as 0 versus 1 by transforming FRAUD to 1 and NOT FRAUD to 0 (a recoding sketch follows the code below).
  • Preparing the code: In MLlib, we can use the following code for Random forest:
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.configuration.Strategy

    // Train a RandomForest model on the prepared training data.
    val treeStrategy = Strategy.defaultStrategy("Classification")
    val numTrees = 300
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val model = RandomForest.trainClassifier(trainingData,
      treeStrategy, numTrees, featureSubsetStrategy, seed = 12345)
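
The trainClassifier call above expects trainingData to be an RDD[LabeledPoint] whose label has already been recoded to 0/1. A minimal sketch of that recoding, assuming a hypothetical rawRecords RDD of (label string, feature array) pairs extracted from our dataset, might look like this:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // rawRecords is a hypothetical RDD[(String, Array[Double])] holding the raw
    // label ("FRAUD" / "NOT FRAUD") and the numeric features of each transaction.
    val trainingData = rawRecords.map { case (labelString, features) =>
      val label = if (labelString == "FRAUD") 1.0 else 0.0 // FRAUD -> 1, NOT FRAUD -> 0
      LabeledPoint(label, Vectors.dense(features))
    }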

For a good example of using MLlib for Random forest, go to https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html.

In R, we need to use the randomForest package.

For a good example of running random forest on Spark, go to https://spark-summit.org/2014/wp-content/uploads/2014/07/Sequoia-Forest-Random-Forest-of-Humongous-Trees-Sung-Chung.pdf.

Decision trees

Random forest is built from a collection of trees and is well suited to producing scores and to ranking independent variables by their impact on the target variable.
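
This variable ranking is exposed most directly by the DataFrame-based spark.ml API (rather than the RDD-based MLlib API used in this chapter's snippets) through featureImportances. The following is a small sketch with a made-up toy dataset, purely for illustration:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.linalg.Vectors

    // Toy, made-up data purely for illustration: 1.0 = fraud, 0.0 = not fraud.
    // `spark` is the SparkSession available in spark-shell or a notebook.
    import spark.implicits._
    val toyDF = Seq(
      (1.0, Vectors.dense(5000.0, 1.0)),
      (0.0, Vectors.dense(20.0, 0.0)),
      (1.0, Vectors.dense(7500.0, 1.0)),
      (0.0, Vectors.dense(35.0, 0.0))
    ).toDF("label", "features")

    val toyRfModel = new RandomForestClassifier().setNumTrees(50).fit(toyDF)
    println(toyRfModel.featureImportances) // relative impact of each feature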

However, averaging the results of hundreds of trees obscures the details of individual splits, so the explanation offered by a single decision tree can still be very intuitive and valuable, as follows:

  • Decision tree introduction: A decision tree classifies cases through a sequence of splits, which for our use case means classifying each transaction as fraud or not fraud
  • Prepare the dependent variable: Our target variable has already been coded as fraud versus not fraud, so it is ready for our machine learning
  • Prepare the code: As before, within MLlib, we can use the following code:
    import org.apache.spark.mllib.tree.DecisionTree

    val numClasses = 2                             // fraud versus not fraud
    val categoricalFeaturesInfo = Map[Int, Int]()  // treat all features as continuous
    val impurity = "gini"
    val maxDepth = 6
    val maxBins = 32
    val model = DecisionTree.trainClassifier(trainingData, numClasses,
      categoricalFeaturesInfo, impurity, maxDepth, maxBins)
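
As a usage sketch, the trained tree can then be applied to a held-out testData RDD (assumed here, not defined above) to estimate the misclassification rate, and its decision rules can be printed for inspection:

    // testData is assumed to be a held-out RDD[LabeledPoint].
    val labelsAndPredictions = testData.map { point =>
      (point.label, model.predict(point.features))
    }
    val testError = labelsAndPredictions.filter { case (label, prediction) =>
      label != prediction
    }.count().toDouble / testData.count()
    println(s"Test error = $testError")
    println(model.toDebugString) // print the learned if-then rules of the tree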

As for the R notebook on Spark, we will continue to use the rpart package and its rpart function for all the calculations. For rpart, we need to specify the model formula, that is, the target variable and all the features to be used.
