After completing the model evaluation stage and selecting the estimated and evaluated model as our final model, our next task is to interpret the results for the company's executives and technicians.
Here, we focus on explaining the results, with particular attention to the most influential variables.
As briefly discussed earlier, data quality and freshness vary considerably across datasets. Each dataset has its own weakness, as summarized in the following table:
| Category | Weakness |
|---|---|
| Web log | Incomplete |
| Account | Old |
| Computer device | Incomplete |
| User | Old |
| Business | Incomplete and old |
Because of the preceding issues, we often do not have enough data to score each transaction, cannot score it with good accuracy, or can only score it after the fact. For this reason, the company hopes to identify special signals or insights that can be acted on quickly and easily.
The following briefly summarizes some sample results produced with functions from the `randomForest` package and decision trees. With the `randomForest` package in R, a simple call such as `estimatedModel$importance` returns a ranking of variables by their importance in determining fraud.
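As an illustration of the same idea outside R, here is a minimal Python sketch using scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The feature names mirror those in the table below, but the data is entirely synthetic and hypothetical, constructed so that click speed dominates the ranking; it is not the book's actual dataset or code.

```python
# Hypothetical sketch: ranking features by random forest importance.
# Data is synthetic; feature names only mirror the table in the text.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for the three features (hypothetical values)
click_speed = rng.normal(size=n)
account_age = rng.normal(size=n)
device_score = rng.normal(size=n)

# Make the fraud label depend mostly on click speed,
# so the importance ranking has a clear winner
fraud = (click_speed + 0.3 * account_age
         + 0.1 * rng.normal(size=n) > 1).astype(int)

X = np.column_stack([click_speed, account_age, device_score])
names = ["ClickSpeed", "Account", "ComputerDevice"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, fraud)

# Sort features from most to least important
ranking = sorted(zip(names, model.feature_importances_),
                 key=lambda t: -t[1])
for name, imp in ranking:
    print(f"{name}: {imp:.3f}")
```

In scikit-learn the importances are normalized to sum to one, whereas `estimatedModel$importance` in R reports raw impurity-decrease (and optionally permutation) measures; the resulting ranking plays the same role in both cases.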
Table for impact assessment:

| Feature | Impact rank |
|---|---|
| Click speed | 1 |
| Account | 2 |
| ComputerDevice | 3 |
However, obtaining variable importance through the `randomForest` functions requires a fully estimated model built on complete data. So, it does not really solve our problem.
What the customer actually needs is to estimate a model from a partial set of available features and then assess how good this partial model is, that is, to report its fraud catching ratio and false positive ratio. To complete this task, we take advantage of Apache Spark's fast computation, which helps produce these results quickly.
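The evaluation logic for such a partial model can be sketched as follows. This is a hypothetical, self-contained Python example with synthetic data (the production work described above runs on Apache Spark): it trains a random forest on only the feature columns assumed to be available at scoring time, then computes the fraud catching ratio (recall) and the false positive ratio on held-out data.

```python
# Hypothetical sketch: evaluate a model trained on a PARTIAL feature set,
# reporting fraud catch rate (recall) and false positive rate.
# All data here is synthetic; column indices are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X_full = rng.normal(size=(n, 5))          # 5 features in the full dataset
y = (X_full[:, 0] + 0.5 * X_full[:, 1] > 1).astype(int)  # synthetic fraud label

available = [0, 1]            # indices of features available early (assumption)
X_part = X_full[:, available] # the partial feature set actually used

X_tr, X_te, y_tr, y_te = train_test_split(
    X_part, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# Confusion-matrix counts on the held-out set
tp = np.sum((pred == 1) & (y_te == 1))
fn = np.sum((pred == 0) & (y_te == 1))
fp = np.sum((pred == 1) & (y_te == 0))
tn = np.sum((pred == 0) & (y_te == 0))

catch_rate = tp / (tp + fn)           # fraud catching ratio (recall)
false_positive_rate = fp / (fp + tn)  # false positive ratio

print(f"catch rate: {catch_rate:.2f}, "
      f"false positive rate: {false_positive_rate:.2f}")
```

Comparing these two ratios across different partial feature sets is what tells the business whether a quick, early score is good enough to act on.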