Chapter 4. Fraud Detection on Spark

In Chapter 1, Spark for Machine Learning, we discussed how to get the Apache Spark system ready, and in Chapter 2, Data Preparation for Spark ML, we listed detailed instructions for data preparation. Now, in Chapters 4 to 6, we will move on to a new stage of utilizing Apache Spark-based systems to turn data into insights for specific projects: fraud detection in this chapter, risk modeling in Chapter 5, Risk Scoring on Spark, and churn prediction in Chapter 6, Churn Prediction on Spark.

Specifically, in this chapter, we will review machine learning methods and analytical processes for a fraud detection project, and discuss how Apache Spark makes them easy and fast. At the same time, using a real-life fraud detection example, we will illustrate our step-by-step process of obtaining fraud insights from big data. We will cover the following topics:

  • Spark for fraud detection
  • Methods of fraud detection
  • Feature preparation
  • Model estimation
  • Model evaluation
  • Result explanation
  • Deploying fraud detection

Spark for fraud detection

In this section, we will start with a real business case of fraud detection to further illustrate our step-by-step machine learning process and then describe how to prepare Spark for this fraud detection project.

The use case

The ABC Corporation is a billion-dollar company that processes payments for thousands of clients in many industries, including real estate and vacation travel. The company has experienced many kinds of fraud, which have cost it a great deal, and most of this fraud occurred online.

In order to prevent fraud, the company has collected a large amount of data on its clients, covering payment processing transactions as well as each client's past online activities. The company has also purchased a lot of third-party data about the computer devices and bank accounts its clients use.

For this project, our unit of analysis can be an individual company or person (an ABC client), or it can be a payment transaction. In real practice, we performed modeling on both; here, however, we will focus our analytics on transactions. Therefore, in terms of data and features, for each online transaction we have web log data, data about its owner/user, and data on the computer devices and bank accounts used.

In practice, the ABC company wants to quickly score each transaction by its likelihood of fraud and to immediately stop a transaction if it is highly suspicious. The company also wants to identify suspicious clients before approving them. In other words, the company needs to utilize fraud detection systems for underwriting as well as for real-time transaction monitoring. For this exercise, we will focus on scoring each transaction with a suspicion score, or fraud likelihood score, and using this score to monitor all transactions so that the ABC Corporation can take action to stop potential fraud.

To sum up, for this project we have fraud as the target variable, and for each transaction we have web log data plus account, computing device, and user data.
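
To make this data structure concrete, the following is a minimal SparkR sketch of how a per-transaction modeling table could be assembled. The table names (transactions, weblog_features, accounts, devices, users) and join keys are hypothetical placeholders, not ABC's actual schema, and a Spark 2.x SparkR session is assumed:

library(SparkR)

# Hypothetical tables created from the imported datasets; names and columns are placeholders
transactions <- sql("SELECT * FROM transactions")     # one row per payment transaction, including the fraud label
weblog_feats <- sql("SELECT * FROM weblog_features")  # per-transaction features extracted from the web logs
accounts     <- sql("SELECT * FROM accounts")         # bank account attributes
devices      <- sql("SELECT * FROM devices")          # computer device attributes
users        <- sql("SELECT * FROM users")            # client (company or person) attributes

# Left-join everything onto the transactions so that we keep one row per transaction
modeling_df <- join(transactions, weblog_feats,
                    transactions$transaction_id == weblog_feats$transaction_id, "left_outer")
modeling_df <- join(modeling_df, accounts, modeling_df$account_id == accounts$account_id, "left_outer")
modeling_df <- join(modeling_df, devices,  modeling_df$device_id  == devices$device_id,  "left_outer")
modeling_df <- join(modeling_df, users,    modeling_df$user_id    == users$user_id,      "left_outer")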

Through some preliminary analysis, the company understands some of its data challenges, as follows:

  • Data is not ready to use; the web log data especially needs to be extracted into features ready for modeling (see the sketch after this list)
  • There are many kinds of fraud cases for each payment transaction service, with very different behaviors
  • There is less information for some new and less active clients
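
To illustrate the first challenge, extracting features from raw web log data, here is a minimal SparkR sketch that aggregates raw log events into one feature row per transaction. The raw_weblog table and its columns (transaction_id, event_time, page_url, ip_address) are hypothetical placeholders, and a Spark 2.x SparkR session is assumed:

library(SparkR)

# Hypothetical raw web log table; one row per logged event
raw_logs <- sql("SELECT * FROM raw_weblog")

# Aggregate the raw events into one feature row per transaction
weblog_features <- agg(
  groupBy(raw_logs, raw_logs$transaction_id),
  n_events    = n(raw_logs$transaction_id),          # number of logged events
  n_pages     = countDistinct(raw_logs$page_url),    # distinct pages visited
  n_ips       = countDistinct(raw_logs$ip_address),  # distinct IP addresses observed
  first_event = min(raw_logs$event_time),
  last_event  = max(raw_logs$event_time)
)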

Distributed computing

As in the previous chapter, parallel computing is needed for our project due to the many kinds of fraud involved, for which we should set up clusters and worker nodes as before.

Let's assume we continue to work within the Databricks environment:

[Figure: the Databricks main menu]

Then, we will need to go to the preceding main menu, click on Clusters, create a name for the cluster, select the newest version of Spark, and specify the number of workers.

Once the clusters are created, we can go back to the main menu illustrated earlier, click on the down arrow to the right of Tables, and select Create Tables to import all of our cleaned and prepared datasets, following the instructions discussed in Chapter 2, Data Preparation for Spark ML.

For this project, we will need to import a large amount of web log data, as well as structured data about the individual users or companies, the computer devices used, and the bank accounts used.
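
As a complement to the Create Tables menu, data that already sits in cloud or DBFS storage can also be read directly from an R notebook. The following is a minimal sketch, assuming a Spark 2.x SparkR session; the file paths and formats are hypothetical placeholders:

library(SparkR)

# Hypothetical paths; point these at wherever the prepared datasets from Chapter 2 were written
users_df    <- read.df("/mnt/fraud/users.parquet", source = "parquet")
accounts_df <- read.df("/mnt/fraud/accounts.csv", source = "csv",
                       header = "true", inferSchema = "true")

printSchema(users_df)    # confirm the imported schema looks as expected
count(accounts_df)       # quick row-count sanity check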

As before, in Apache Spark we need to direct workers to complete the computation on each node, for which we will use a scheduler on Databricks to run our R notebook computation and then collect the results.

Here too, we will continue to take an R notebook approach.

In the Databricks environment, setting up a notebook requires us to go to the following menu:

[Figure: the Databricks main menu]

In the preceding main menu, click on the down arrow to the right of Workspace and select Create -> New Notebook to create a notebook.
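
After the notebook has been created, a first cell along the following lines can confirm that SparkR is available and that the imported tables are visible. This is a sketch assuming Spark 2.x SparkR; on Databricks, a Spark session is usually already attached to the notebook:

# Load SparkR and attach to (or start) a Spark session
library(SparkR)
sparkR.session(appName = "fraud-detection")

# List the tables imported earlier to confirm they are visible from this notebook
head(sql("SHOW TABLES"))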

If users do not want to use the R notebook provided by Databricks, one option is to use Zeppelin. To build a free notebook on Spark using Zeppelin, go to:

http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/.
