Dataset

We'll work with a dataset describing insurance transactions, which is publicly available in the Oracle Database online documentation at http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/anomalies.htm.

The dataset describes insurance claims on vehicle incidents for an undisclosed insurance company. It contains 15,430 claims; each claim has 33 attributes, which fall into the following groups:

Customer demographic details (Age, Sex, MaritalStatus, and so on)
Purchased policy (PolicyType, VehicleCategory, number of supplements, agent type, and so on)
Claim circumstances (day/month/week claimed, policy report filed, witness present, number of days between the incident and the policy report, number of days between the incident and the claim, and so on)
Other customer data (number of cars, previous claims, DriverRating, and so on)
Fraud found (yes or no)

The following screenshot shows a sample of the dataset after it has been loaded into Weka:

Now, the task is to create a model that will be able to identify suspicious claims in the future. The challenge is that only 6% of the claims are suspicious. A dummy classifier that says no claim is suspicious would be accurate in 94% of cases, yet it would never detect a single fraud. Plain accuracy is therefore misleading here, so in this task we will use two other performance measures: precision and recall.
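The arithmetic behind the dummy classifier can be checked with a minimal sketch. The claim total comes from the text; the fraud count is an assumption derived from the rounded 6% figure, so it approximates the real dataset rather than reproducing it.

```python
# Illustrative figures only: the text gives 15,430 claims and roughly 6% fraud,
# so the fraud count below is an approximation, not the exact dataset value.
total_claims = 15430
fraud_claims = round(0.06 * total_claims)  # assumed from the rounded 6%

# A dummy classifier that labels every claim "no fraud" is right on every
# legitimate claim and wrong on every fraudulent one.
correct = total_claims - fraud_claims
accuracy = correct / total_claims

# It never raises an alarm, so it catches none of the fraudulent claims.
frauds_caught = 0

print(f"accuracy = {accuracy:.2%}, frauds caught = {frauds_caught}")
```

The high accuracy comes entirely from the class imbalance, which is why it says nothing about the classifier's ability to find fraud.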

Let's recall the outcome table from Chapter 1, Applied Machine Learning Quick Start, where there are four possible outcomes, denoted as true positive, false positive, false negative, and true negative:

                        Classified as
Actual      Fraud                   No fraud
Fraud       TP - true positive      FN - false negative
No fraud    FP - false positive     TN - true negative
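As a rough illustration of how the four cells of the outcome table are filled, the following sketch counts them from a list of actual labels and a list of predicted labels. The function name `confusion_counts` and the six sample labels are hypothetical, made up purely for this example.

```python
def confusion_counts(actual, predicted, positive="fraud"):
    """Return (TP, FN, FP, TN) for two equal-length label sequences."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # fraud correctly flagged
        elif a == positive:
            fn += 1  # fraud missed
        elif p == positive:
            fp += 1  # false alarm on a legitimate claim
        else:
            tn += 1  # legitimate claim correctly passed
    return tp, fn, fp, tn


# Made-up labels for six claims, purely for illustration.
actual = ["fraud", "fraud", "no fraud", "no fraud", "no fraud", "fraud"]
predicted = ["fraud", "no fraud", "fraud", "no fraud", "no fraud", "fraud"]

counts = confusion_counts(actual, predicted)
print(counts)
```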

Precision and recall are defined as follows:

Precision is equal to the proportion of raised alarms that are correct, as follows:

$$Pr = \frac{TP}{TP + FP}$$

Recall is equal to the proportion of deviant signatures (here, fraudulent claims) that are correctly identified, as follows:

$$Re = \frac{TP}{TP + FN}$$

With these measures, our dummy classifier scores Pr = 0 and Re = 0, as it never marks any instance as fraud (TP = 0). Strictly speaking, its precision is 0/0, which is taken to be 0 by convention. In practice, we want to compare classifiers using both numbers; hence, we use the F-measure, a de facto standard that calculates the harmonic mean of precision and recall, as follows:

$$F = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$$
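The three measures can be sketched in a few lines, using the standard definitions in terms of TP, FP, and FN. The helper name `precision_recall_f`, the choice to map zero denominators to 0, and the counts passed in are assumptions made for illustration; the fraud count of 926 is only an approximation from the rounded 6% figure.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from outcome counts."""
    # Zero denominators are mapped to 0.0, a common convention (assumption).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_measure


# Dummy classifier: never raises an alarm, so TP = FP = 0 and every fraud is missed.
dummy = precision_recall_f(tp=0, fp=0, fn=926)

# A hypothetical classifier with made-up counts, for comparison.
example = precision_recall_f(tp=60, fp=40, fn=20)

print("dummy:  ", dummy)
print("example:", example)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot reach a high F-measure by excelling at precision or recall alone.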

Now, let's move on to designing a real classifier.
