In this recipe, we will implement decision making to guide direct marketing in a company. We will use data from a real-world business problem: information on customers of an insurance company, supplied by the Dutch data-mining company Sentient Machine Research. It consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes. All customers living in areas with the same zip code have the same socio-demographic attributes. The training set contains 5822 customer descriptions, while the test set contains 4000. The goal is to predict which of the customers in the test set will buy a caravan insurance policy.
Here is the reference to the TIC Benchmark / CoiL Challenge 2000 data (http://www.liacs.nl/~putten/library/cc2000/):
P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000.
The dataset is available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)).
The attributes and their values are described in the TicDataDescr.txt file. The dataset consists of several separate files with one instance per line and tab-delimited fields:

TICDATA2000.txt: Dataset to train and validate prediction models (5822 customer records). Attribute 86, CARAVAN: Number of mobile home policies, is the target variable.

TICEVAL2000.txt: Dataset for predictions (4000 customer records). It has the same format as TICDATA2000.txt, only the target is missing.

TICTGTS2000.txt: Targets for the evaluation set.

The plan is as follows. First, we will load the coil-train.arff dataset into Weka; it was preprocessed from TICDATA2000.txt so that we can start the recipe immediately. Then, we will build several prediction models and evaluate them with 10-fold cross-validation to obtain preliminary performance estimates, selecting the best one. Finally, for the sake of completeness, we will use the best classifier to predict the actual values in the coil-test.arff file, although this wouldn't be possible in the real world:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class DirectMarketing {
  public static void main(String[] args) throws Exception {
    Instances trainData = new Instances(new BufferedReader(
        new FileReader("dataset/coil-train.arff")));
    trainData.setClassIndex(trainData.numAttributes() - 1);

    J48 j48 = new J48();
    j48.setOptions(new String[]{
        "-C", "0.25", // set confidence factor
        "-M", "2"     // set min num of instances in leaf nodes
    });
    double precision = crossValidation(j48, trainData);
    System.out.println(precision);

    NaiveBayes nb = new NaiveBayes();
    precision = crossValidation(nb, trainData);

    nb.buildClassifier(trainData);

    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("dataset/coil-test.arff"));
    Instances testData = loader.getStructure();
    testData.setClassIndex(trainData.numAttributes() - 1);

    Instance current;
    while ((current = loader.getNextInstance(testData)) != null) {
      double cls = nb.classifyInstance(current);
      System.out.println(cls);
    }
  }

  public static double crossValidation(Classifier cls, Instances data)
      throws Exception {
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(cls, data, 10, new Random(1));
    System.out.println(eval.toSummaryString(false));
    System.out.println(eval.precision(1));
    return eval.precision(1);
  }
}
And that's it.
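If you would rather build the ARFF input from the raw files yourself, Weka's CSVLoader and ArffSaver can do the conversion. The following is a minimal sketch, assuming a recent Weka (3.7+, for setNoHeaderRowPresent); the tiny in-memory tab-delimited sample and temporary file names are illustrative stand-ins for the real 86-attribute TICDATA2000.txt:

```java
import java.io.File;
import java.io.PrintWriter;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class TicToArff {
  public static void main(String[] args) throws Exception {
    // Tiny tab-delimited sample standing in for TICDATA2000.txt (no header row)
    File raw = File.createTempFile("ticdata", ".txt");
    try (PrintWriter out = new PrintWriter(raw)) {
      out.println("33\t1\t3\t0");
      out.println("37\t1\t2\t1");
    }

    CSVLoader loader = new CSVLoader();
    loader.setFieldSeparator("\t");      // TICDATA fields are tab delimited
    loader.setNoHeaderRowPresent(true);  // the raw file carries no header line
    loader.setFile(raw);
    Instances data = loader.getDataSet();

    // Write the loaded instances back out in ARFF format
    ArffSaver saver = new ArffSaver();
    saver.setInstances(data);
    saver.setFile(File.createTempFile("coil-train-sample", ".arff"));
    saver.writeBatch();

    System.out.println(data.numInstances()); // 2
  }
}
```

On the real file you would additionally convert attribute 86 into a nominal class, for example with the weka.filters.unsupervised.attribute.NumericToNominal filter.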
Now, let's take a closer look at the code. First, make the following imports:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class DirectMarketing {
  public static void main(String[] args) throws Exception {
Next, load the train data:
Instances trainData = new Instances(new BufferedReader(
    new FileReader("dataset/coil-train.arff")));
trainData.setClassIndex(trainData.numAttributes() - 1);
Build the decision tree and the Naive Bayes classifier, and evaluate them on the training data with 10-fold cross-validation:
J48 j48 = new J48();
j48.setOptions(new String[]{
    "-C", "0.25", // set confidence factor
    "-M", "2"     // set min num of instances in leaf nodes
});
double precision = crossValidation(j48, trainData);
System.out.println(precision);

NaiveBayes nb = new NaiveBayes();
precision = crossValidation(nb, trainData);
Naive Bayes achieves the better performance, so we train it on the full training set and use it to classify the unseen data:
nb.buildClassifier(trainData);
Load the test data:
ArffLoader loader = new ArffLoader();
loader.setFile(new File("dataset/coil-test.arff"));
Instances testData = loader.getStructure();
testData.setClassIndex(trainData.numAttributes() - 1);
Classify each instance and output results:
Instance current;
while ((current = loader.getNextInstance(testData)) != null) {
  double cls = nb.classifyInstance(current);
  System.out.println(cls);
}
}
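In a real campaign, the budget usually covers only the most promising prospects, so instead of the hard 0/1 labels returned by classifyInstance it is often more useful to obtain class probabilities from distributionForInstance and rank customers by them. The following standalone sketch illustrates this on an invented two-attribute dataset; the income attribute and its toy values are purely illustrative stand-ins for the 86 COIL attributes:

```java
import java.util.ArrayList;
import java.util.Locale;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class RankProspects {
  public static void main(String[] args) throws Exception {
    // Invented two-attribute stand-in for the 86-attribute COIL data
    ArrayList<Attribute> attrs = new ArrayList<>();
    attrs.add(new Attribute("income"));
    ArrayList<String> classVals = new ArrayList<>();
    classVals.add("0"); // no caravan policy
    classVals.add("1"); // caravan policy
    attrs.add(new Attribute("caravan", classVals));

    Instances train = new Instances("toy", attrs, 0);
    train.setClassIndex(1);
    double[][] rows = {{20, 0}, {25, 0}, {60, 1}, {70, 1}};
    for (double[] r : rows) {
      train.add(new DenseInstance(1.0, r));
    }

    NaiveBayes nb = new NaiveBayes();
    nb.buildClassifier(train);

    // Score unseen customers; in this toy data, higher income means higher P(buy)
    double[] incomes = {22, 65, 40};
    for (double income : incomes) {
      DenseInstance cust = new DenseInstance(1.0, new double[]{income, 0});
      cust.setDataset(train); // attach the header so Weka knows the schema
      double[] dist = nb.distributionForInstance(cust);
      System.out.printf(Locale.ROOT, "income=%.0f P(buy)=%.3f%n",
          income, dist[1]);
    }
  }
}
```

Sorting customers by the printed probability and taking the top N gives a direct-marketing target list.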
Here is the helper method that performs 10-fold cross-validation:
public static double crossValidation(Classifier cls, Instances data) throws Exception{
Initialize a new Evaluation object, run 10-fold cross-validation, and print the summary:
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(cls, data, 10, new Random(1));
System.out.println(eval.toSummaryString(false));
Since the dataset is very imbalanced, use precision for the target class, yes, as an evaluation measure:
System.out.println(eval.precision(1));
return eval.precision(1);
}
}
The output precision shows how accurately the model predicts who will respond to the campaign, that is, what fraction of the customers it flags as buyers actually buy a policy.
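To see why plain accuracy would be misleading on such an imbalanced set, consider a degenerate model that simply predicts no for every customer. With roughly 6% buyers in the training data (the exact confusion-matrix counts below are illustrative), accuracy looks excellent while precision exposes that not a single buyer is found:

```java
import java.util.Locale;

public class PrecisionVsAccuracy {
  public static void main(String[] args) {
    // Illustrative confusion-matrix counts for 5822 records, ~6% buyers,
    // with a model that always predicts "no":
    int tp = 0, fp = 0, fn = 348, tn = 5474;

    double accuracy = (tp + tn) / (double) (tp + fp + fn + tn);
    double precision = (tp + fp) == 0 ? 0.0 : tp / (double) (tp + fp);

    System.out.printf(Locale.ROOT, "accuracy=%.3f precision=%.3f%n",
        accuracy, precision);
    // prints: accuracy=0.940 precision=0.000
  }
}
```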