Data mining in direct marketing (Simple)

In this recipe, we will implement decision making to guide direct marketing in a company. We will use data from a real-world business problem: information on customers of an insurance company, supplied by the Dutch data-mining company Sentient Machine Research. The data consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes. All customers living in areas with the same zip code have the same socio-demographic attributes. The training set contains 5,822 customer descriptions, while the test set contains 4,000 customers. The goal is to predict which of the customers in the test set will buy a caravan insurance policy.

Here is the reference to the TIC Benchmark / CoIL Challenge 2000 data (http://www.liacs.nl/~putten/library/cc2000/):

P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000.

Getting ready

The dataset is available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)).

The attributes and their values are described in the TicDataDescr.txt file. The dataset consists of several separate files, each with one instance per line and tab-delimited fields:

  • TICDATA2000.txt: Dataset to train and validate prediction models (5822 customer records). Attribute 86, CARAVAN: Number of mobile home policies, is the target variable.
  • TICEVAL2000.txt: Dataset for predictions (4000 customer records). It has the same format as TICDATA2000.txt, only the target is missing.
  • TICTGTS2000.txt: Targets for the evaluation set.
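
To make the file layout concrete, here is a minimal sketch of how one tab-delimited record could be parsed and its target read in plain Java. The class and method names are hypothetical, the sketch assumes every field parses as an integer, and the record in main is synthetic rather than taken from the dataset:

```java
/** Hypothetical helper (not part of Weka): parses one tab-delimited record. */
final class TicRecordSketch {

    // Number of attributes per record, as stated in the dataset description.
    static final int NUM_ATTRIBUTES = 86;

    /** Splits a line on tabs and parses each field as an integer (assumption). */
    static int[] parse(String line) {
        String[] fields = line.trim().split("\t");
        if (fields.length != NUM_ATTRIBUTES) {
            throw new IllegalArgumentException(
                "Expected " + NUM_ATTRIBUTES + " fields but found " + fields.length);
        }
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i].trim());
        }
        return values;
    }

    /** Attribute 86 (zero-based index 85) is CARAVAN, the target variable. */
    static int target(int[] record) {
        return record[NUM_ATTRIBUTES - 1];
    }

    public static void main(String[] args) {
        // Synthetic record: 85 zeros followed by a target value of 1.
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < NUM_ATTRIBUTES - 1; i++) {
            line.append("0\t");
        }
        line.append("1");

        int[] record = parse(line.toString());
        System.out.println("fields=" + record.length + ", CARAVAN=" + target(record));
    }
}
```

A TICEVAL2000.txt line has no target field, so it would need a different expected field count than the one assumed here.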

How to do it...

The plan is as follows. First, we will load the coil-train.arff dataset into Weka; it was preprocessed from TICDATA2000.txt so that we can start with the recipe immediately. Then, we will build two prediction models, a J48 decision tree and a Naive Bayes classifier, and run 10-fold cross-validation on the training data to obtain a preliminary estimate of their performance. We will select the better of the two. Finally, for the sake of completeness, we will use the best classifier to predict the values for the coil-test.arff file, although in a real campaign the actual values would not be known in advance:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Random;


import weka.core.converters.ArffLoader;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;

public class DirectMarketing {

  public static void main(String args[]) throws Exception {
    
    Instances trainData = new Instances(new BufferedReader(new FileReader("dataset/coil-train.arff")));
    
    trainData.setClassIndex(trainData.numAttributes() - 1);

    J48 j48 = new J48();
    j48.setOptions(new String[]{
      "-C", "0.25",  //set confidence factor
      "-M", "2"    //set min num of instances in leaf nodes
    });
    double precision = crossValidation(j48, trainData);
    System.out.println(precision);
    
    NaiveBayes nb = new NaiveBayes();
    precision = crossValidation(nb, trainData);
    
    nb.buildClassifier(trainData);
    
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("dataset/coil-test.arff"));
    Instances testData = loader.getStructure();
    testData.setClassIndex(trainData.numAttributes() - 1);
    Instance current;
    while ((current = loader.getNextInstance(testData)) != null){
       double cls = nb.classifyInstance(current);
       System.out.println(cls);
    }
  }
  
  public static double crossValidation(Classifier cls, Instances data) throws Exception{
    
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(cls, data, 10, new Random(1));
    System.out.println(eval.toSummaryString(false));
    System.out.println(eval.precision(1));
    return eval.precision(1);
  }

} 

And that's it.

How it works...

Now, let's take a closer look at the code. First, make the following imports:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Random;

import weka.core.converters.ArffLoader;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;

public class DirectMarketing {

  public static void main(String args[]) throws Exception {

Next, load the train data:

    Instances trainData = new Instances(new BufferedReader(new FileReader("dataset/coil-train.arff")));
    
    trainData.setClassIndex(trainData.numAttributes() - 1);

Build the decision tree and the Naive Bayes classifier, and evaluate them on the training data with 10-fold cross-validation:

    J48 j48 = new J48();
    j48.setOptions(new String[]{
      "-C", "0.25",	//set confidence factor
      "-M", "2"		//set min num of instances in leaf nodes
    });
    double precision = crossValidation(j48, trainData);
    System.out.println(precision);
    
    NaiveBayes nb = new NaiveBayes();
    precision = crossValidation(nb, trainData);
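
As an aside, the idea behind 10-fold cross-validation can be sketched without Weka. The class below is a hypothetical illustration, not Weka's implementation: it shuffles the instance indices with a fixed seed and deals them into k folds, so that each fold serves once as the held-out set while the remaining folds are used for training. It does not stratify the folds by class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of a plain (unstratified) k-fold split. */
final class KFoldSketch {

    /** Returns k lists of instance indices; fold i is held out in round i. */
    static List<List<Integer>> folds(int numInstances, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            indices.add(i);
        }
        // A fixed seed makes the split reproducible.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin so fold sizes differ by at most one.
        for (int i = 0; i < indices.size(); i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        int numInstances = 5822; // size of the training set in this recipe
        List<List<Integer>> folds = folds(numInstances, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            int held = folds.get(f).size();
            System.out.println("round " + f + ": train on " + (numInstances - held)
                + " instances, evaluate on " + held);
        }
    }
}
```

Every instance is used for evaluation exactly once, which is why the cross-validated precision is a more reliable preliminary estimate than one measured on the same data the model was trained on.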

Naive Bayes achieves better performance, so we train it on the complete training set and use it to classify the test data:

    nb.buildClassifier(trainData);

Load the structure of the test data; the instances themselves are read one at a time in the next step:

    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("dataset/coil-test.arff"));
    Instances testData = loader.getStructure();
    testData.setClassIndex(trainData.numAttributes() - 1);

Classify each instance and print the predicted class index:

    Instance current;
    while ((current = loader.getNextInstance(testData)) != null){
       double cls = nb.classifyInstance(current);
       System.out.println(cls);
    }
  }

The helper method that performs 10-fold cross-validation:

  public static double crossValidation(Classifier cls, Instances data) throws Exception{

Initialize a new evaluation object, run 10-fold cross-validation with a fixed random seed, and print the summary:

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(cls, data, 10, new Random(1));
    System.out.println(eval.toSummaryString(false));

Since the dataset is very unbalanced, use the precision for the target class yes (class index 1) as the evaluation measure:

    System.out.println(eval.precision(1));
    return eval.precision(1);
  }

} 

The output precision is the fraction of customers predicted to buy the policy who actually do, so it shows how accurately the model predicts who will respond to the campaign.
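
To see why precision is more informative than accuracy on unbalanced data, consider the following minimal sketch. The class, the method names, and all counts are hypothetical and chosen only for illustration; returning 0 when nothing is predicted positive is a convention of this sketch.

```java
/** Hypothetical illustration of precision versus accuracy on unbalanced data. */
final class PrecisionSketch {

    /** Precision = TP / (TP + FP); this sketch returns 0 if nothing is predicted positive. */
    static double precision(int truePositives, int falsePositives) {
        int predictedPositives = truePositives + falsePositives;
        return predictedPositives == 0 ? 0.0 : (double) truePositives / predictedPositives;
    }

    /** Accuracy = (TP + TN) / all instances. */
    static double accuracy(int tp, int fp, int tn, int fn) {
        return (double) (tp + tn) / (tp + fp + tn + fn);
    }

    public static void main(String[] args) {
        // Made-up scenario: 1000 customers, of whom 60 are buyers.

        // Model A predicts "no" for everyone: high accuracy, useless for a campaign.
        System.out.println("A: accuracy=" + accuracy(0, 0, 940, 60)
            + ", precision=" + precision(0, 0));

        // Model B flags 100 customers, 30 of whom really are buyers.
        System.out.println("B: accuracy=" + accuracy(30, 70, 870, 30)
            + ", precision=" + precision(30, 70));
    }
}
```

In this made-up scenario, model A has the higher accuracy even though it never identifies a buyer, while model B's precision says directly what share of the contacted customers would respond.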
