High-value credit card customers classification using ID3

The Iterative Dichotomiser 3 (ID3) algorithm is one of the most popular designs of the decision induction tree. It is not tolerant of missing values or noisy, and the value of attributes must come from an infinite fixed set.

ID3 uses entropy to calculate the homogeneity of a sample and also for the split. The information gain G for each attribute A is computed using the following equation. The root of the final tree is assigned with an attribute with the highest information gain. Then the new subtree is built recursively upon each value of the attribute bound to the root.

High-value credit card customers classification using ID3
High-value credit card customers classification using ID3

Note

With the play golf dataset as the input dataset, you can calculate the information gain and list using the following formulas:

  • Entropy (root) = 0.940
  • Gain () = 0.048, Gain (S, Humidity) = 0.151
  • Gain (S, Temperature) = 0.029, Gain (S, Outlook) = 0.246

ID3 (C4.5 and CART) builds the decision induction tree recursively in a top-down divide-and-conquer manner through the space of possible decision trees with a greedy strategy. Using the greedy search strategy, at each step, a decision that greatly improves the optimizing target is made. For each node, find the test condition best segment the training data assigned to it.

The characteristics of the decision induction tree in the case of ID3 include the following:

  • Each node excluding the leaf of the tree corresponds to an input attribute, each arc to a possible value of that attribute
  • Entropy is used to determine how informative a particular input attribute is about the output class on a given dataset
  • The recursive algorithm

Note

A quick description about the recursive algorithm can be defined as follows:

  • Break the original problem into two or more smaller-sized problems with the same type
  • Call the recursive algorithm on each smaller type of problem
  • Group together the results of step 2 to solve the original problem

The ID3 algorithm

The input parameters for ID3 algorithm are as follows:

  • I, denotes the set of input attributes, which may be tested by the result decision tree
  • T, the set of training data objects, or training examples

The output parameter of the algorithm is as follows:

  • O, denotes the set of output attribute, that is, the value of those attributes will be predicted by the tree

Here is the pseudocode of the general algorithm:

The ID3 algorithm
The ID3 algorithm

The R implementation

The main function of the R code for the ID3 algorithm is listed as follows. Here data is the input training dataset, ix is the set of input attributes, and ox is the output attribute:

  1 ID3 <- function(data,ix,ox){
  2    result.tree <- NULL
  3 
  4    if( IsEmpty(data) ){
  5        node.value <- "Failure"
  6        result.tree <- CreateNode(node.value)
  7        return(result.tree)
  8    }
  9    if( IsEqualAttributeValue(data,ox) ){
 10        node.value <- GetMajorityAttributeValue(data,ox)
 11        result.tree <- CreateNode(node.value)
 12        return(result.tree)
 13    }
 14    if( IsEmpty(ix) ){
 15        node.value <- GetMajorityAttributeValue(data,ox)
 16        result.tree <- CreateNode(node.value)
 17        return(result.tree)
 18    }
 19    gain <- GetInformationGain(data,ix)
 20    best.split <- GetBestSplit(data,gain,ix)
 21 
 22    values <- GetAttributeValues(best.split)
 23    values.count <- GetAttributeValuesCount(best.split)
 24    data.subsets <- SplitData(data,best.split)
 25 
 26    node.value <- best.split
 27    result.tree <- CreateNode(node.value)
 28    idx <- 0
 29    while( idx<=values.count ){
 30        idx <- idx+1
 31        newdata <- GetAt(data.subsets,idx)
 32        value <- GetAt(values,idx)
 33        new.ix <- RemoveAttribute(ix,best.split)
 34        new.child <- ID3(newdata,new.ix,ox)
 35        AddChildNode(result.tree,new.child,value)
 36    }
 37 
 38    result.tree
 39 }

Web attack detection

Along with the development of information technology, there have emerged many systems that identify malicious usage of the built software system, web system, and so on. One of them is the Intrusion Detection System (IDS), to detect the malicious behavior, conduct content inspection without the firewall. Also includes include signature detection, anomaly detection, and so on.

Classifier-like decision tree technologies, such as ID3, C4.5, and CART, play an important role as analyzers in addition to other important components of IDS, such as sensor, manager, operator, and administrator. The classifications needed here are activity monitor, file integrity checker, host firewall, log parser, and packet pattern matching.

Many issues occur for IDS. One of them is the new variety of a known attack pattern, often with low detection rate by the existing IDS. This drives the design of new types of IDS systems integrated with artificial intelligence, especially decision tree technologies.

Among real world examples, except the ones IDS has already built, there are also competitions for applying data mining techniques to web attack detection. One of them is KDD-Cup. The topic for KDD-Cup 1999 was Computer network intrusion detection, to build a classifier to predict the unauthorized behavior.

The dataset for it came from the DARPA Intrusion Detection Evaluation Program. More than five million data instances are contained in the training dataset and more than two million for test dataset. There are about 24 attack types in the training dataset, and 14 in the test dataset. Each data instance in the dataset contains 41 attributes, 9 for TCP connection, 13 for content features contained in the TCP connection, 9 for traffic features that use a two-second time window and the left for host-related traffic features. All the attacks can be categorized into the following four groups:

  • DOS: This refers to denial of service
  • R2L: This refers to unauthorized access to the local machine from the remote machine
  • U2R: This refers to unauthorized access to local super-user privileges by a local unprivileged user
  • Probing: This refers to surveillance and probing

By specific transformation, the ID3 algorithm can be applied to various web attack detection datasets with various sizes. When the size of the dataset increases, the performance of ID3 will be kept efficient by parallelization.

For simplicity, one example only takes the following four types of attacks to label a dataset for simple IDS:

  • SQL Injection
  • Cross Site Scripting
  • Code Injection
  • Directory Traversal

All the four types of attacks behave with a common pattern, the web queries with malicious pattern. Normalizing the web queries, the URL and collection of reserved tags, label-specific patterns with the appropriate label in the four types of attacks. After training ID3 on the dataset and applying it to the existing IDS, a better detection rate can be achieved.

High-value credit card customers classification

Following the growth of credit card usage, there has been a requirement in banking industry—finding high-value credit card customers from all customers to create a more customer-oriented strategy to increase profit. There are similar requirements such as finding interesting rules from the dataset.

To achieve this target, we need to enroll more correct customer attributes (no matter what type they are) to the training data object. The possible choices are transaction records, usage behaviors, customer age, annual income, education background, financial assets, and so on.

There is no need to include all customer-related attributes; the most key attributes on this target should be adopted. The domain experts might be helpful on this.

With the appropriate attributes selected, the ID3 algorithm can be applied here to finally extract sensitive features or representatives to help to judge which customer is more likely to be profitable.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.139.104.23