The Iterative Dichotomiser 3 (ID3) algorithm is one of the most popular designs of the decision induction tree. It is not tolerant of missing values or noisy, and the value of attributes must come from an infinite fixed set.
ID3 uses entropy to calculate the homogeneity of a sample and also for the split. The information gain G
for each attribute A
is computed using the following equation. The root of the final tree is assigned with an attribute with the highest information gain. Then the new subtree is built recursively upon each value of the attribute bound to the root.
ID3 (C4.5 and CART) builds the decision induction tree recursively in a top-down divide-and-conquer manner through the space of possible decision trees with a greedy strategy. Using the greedy search strategy, at each step, a decision that greatly improves the optimizing target is made. For each node, find the test condition best segment the training data assigned to it.
The characteristics of the decision induction tree in the case of ID3 include the following:
The input parameters for ID3 algorithm are as follows:
The output parameter of the algorithm is as follows:
The main function of the R code for the ID3 algorithm is listed as follows. Here data
is the input training dataset, ix
is the set of input attributes, and ox
is the output attribute:
1 ID3 <- function(data,ix,ox){ 2 result.tree <- NULL 3 4 if( IsEmpty(data) ){ 5 node.value <- "Failure" 6 result.tree <- CreateNode(node.value) 7 return(result.tree) 8 } 9 if( IsEqualAttributeValue(data,ox) ){ 10 node.value <- GetMajorityAttributeValue(data,ox) 11 result.tree <- CreateNode(node.value) 12 return(result.tree) 13 } 14 if( IsEmpty(ix) ){ 15 node.value <- GetMajorityAttributeValue(data,ox) 16 result.tree <- CreateNode(node.value) 17 return(result.tree) 18 } 19 gain <- GetInformationGain(data,ix) 20 best.split <- GetBestSplit(data,gain,ix) 21 22 values <- GetAttributeValues(best.split) 23 values.count <- GetAttributeValuesCount(best.split) 24 data.subsets <- SplitData(data,best.split) 25 26 node.value <- best.split 27 result.tree <- CreateNode(node.value) 28 idx <- 0 29 while( idx<=values.count ){ 30 idx <- idx+1 31 newdata <- GetAt(data.subsets,idx) 32 value <- GetAt(values,idx) 33 new.ix <- RemoveAttribute(ix,best.split) 34 new.child <- ID3(newdata,new.ix,ox) 35 AddChildNode(result.tree,new.child,value) 36 } 37 38 result.tree 39 }
Along with the development of information technology, there have emerged many systems that identify malicious usage of the built software system, web system, and so on. One of them is the Intrusion Detection System (IDS), to detect the malicious behavior, conduct content inspection without the firewall. Also includes include signature detection, anomaly detection, and so on.
Classifier-like decision tree technologies, such as ID3, C4.5, and CART, play an important role as analyzers in addition to other important components of IDS, such as sensor, manager, operator, and administrator. The classifications needed here are activity monitor, file integrity checker, host firewall, log parser, and packet pattern matching.
Many issues occur for IDS. One of them is the new variety of a known attack pattern, often with low detection rate by the existing IDS. This drives the design of new types of IDS systems integrated with artificial intelligence, especially decision tree technologies.
Among real world examples, except the ones IDS has already built, there are also competitions for applying data mining techniques to web attack detection. One of them is KDD-Cup. The topic for KDD-Cup 1999 was Computer network intrusion detection, to build a classifier to predict the unauthorized behavior.
The dataset for it came from the DARPA Intrusion Detection Evaluation Program. More than five million data instances are contained in the training dataset and more than two million for test dataset. There are about 24 attack types in the training dataset, and 14 in the test dataset. Each data instance in the dataset contains 41 attributes, 9 for TCP connection, 13 for content features contained in the TCP connection, 9 for traffic features that use a two-second time window and the left for host-related traffic features. All the attacks can be categorized into the following four groups:
By specific transformation, the ID3 algorithm can be applied to various web attack detection datasets with various sizes. When the size of the dataset increases, the performance of ID3 will be kept efficient by parallelization.
For simplicity, one example only takes the following four types of attacks to label a dataset for simple IDS:
All the four types of attacks behave with a common pattern, the web queries with malicious pattern. Normalizing the web queries, the URL and collection of reserved tags, label-specific patterns with the appropriate label in the four types of attacks. After training ID3 on the dataset and applying it to the existing IDS, a better detection rate can be achieved.
Following the growth of credit card usage, there has been a requirement in banking industry—finding high-value credit card customers from all customers to create a more customer-oriented strategy to increase profit. There are similar requirements such as finding interesting rules from the dataset.
To achieve this target, we need to enroll more correct customer attributes (no matter what type they are) to the training data object. The possible choices are transaction records, usage behaviors, customer age, annual income, education background, financial assets, and so on.
There is no need to include all customer-related attributes; the most key attributes on this target should be adopted. The domain experts might be helpful on this.
With the appropriate attributes selected, the ID3 algorithm can be applied here to finally extract sensitive features or representatives to help to judge which customer is more likely to be profitable.
3.139.104.23