Bayes classification is a probabilistic classification algorithm based on Bayes' theorem. It predicts the class of an instance as the one that maximizes the posterior probability. The risk of Bayes classification is that it needs enough data to estimate the joint probability density reliably.
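Given a dataset $D$ of size $n$, each instance or point $x_j$ belonging to $D$ has dimension $m$ and a class label $y_j \in \{c_1, c_2, \ldots, c_k\}$, for each $j = 1, \ldots, n$. To predict the class $\hat{y}$ of any $x$, we use the following formula:
$$\hat{y} = \arg\max_{c_i} P(c_i \mid x)$$
Based on Bayes' theorem, where $P(x \mid c_i)$ is the likelihood and $P(c_i)$ is the prior probability:
$$P(c_i \mid x) = \frac{P(x \mid c_i)\, P(c_i)}{P(x)}$$
Then we get the following new equation for predicting $\hat{y}$ for $x$. The denominator $P(x)$ is dropped because it does not depend on the class:
$$\hat{y} = \arg\max_{c_i} \frac{P(x \mid c_i)\, P(c_i)}{P(x)} = \arg\max_{c_i} P(x \mid c_i)\, P(c_i)$$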
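As a minimal sketch of this decision rule, the following R snippet picks the class with the largest product of likelihood and prior. The class names and all probability values are made-up numbers for illustration only; they do not come from any dataset.

```r
# Hypothetical priors P(c_i) and likelihoods P(x | c_i) for a single point x
prior <- c(c1 = 0.6, c2 = 0.4)
likelihood <- c(c1 = 0.02, c2 = 0.05)

# Unnormalized posterior: P(x | c_i) * P(c_i)
joint <- likelihood * prior

# Dividing by P(x), the sum of the joint terms, yields the posterior,
# but it does not change which class attains the maximum
posterior <- joint / sum(joint)

# Predicted class: the one with the largest (unnormalized) posterior
predicted <- names(which.max(joint))
```

Because the normalizing term is shared by all classes, comparing the unnormalized products is enough to choose the class.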
With these new definitions for predicting a class, the prior probability and the likelihood need to be estimated.
Given the dataset $D$, if the number of instances in $D$ labeled with class $c_i$ is $n_i$ and the size of $D$ is $n$, we get the estimate for the prior probability of the class $c_i$ as follows:
$$\hat{P}(c_i) = \frac{n_i}{n}$$
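The prior estimate is simply a relative frequency of class labels. A small sketch in R, assuming a hypothetical label vector (the labels below are invented for illustration):

```r
# Hypothetical class labels for a dataset D of size n = 10
labels <- c("a", "a", "b", "a", "b", "c", "a", "b", "a", "c")

n <- length(labels)                           # size of D
n.i <- sapply(split(labels, labels), length)  # number of instances per class
prior.p <- n.i / n                            # estimated prior per class
```

The estimates sum to one by construction, since every instance belongs to exactly one class.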
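For numeric attributes (assuming all attributes are numeric), the likelihood is estimated under one assumption: each class $c_i$ is normally distributed around some mean $\mu_i$ with the corresponding covariance matrix $\Sigma_i$. The sample mean $\hat{\mu}_i$ is used to estimate $\mu_i$, and the sample covariance matrix $\hat{\Sigma}_i$ for $\Sigma_i$, where $D_i$ denotes the $n_i$ points of $D$ labeled with $c_i$:
$$\hat{\mu}_i = \frac{1}{n_i} \sum_{x_j \in D_i} x_j, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_{x_j \in D_i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T$$
$$\hat{P}(x \mid c_i) \propto f(x \mid \hat{\mu}_i, \hat{\Sigma}_i) = \frac{1}{\sqrt{(2\pi)^m \, |\hat{\Sigma}_i|}} \exp\left( -\frac{(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i)}{2} \right)$$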
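To make the numeric case concrete, here is a hedged sketch that estimates the mean vector and covariance matrix of one class and evaluates the normal density at a point. It uses the iris data that ships with R purely as a convenient numeric example; the helper name GaussianDensity and the query point are assumptions made for this illustration, and the covariance is computed with the maximum-likelihood divisor (the class size) rather than R's default.

```r
# Multivariate normal density with mean mu and covariance sigma, evaluated at x
GaussianDensity <- function(x, mu, sigma) {
  m <- length(mu)
  d <- x - mu
  exp(-0.5 * sum(d * solve(sigma, d))) / sqrt((2 * pi)^m * det(sigma))
}

# Points of one class: the setosa rows of the built-in iris data
d.i <- as.matrix(iris[iris$Species == "setosa", 1:4])
n.i <- nrow(d.i)

mu.hat <- colMeans(d.i)              # estimated mean vector
z.i <- sweep(d.i, 2, mu.hat)         # centered points
sigma.hat <- crossprod(z.i) / n.i    # estimated covariance matrix (divisor n.i)

# Estimated likelihood of an arbitrary query point under this class
x <- c(5.0, 3.4, 1.5, 0.2)
likelihood <- GaussianDensity(x, mu.hat, sigma.hat)
```

Note that R's cov() divides by the class size minus one, so it differs from sigma.hat by that constant factor.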
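Categorical attributes can be handled similarly, with a minor difference: the likelihood is estimated from the observed frequencies of the attribute values within each class rather than from a normal density.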
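For the categorical case, one simple way to estimate the likelihood is to count how often each attribute value occurs within each class. The sketch below assumes a single categorical attribute; the attribute values and class labels are invented for illustration.

```r
# Hypothetical training data: one categorical attribute and a class label
value <- c("u", "v", "u", "u", "v", "v", "v", "u")
label <- c("c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2")

# Contingency table of counts: rows are classes, columns are attribute values
counts <- table(label, value)

# Row-wise relative frequencies estimate P(value | class)
cond.p <- prop.table(counts, margin = 1)
```

Here cond.p["c1", "u"] is the estimated probability of observing the value u given class c1. With several categorical attributes, the same counting applies to the joint combinations of values, which is where the need for enough data becomes visible.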
The R code for the Bayes classification is listed as follows:
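# Train a Bayes classifier for numeric attributes.
# data: numeric matrix or data frame, one row per instance; classes: label per row.
BayesClassifier <- function(data, classes) {
  data <- as.matrix(data)
  # Split the instances by class label
  data.subsets <- lapply(split(seq_len(nrow(data)), classes),
                         function(idx) data[idx, , drop = FALSE])
  cards <- sapply(data.subsets, nrow)        # cardinality of each class
  prior.p <- cards / sum(cards)              # prior probability of each class
  means <- lapply(data.subsets, colMeans)    # mean vector of each class
  # Covariance matrix of each class, using the class size as the divisor
  cov.m <- lapply(data.subsets, function(s) {
    crossprod(sweep(s, 2, colMeans(s))) / nrow(s)
  })

  # R passes arguments by value, so the model is assembled as a list
  # instead of being filled in by side effects
  bayes.model <- list(cards = cards, prior.p = prior.p,
                      means = means, cov.m = cov.m)
  return(bayes.model)
}

# Return the label that maximizes log(likelihood * prior); the constant term
# of the normal density is omitted because it is the same for every class
GetLabelForMaxPostProbability <- function(bayes.model, x) {
  x <- as.numeric(x)
  scores <- sapply(names(bayes.model$prior.p), function(k) {
    d <- x - bayes.model$means[[k]]
    s <- bayes.model$cov.m[[k]]
    log(bayes.model$prior.p[[k]]) - 0.5 * log(det(s)) - 0.5 * sum(d * solve(s, d))
  })
  names(which.max(scores))
}

# Train on the data and classify a single point x. GetTrainingData() and
# GetClasses() are data-access placeholders that are not defined here;
# pass data and classes explicitly to run without them.
TestClassifier <- function(x, data = GetTrainingData(), classes = GetClasses()) {
  bayes.model <- BayesClassifier(data, classes)
  y <- GetLabelForMaxPostProbability(bayes.model, x)
  return(y)
}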
The following section applies the Bayes classification algorithm to an example.
A Trojan horse is a malicious program that surreptitiously performs its operations under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software, and redirect the traffic of affected services to other systems, making them available to attackers who can then hijack connections, intercept valuable data, and inject fake information or phishing content.
Depending on their purpose, Trojans come in many designs, each with a certain traffic behavior. Once Trojan traffic can be identified, further processing can be performed to protect information. As a result, detecting Trojan traffic is one of the main tasks in detecting Trojans on a system. The behavior of a Trojan is an outlier compared to that of normal software, so classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:
The malicious traffic behaviors include, but are not limited to, spoofing source IP addresses and short-term or long-term scanning of address/port flows, which serves as reconnaissance for subsequent attacks. Known Trojan traffic behaviors are used as the positive training instances, and normal traffic behaviors are used as the negative instances in the training dataset. These kinds of datasets are continuously collected by NGOs.
The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP header information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, source port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at a certain checkpoint, and the class label for the type of detection. DNS traffic plays an important role in Trojan detection too; Trojan traffic has a certain relation to DNS traffic.
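As an illustration of the attribute set just described, such a dataset might be held in an R data frame like the one sketched here. Every value is synthetic and invented for this example (the IP addresses come from ranges reserved for documentation), and the column names are hypothetical choices rather than a standard schema.

```r
# Synthetic flow records; all values below are invented for illustration
flows <- data.frame(
  src.ip    = c("192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"),
  src.port  = c(51234L, 51300L, 49160L, 49161L),
  dst.ip    = c("198.51.100.20", "198.51.100.21", "203.0.113.5", "203.0.113.5"),
  dst.port  = c(443L, 80L, 6667L, 6667L),
  n.flows   = c(3L, 5L, 40L, 60L),
  n.packets = c(120L, 200L, 900L, 1100L),
  n.bytes   = c(90000, 150000, 54000, 66000),
  timestamp = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:05:00",
                           "2020-01-01 10:10:00", "2020-01-01 10:15:00"),
                         tz = "UTC"),
  label     = factor(c("normal", "normal", "trojan", "trojan")),
  stringsAsFactors = FALSE
)

# Numeric attributes that a Gaussian Bayes classifier could use directly
numeric.cols <- c("n.flows", "n.packets", "n.bytes")

# Per-class means of the numeric attributes
class.means <- aggregate(flows[numeric.cols],
                         by = list(label = flows$label), FUN = mean)
```

Attributes such as IP addresses and ports are identifiers rather than measurements, so in practice they would likely be treated as categorical or transformed before being used as features.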
Traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of data mining techniques for the classification of Trojan traffic, and the Bayesian classifier is one of the better solutions among them. The preceding diagram is one such possible structure.