This chapter covers the following items:
– Algorithm for the maximum a posteriori hypothesis
– Algorithm for classification with probability modeling
– Examples and applications
This chapter provides brief information on Bayesian classifiers, which are regarded as statistical classifiers. They are capable of predicting class membership probabilities, such as the probability that a given dataset belongs to a particular class. Bayesian classification is rooted in Bayes' theorem. Studies comparing classification algorithms have found a simple Bayesian classifier, known as the naive Bayesian classifier, to be comparable in performance to decision tree and selected neural network classifiers. Bayesian classifiers have also proven to yield high speed as well as accuracy when applied to large datasets. Naive Bayesian classifiers presume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It simplifies the computations involved, and for this reason the classifier is termed "naive". First, we deal with basic probability notation and Bayes' theorem. Then we describe how to perform naive Bayesian classification [1–7].
Bayes' theorem is named after its mastermind, Thomas Bayes, an English clergyman who worked on subjects such as decision theory and probability in the eighteenth century [8]. The setting can be explained as follows: let X be a dataset. In Bayesian terminology, X is regarded as the "evidence" and is described by measurements made on a set of n attributes. Let H be a hypothesis, such as that the dataset X belongs to a specific class C. One wants to determine P(H|X), the probability that the hypothesis H holds given the "evidence", that is, the observed dataset X. In other words, one seeks the probability that dataset X belongs to class C, given that the attribute description of X is known.
P(H|X) is the posterior probability, also known as the a posteriori probability, of H conditioned on X. For example, based on the dataset presented in Table 2.8, the Year and Inflation (consumer prices [annual %]) attributes are the relevant ones defined and limited by the economies. Suppose that X has an inflation amounting to $575.000 (belonging to the year 2003 for New Zealand), and that H is the hypothesis that the economy will make a profit based on its inflation in the next 2 years. Since we know the Inflation (consumer prices [annual %]) value of the New Zealand economy for 2003, we can calculate P(H|X), the probability of the profit situation. Conversely, P(X|H) is the posterior probability of X conditioned on H: knowing that the New Zealand economy is in a profitable situation within that year, P(X|H) is the probability of observing an Inflation (consumer prices [annual %]) of $575.000 in 2003. P(H) is the prior probability of H, and P(X) is the prior probability of X [9].
In practice, P(H), P(X|H), and P(X) may be estimated from the given data. Bayes' theorem is useful because it provides a way of calculating the posterior probability P(H|X) from P(H), P(X|H), and P(X). Bayes' theorem is denoted as follows:

P(H|X) = P(X|H)P(H) / P(X)   (7.1)
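To make the theorem concrete, the following minimal Python sketch computes the posterior from the three quantities on the right-hand side of eq. (7.1); the probability values are hypothetical, chosen only for illustration.

```python
# Minimal numerical sketch of Bayes' theorem (eq. (7.1)).
# All probability values below are hypothetical illustrations.

def posterior(p_h, p_x_given_h, p_x):
    """Return P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

p_h = 0.3          # P(H): prior probability of the hypothesis
p_x_given_h = 0.8  # P(X|H): likelihood of the evidence under H
p_x = 0.5          # P(X): prior probability of the evidence

print(posterior(p_h, p_x_given_h, p_x))  # 0.48
```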
Now let us have a look at the way Bayes' theorem is used in naive Bayesian classification [10]. As mentioned earlier, this is also known as the simple Bayesian classifier. It works as follows:
Let D be a training set of datasets and their associated class labels. As always, each dataset is depicted by an n-dimensional attribute vector, X = (x1, x2, ..., xn), representing the n measurements made on the dataset from the n attributes A1, A2, ..., An, in that order.
Suppose that there are m classes, C1, C2, ..., Cm. Given a dataset X, the classifier will predict that X belongs to the class having the maximum posterior probability conditioned on X. In other words, the naive Bayesian classifier predicts that dataset X belongs to class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

This is the requirement for X to belong to class Ci. In this case P(Ci|X) is maximized, and the corresponding hypothesis is called the maximum posterior hypothesis. Through Bayes' theorem (eq. (7.2)), the relevant calculation is done as follows [10–12]:

P(Ci|X) = P(X|Ci)P(Ci) / P(X)   (7.2)
Since P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, it is common to assume that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm); in that case one maximizes P(X|Ci). Otherwise, one maximizes P(X|Ci)P(Ci). The class prior probabilities can be estimated by P(Ci) = |Ci,D| / |D|, in which |Ci,D| is the number of training datasets of class Ci in D.
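As a minimal sketch of this prior estimation, the following Python fragment counts class labels in a training set; the label list is a hypothetical stand-in for the class labels of D.

```python
from collections import Counter

# Hypothetical class labels of the training set D; |Ci,D| is the
# number of occurrences of class Ci among them.
labels = ["Patient", "Patient", "Healthy", "Patient", "Healthy"]

counts = Counter(labels)
priors = {c: counts[c] / len(labels) for c in counts}
print(priors)  # {'Patient': 0.6, 'Healthy': 0.4}
```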
If datasets have many attributes, computing P(X|Ci) can be expensive. To reduce this computation, the naive assumption of class-conditional independence is made. It presumes that the attribute values are conditionally independent of one another given the class label, that is, no dependence relationships exist among the attributes. We obtain the following:

P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)   (7.3)
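Under this independence assumption, P(X|Ci) is just the product of the per-attribute conditionals, as the short sketch below shows; the three probabilities are hypothetical and would in practice be estimated as described next.

```python
import math

# Hypothetical per-attribute conditional probabilities P(xk|Ci)
# for a dataset X with three attributes.
p_xk_given_ci = [0.5, 0.25, 0.4]

# eq. (7.3): P(X|Ci) is the product of the individual conditionals.
p_x_given_ci = math.prod(p_xk_given_ci)
print(p_x_given_ci)  # 0.05
```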
It is easy to estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training datasets. For each attribute, one checks whether the attribute is categorical or continuous-valued. For example, to calculate P(xk|Ci), one considers the following:
(a) If Ak is categorical, then P(xk|Ci) is the number of datasets of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of datasets of class Ci in D.
(b) If Ak is continuous-valued, there is some more work to be done (the calculation is relatively straightforward, though). A continuous-valued attribute is assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined by the following formula [13]:

g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))   (7.5)

to get the following:

P(xk|Ci) = g(xk, μCi, σCi)   (7.6)
Such an equation may look intimidating, but you only have to compute μCi and σCi, which are the mean (average) and the standard deviation, respectively, of the values of attribute Ak for the training datasets of class Ci. These two quantities are then plugged into eq. (7.6), together with xk, to estimate P(xk|Ci).
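A minimal sketch of this continuous-valued case, assuming a small hypothetical sample of Ak values for class Ci:

```python
import math
import statistics

def gaussian(x, mu, sigma):
    """eq. (7.5): Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical values of attribute Ak among the training datasets of class Ci.
ak_values_in_ci = [35.0, 41.0, 38.0, 44.0]

mu_ci = statistics.mean(ak_values_in_ci)      # class mean of Ak
sigma_ci = statistics.stdev(ak_values_in_ci)  # class standard deviation of Ak

# eq. (7.6): estimate P(xk|Ci) as g(xk, mu_Ci, sigma_Ci).
xk = 40.0
print(gaussian(xk, mu_ci, sigma_ci))
```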
To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of dataset X is the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i; that is, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum [14].
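The decision rule can be sketched in a few lines; the priors and conditionals below are hypothetical, standing in for quantities estimated from D.

```python
import math

# Hypothetical priors P(Ci) and per-attribute conditionals P(xk|Ci)
# for a dataset X with three attributes.
priors = {"C1": 0.6, "C2": 0.4}
conditionals = {"C1": [0.5, 0.25, 0.4], "C2": [0.3, 0.5, 0.2]}

# Choose the class maximizing P(X|Ci) * P(Ci), with P(X|Ci) from eq. (7.3).
scores = {c: priors[c] * math.prod(conditionals[c]) for c in priors}
print(scores)                       # {'C1': 0.03, 'C2': 0.012}
print(max(scores, key=scores.get))  # C1
```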
How effective are Bayesian classifiers? To answer this question, empirical studies comparing them with decision tree and neural network classifiers have been conducted, and they have found Bayesian classifiers to be comparable in some scopes. Theoretically speaking, Bayesian classifiers yield the minimum error rate compared with all other classifiers. Yet this is not always the case in practice, owing to inaccuracies in the assumptions made for their use.
Returning to the uses of Bayesian classifiers, they are also valuable because they supply a theoretical justification for other classifiers that do not employ Bayes' theorem explicitly. Under certain assumptions, many curve-fitting and neural network algorithms output the maximum posteriori hypothesis, as the naive Bayesian classifier does [15].
The flowchart of the Bayesian classifier algorithm is provided in Figure 7.1. Figure 7.2 lists the steps of the Bayesian classifier algorithm in detail. The procedural steps of the naive Bayes classifier algorithm in Figure 7.2 are explained in the order given below:
Step (1) Let D be a training set of datasets and their associated class labels, where each dataset is denoted by an n-dimensional attribute vector X = (x1, x2, ..., xn), and suppose that there exist m classes, C1, C2, ..., Cm. Given a dataset X, the classifier will predict that X belongs to the class having the maximum posterior probability conditioned on X.
Step (2–4) The classifier estimates that dataset X belongs to class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

This is the requirement for X to belong to class Ci.
Step (5–9) In this case P(Ci|X) is maximized and is called the maximum posterior hypothesis. Through Bayes' theorem (eq. (7.2)), the relevant calculation is done as follows:
It is required only to maximize P(X|Ci)P(Ci), because P(X) is constant for all classes. If the class prior probabilities are not known, it is common to assume that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm); hence one would maximize P(X|Ci). Otherwise, one maximizes P(X|Ci)P(Ci). The class prior probabilities can be estimated by P(Ci) = |Ci,D| / |D|, in which |Ci,D| is the number of training datasets of class Ci in D.
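The steps above can be collected into a compact sketch for categorical attributes; the tiny training set is hypothetical and stands in for D.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Step (1): estimate P(Ci) and the counts behind P(xk|Ci) from D."""
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    cond = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(c, k)][v] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    """Steps (2-9): pick the class maximizing P(X|Ci) * P(Ci)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for k, v in enumerate(x):
            p *= cond[(c, k)][v] / class_counts[c]  # P(xk|Ci)
        scores[c] = p
    return max(scores, key=scores.get), scores

# Hypothetical categorical training data with two attributes.
rows = [("Bachelor", "Middle"), ("HighSchool", "Young"),
        ("Bachelor", "Middle"), ("Bachelor", "Old")]
labels = ["Patient", "Healthy", "Patient", "Healthy"]

model = train_naive_bayes(rows, labels)
print(predict(("Bachelor", "Middle"), *model))  # ('Patient', ...)
```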
Example 7.1 A sample dataset (see Table 2.19) has been chosen from the WAIS-R dataset. Using the naive Bayes classifier algorithm, let us calculate the probability of which class (patient or healthy) an individual will belong to. Assume that the individual has the following attributes (Table 7.1):
x1 : School education = Bachelor, x2 : Age = Middle, x3 : Gender = Female
Table 7.1: Sample WAIS-R dataset (school education, age and gender) based on Table 2.19.
Solution 7.1 In order to calculate the probability as to which class (patient or healthy) an individual with the attributes x1 = School education: Bachelor, x2 = Age: Middle, x3 = Gender: Female will belong to, let us apply the naive Bayes algorithm in Figure 7.2.
Step (1) Let us calculate the probability of which class (patient or healthy) the individuals with the School education, Age and Gender attributes in the WAIS-R sample dataset in Table 7.2 will belong to.
The class membership probability for each attribute is calculated in Table 7.2. The classification of the individuals is labeled as C1 = Patient and C2 = Healthy. The probability of each attribute (School education, Age, Gender) belonging to each class (Patient, Healthy) has been calculated through eq. (7.2). The steps for this probability are as follows:
Step (2–7) The probabilities for (a) the Patient class and (b) the Healthy class are calculated as follows.

(a) Calculation of the conditional probability P(X|Class = Patient): for this conditional probability, the calculations are done separately for the X = [x1, x2, x3] values.
Table 7.2: Sample WAIS-R dataset conditional probabilities.
Thus, using eq. (7.3), the following result is obtained:
Now, let us also calculate P(Class = Patient).
Thus,
The probability that an individual with the attributes x1 = School education: Bachelor, x2 = Age: Middle, x3 = Gender: Female belongs to the Patient class has been obtained as approximately 0.03.
(b) Calculation of the conditional probability P(X|Class = Healthy): for this conditional probability, the calculations are done separately for the X = [x1, x2, x3] values.
Thus, using eq. (7.3), the following result is obtained:
Now, let us also calculate P(Class = Healthy).
Thus,
The probability that an individual with the attributes x1 = School education: Bachelor, x2 = Age: Middle, x3 = Gender: Female belongs to the Healthy class has been obtained as approximately 0.027.
Step (8–9) The maximum posterior probability conditioned on X is determined. According to the maximum posterior probability method, the value of P(X|Ci)P(Ci) is calculated for each class, and the maximum probability value is taken from the results obtained in Step (2–7). Since 0.03 > 0.027, the individual with the attributes given above belongs to the Patient class.
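The final comparison in Steps (8–9) amounts to taking the larger of the two scores computed above, as this short sketch shows:

```python
# Scores P(X|Ci) * P(Ci) obtained in Steps (2-7) of Example 7.1.
scores = {"Patient": 0.03, "Healthy": 0.027}

# Steps (8-9): the maximum posterior hypothesis wins.
print(max(scores, key=scores.get))  # Patient
```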
The Bayesian classifier algorithm has been applied to the economy (U.N.I.S.) dataset, the multiple sclerosis (MS) dataset and the WAIS-R dataset in Sections 7.1.1, 7.1.2 and 7.1.3, respectively.
As the second set of data to be used in the following sections, we will use data related to the economies of the U.N.I.S. countries. The attributes of these countries' economies are data regarding years, gross value added at factor cost ($), tax revenue, net domestic credit and gross domestic product (GDP) growth (annual %). The economy (U.N.I.S.) dataset is made up of a total of 18 attributes. Data belonging to the economies of the USA, New Zealand, Italy and Sweden from 1960 to 2015 are defined on the basis of the attributes provided in Table 2.8, Economy (U.N.I.S.) Dataset (http://data.worldbank.org) [15]. For the classification of the D matrix through the naive Bayesian algorithm, the first step is the training procedure. For this purpose, 66.66% of the D matrix is allocated as the training dataset (151 × 18) and 33.33% as the test dataset (77 × 18).
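A minimal sketch of this split-and-classify procedure with scikit-learn's GaussianNB; the random matrix below is a hypothetical stand-in for the 228 × 18 economy (U.N.I.S.) data matrix D and its country labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in for the 228 x 18 matrix D of Table 2.8 and its
# country labels (0 = USA, 1 = New Zealand, 2 = Italy, 3 = Sweden).
rng = np.random.default_rng(0)
X = rng.random((228, 18))
y = rng.integers(0, 4, size=228)

# 66.66% training (151 x 18) and 33.33% test (77 x 18), as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=77, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the test dataset
```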
Following the training with the naive Bayesian algorithm on the training dataset, we can perform the classification of the test dataset. The steps of the procedure may seem complex at first glance, but focusing on the steps and developing an insight into them is sufficient to proceed. At this point, we can take a closer look at the steps in Figure 7.4.
Let us do a step-by-step analysis of the economy dataset according to the Bayesian classifier in Example 7.2, based on Figure 7.3.
Example 7.2 The sample economy dataset (see Table 7.3) has been chosen from the economy (U.N.I.S.) dataset (see Table 2.8). Let us calculate, through the naive Bayes classifier algorithm, the probability as to which class (U.N.I.S.) a country with the attributes
x1 : Net Domestic Credit = 49,000 (current LCU),
x2 : Tax Revenue = 2,934 (% of GDP), x3 : Year = 2003 (Table 7.3)
will belong to.
Solution 7.2 In order to calculate the probability as to which class (USA, New Zealand, Italy or Sweden) a country with the attributes
x1 : Net Domestic Credit = 49,000 (current LCU),
x2 : Tax Revenue = 2,934 (% of GDP), x3 : Year = 2003
will belong to, let us apply the naive Bayes algorithm in Figure 7.3.
Table 7.3: Selected sample from the Economy Dataset.
Step (1) Let us calculate the probability of which class (U.N.I.S.) the economies with the net domestic credit, tax revenue and year attributes in the economy sample dataset in Table 7.4 will belong to.
The class membership probability for each attribute is calculated in Table 7.4. The classification of the country IDs is labeled as C1 = United States, C2 = New Zealand, C3 = Italy, C4 = Sweden. The probability of each attribute (net domestic credit, tax revenue and year) belonging to each class (USA, New Zealand, Italy, Sweden) has been calculated through eq. (7.2). The steps for this probability are as follows:
Step (2–7) (a) Calculation of the probability for the USA class: for the conditional probability P(X|USA), the probability calculations are done separately for the X = {x1, x2, x3} values.
Using eq. (7.3), the following result is obtained:
Now, let us also calculate P(Class = USA).
Thus, the following result is calculated:
It has been identified that an economy with the attributes
x1 : Net Domestic Credit = 49,000 (current LCU),
x2 : Tax Revenue = 2,934 (% of GDP), x3 : Year = 2003
has a probability of approximately 0.111 of belonging to the USA class.
(b) The conditional probability for the New Zealand class, P(X|New Zealand)P(New Zealand), has been obtained as 0.
(c) The conditional probability for the Italy class, P(X|Italy)P(Italy), has been obtained as 0.
(d) Calculation of the conditional probability for the Sweden class, P(X|Sweden)P(Sweden).