In this section, we will perform a case study with real-world machine learning datasets to illustrate some of the concepts from Bayesian networks.
We will use the UCI Adult dataset, also known as the Census Income dataset (http://archive.ics.uci.edu/ml/datasets/Census+Income). This dataset was extracted from the United States Census Bureau's 1994 census data and donated by Ronny Kohavi and Barry Becker, who were with Silicon Graphics at the time. The dataset consists of 48,842 instances with 14 attributes, a mix of categorical and continuous types. The target class is binary.
The problem consists of predicting the income of members of a population based on census data, specifically, whether their income is greater than $50,000.
This is a classification problem, and this time around we will be training Bayesian networks to develop predictive models, alongside the linear, non-linear, and ensemble algorithms we have used in experiments in previous chapters.
In the original dataset, there are 3,620 examples with missing values and six duplicate or conflicting instances. Here we include only examples with no missing values. This set, without unknowns, is divided into 30,162 training instances and 15,060 test instances.
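The unknown-free subsets can be produced with a simple filter. The sketch below (Python, standard library only) relies on the fact that the Adult data files mark missing values with `?`; the two sample rows are illustrative, with fields in the order documented for the dataset.

```python
import csv
import io

# Two sample rows in the Adult dataset's comma-separated format; "?" marks
# an unknown value (here, workclass and occupation in the second row).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, ?, 209642, HS-grad, 9, Married-civ-spouse, ?, Husband, White, "
    "Male, 0, 0, 45, United-States, >50K\n"
)

def drop_unknowns(file_obj):
    """Keep only instances that contain no missing ('?') values."""
    reader = csv.reader(file_obj, skipinitialspace=True)
    return [row for row in reader if row and "?" not in row]

rows = drop_unknowns(sample)
print(len(rows))  # prints 1: the second sample row has unknowns and is dropped
```

Applying the same filter to the full training and test files yields the 30,162 and 15,060 instance counts quoted above.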
The features and their descriptions are given in Table 3:
Table 3. UCI Adult dataset – features
The dataset is split by label as 24.78% (>50K) to 75.22% (<= 50K). Summary statistics of key features are given in Figure 25:
We will perform a detailed analysis on the Adult dataset using different flavors of Bayes network structures, alongside the regular linear, non-linear, and ensemble algorithms. Weka also offers an option to visualize the graph of a trained Bayes network model from the result list's context menu, as shown in Figure 26. This is very useful when a domain expert wants to understand the assumptions and structure of the graph model. If the domain expert wants to change or alter the network, this can be done easily, and the result saved, using the Bayes Network editor.
Figure 27 shows the visualization of the trained Bayes Network model's graph structure:
Table 4 presents the evaluation metrics for all the algorithms used in the experiments, including the Bayesian network classifiers as well as the non-Bayesian baselines:
| Algorithms | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area |
|---|---|---|---|---|---|---|---|---|
| Naïve Bayes (Kernel Estimator) | 0.831 | 0.391 | 0.821 | 0.831 | 0.822 | 0.494 | 0.891 | 0.906 |
| Naïve Bayes (Discretized) | 0.843 | 0.191 | 0.861 | 0.843 | 0.848 | 0.600 | 0.917 | 0.930 |
| TAN (K2, 3 Parents, Simple Estimator) | 0.859 | 0.273 | 0.856 | 0.859 | 0.857 | 0.600 | 0.916 | 0.931 |
| BayesNet (K2, 3 Parents, Simple Estimator) | 0.863 | 0.283 | 0.858 | 0.863 | 0.860 | 0.605 | 0.934 | 0.919 |
| BayesNet (K2, 2 Parents, Simple Estimator) | 0.858 | 0.283 | 0.854 | 0.858 | 0.855 | 0.594 | 0.917 | 0.932 |
| BayesNet (Hill Climbing, 3 Parents, Simple Estimator) | 0.862 | 0.293 | 0.857 | 0.862 | 0.859 | 0.602 | 0.918 | 0.933 |
| Logistic Regression | 0.851 | 0.332 | 0.844 | 0.851 | 0.845 | 0.561 | 0.903 | 0.917 |
| KNN (10) | 0.834 | 0.375 | 0.824 | 0.834 | 0.826 | 0.506 | 0.867 | 0.874 |
| Decision Tree (J48) | 0.858 | 0.300 | 0.853 | 0.858 | 0.855 | 0.590 | 0.890 | 0.904 |
| AdaBoostM1 | 0.841 | 0.415 | 0.833 | 0.841 | 0.826 | 0.513 | 0.872 | 0.873 |
| Random Forest | 0.848 | 0.333 | 0.841 | 0.848 | 0.843 | 0.555 | 0.896 | 0.913 |
Table 4. Classifier performance metrics
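For reference, the per-class metrics in Table 4 can all be computed from the counts in a 2x2 confusion matrix, as in the sketch below. The counts used here are made up for illustration; note that Weka's output reports averages weighted over both classes, while this shows the single-class definitions.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Positive-class metrics from a 2x2 confusion matrix."""
    tp_rate = tp / (tp + fn)              # a.k.a. recall / sensitivity
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    # Matthews correlation coefficient: balanced even for skewed classes
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(tp_rate=tp_rate, fp_rate=fp_rate, precision=precision,
                recall=tp_rate, f_measure=f_measure, mcc=mcc)

# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, fp=10, fn=10, tn=90)
print(m["precision"], m["mcc"])  # prints 0.9 0.8 for this symmetric example
```

ROC Area and PRC Area are not point metrics; they require ranking the test instances by predicted probability and integrating over all thresholds, which is why they are reported separately by the evaluation tooling.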
Naïve Bayes with supervised discretization performs noticeably better than Naïve Bayes with kernel estimation. This is a useful hint that discretization, which most Bayes network implementations require for continuous features, plays an important role.
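To make the discretized variant concrete, here is a minimal sketch of Naïve Bayes over binned features. Everything in it is illustrative: the data is synthetic, and equal-width binning stands in for Weka's supervised (Fayyad-Irani MDL-based) discretization, which chooses cut points using the class labels and usually works better.

```python
import math
from collections import Counter, defaultdict

def equal_width_bins(values, k=4):
    """Stand-in discretizer: equal-width bins (not Weka's supervised MDL)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0
    return [min(int((v - lo) / width), k - 1) for v in values]

class DiscreteNB:
    """Naive Bayes over discrete features, with Laplace smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors = {c: y.count(c) / len(y) for c in self.classes}
        self.counts = defaultdict(Counter)   # (feature, class) -> value counts
        self.vocab = defaultdict(set)        # feature -> values seen
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                self.counts[(j, c)][v] += 1
                self.vocab[j].add(v)
        return self

    def predict(self, row):
        def log_posterior(c):
            lp = math.log(self.priors[c])
            for j, v in enumerate(row):
                cnt = self.counts[(j, c)]
                # Laplace-smoothed P(value | class)
                lp += math.log((cnt[v] + 1) /
                               (sum(cnt.values()) + len(self.vocab[j])))
            return lp
        return max(self.classes, key=log_posterior)

# Tiny synthetic data: age alone separates the two income labels.
ages = [20, 25, 30, 35, 60, 65, 70, 75]
labels = ["<=50K"] * 4 + [">50K"] * 4
X = [[b] for b in equal_width_bins(ages, k=2)]
nb = DiscreteNB().fit(X, labels)
print(nb.predict([0]), nb.predict([1]))  # prints: <=50K >50K
```

The key design point is that once features are discrete, every conditional distribution is just a smoothed count table, which is exactly the form of parameter a Bayes network node requires.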
The results in the table show steady improvement as the complexity of the Bayes network increases. For example, Naïve Bayes with discretization, which assumes all features are independent given the class, shows a TP rate of 0.843; the TAN algorithm, where each feature is allowed one additional parent, shows a TP rate of 0.859; and the BNs with up to three parents show the best TP rates of 0.862–0.863. This clearly indicates that a more complex BN, with some nodes having up to three parents, can capture the domain knowledge and encode it well enough to predict on unseen test data.
The Bayes network whose structure is learned by search-and-score (K2 search with up to three parents, scoring with the Bayes score) and whose parameters are estimated with the simple estimator performs best on almost all of the evaluation metrics.
There is only a very small difference between the Bayes network whose structure is learned by hill climbing and the one learned by K2, showing that even local search algorithms can find a near-optimal structure.
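K2's greedy, fixed-ordering search-and-score idea can be illustrated with a toy reimplementation. This is not Weka's code (Weka's version lives in `weka.classifiers.bayes.net.search.local.K2`): it handles binary variables only, uses a BIC-penalized log-likelihood in place of the Bayes score, and runs on synthetic data where B is a noisy copy of A and C is independent noise.

```python
import math
import random

def bic_score(data, child, parents):
    """BIC-penalized log-likelihood of `child` given `parents`.
    `data` is a list of dicts mapping variable name -> 0/1."""
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        counts.setdefault(key, [0, 0])[row[child]] += 1
    ll = 0.0
    for c0, c1 in counts.values():
        total = c0 + c1
        for c in (c0, c1):
            if c:
                ll += c * math.log(c / total)
    n_params = 2 ** len(parents)   # one Bernoulli parameter per parent config
    return ll - 0.5 * n_params * math.log(len(data))

def k2_search(data, order, max_parents=3):
    """Greedy K2-style search: each node may only take parents that
    precede it in `order`; add the best-scoring parent until no gain."""
    structure = {}
    for i, node in enumerate(order):
        parents, best = [], bic_score(data, node, [])
        candidates = list(order[:i])
        while candidates and len(parents) < max_parents:
            score, cand = max((bic_score(data, node, parents + [c]), c)
                              for c in candidates)
            if score <= best:
                break
            best, parents = score, parents + [cand]
            candidates.remove(cand)
        structure[node] = parents
    return structure

rng = random.Random(42)
data = []
for _ in range(500):
    a = rng.randint(0, 1)
    b = a if rng.random() > 0.1 else 1 - a   # B depends strongly on A
    data.append({"A": a, "B": b, "C": rng.randint(0, 1)})

structure = k2_search(data, order=["A", "C", "B"])
print(structure)  # A should appear among B's parents
```

Note how the ordering constraint keeps the graph acyclic by construction; that restriction is what makes K2 fast, and it is also why a good variable ordering matters.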
The Bayes network with a three-parent structure beats most of the linear, non-linear, and ensemble methods, such as AdaBoostM1 and Random Forest, on almost all of the metrics on unseen test data. This shows the strength of BNs: they not only learn structure and parameters from data, even in domains with missing values, but also predict well enough on unseen data to beat other sophisticated algorithms.