For this example of working with Naïve Bayes in R, we are going to use the Titanic dataset. The classification problem is to predict whether or not individuals died in the Titanic accident. We will create a training dataset and a testing dataset (in order to test how well the classifier performs).
The first thing we need to know is how to convert the Titanic dataset (which is of class table) to a data frame:
Titanic.df_weighted = data.frame(Titanic)
Let's have a look at the dataset:
|    | Class | Sex    | Age   | Survived | Freq |
|----|-------|--------|-------|----------|------|
| 1  | 1st   | Male   | Child | No       | 0    |
| 2  | 2nd   | Male   | Child | No       | 0    |
| 3  | 3rd   | Male   | Child | No       | 35   |
| 4  | Crew  | Male   | Child | No       | 0    |
| 5  | 1st   | Female | Child | No       | 0    |
| 6  | 2nd   | Female | Child | No       | 0    |
| 7  | 3rd   | Female | Child | No       | 17   |
| 8  | Crew  | Female | Child | No       | 0    |
| 9  | 1st   | Male   | Adult | No       | 118  |
| 10 | 2nd   | Male   | Adult | No       | 154  |
| 11 | 3rd   | Male   | Adult | No       | 387  |
| 12 | Crew  | Male   | Adult | No       | 670  |
| 13 | 1st   | Female | Adult | No       | 4    |
| 14 | 2nd   | Female | Adult | No       | 13   |
| 15 | 3rd   | Female | Adult | No       | 89   |
| 16 | Crew  | Female | Adult | No       | 3    |
| 17 | 1st   | Male   | Child | Yes      | 5    |
| 18 | 2nd   | Male   | Child | Yes      | 11   |
| 19 | 3rd   | Male   | Child | Yes      | 13   |
| 20 | Crew  | Male   | Child | Yes      | 0    |
| 21 | 1st   | Female | Child | Yes      | 1    |
| 22 | 2nd   | Female | Child | Yes      | 13   |
| 23 | 3rd   | Female | Child | Yes      | 14   |
| 24 | Crew  | Female | Child | Yes      | 0    |
| 25 | 1st   | Male   | Adult | Yes      | 57   |
| 26 | 2nd   | Male   | Adult | Yes      | 14   |
| 27 | 3rd   | Male   | Adult | Yes      | 75   |
| 28 | Crew  | Male   | Adult | Yes      | 192  |
| 29 | 1st   | Female | Adult | Yes      | 140  |
| 30 | 2nd   | Female | Adult | Yes      | 80   |
| 31 | 3rd   | Female | Adult | Yes      | 76   |
| 32 | Crew  | Female | Adult | Yes      | 20   |
As can be seen, there are five attributes: the class of the individuals (first class, second class, third class, or crew), their sex and age, whether they survived, and the frequency of cases in each cell. In order to create our training and testing datasets, we first need to reconstruct the full dataset—that is, without the frequency weightings. We'll write some ad hoc code to do just this:
Titanic.df_weighted = data.frame(Titanic)

# creating an empty data frame to be populated
Titanic.df = Titanic.df_weighted[0, 1:4]

# populating the data frame: each row is repeated Freq times
k = 0
for (i in 1:nrow(Titanic.df_weighted)) {
  if (Titanic.df_weighted[i, 5] > 0) {
    n = Titanic.df_weighted[i, 5]
    for (j in 1:n) {
      k = k + 1
      Titanic.df[k, ] = unlist(Titanic.df_weighted[i, 1:4])
    }
  }
}
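The loop above works, but base R can do the same expansion in a single vectorized step with rep(). Here is a sketch; Titanic.df2 is a hypothetical name used so as not to overwrite the Titanic.df built above:

```r
# Expand the weighted table in one step: repeat each row index Freq times
Titanic.df_weighted <- data.frame(Titanic)
idx <- rep(seq_len(nrow(Titanic.df_weighted)), times = Titanic.df_weighted$Freq)
Titanic.df2 <- Titanic.df_weighted[idx, 1:4]
rownames(Titanic.df2) <- NULL  # tidy up the duplicated row names
```

This also preserves the factor levels of the four attribute columns directly, whereas row-by-row assignment relies on coercion.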
Let's check whether we obtained the right output. We'll create a table again, and compare it to the Titanic
dataset:
table(Titanic.df) == Titanic
We omit the output here. Suffice it to say that the entries in both tables (the one we just built, and the Titanic
dataset) are identical. We now know we reconstructed the data correctly.
What we want to do now is create the two datasets that we need—the training dataset and the test dataset. We will go for a simple way of doing this instead of using stratified sampling. Random sampling is enough for our demonstration purposes here. We will set the seed to 1 so that the samples are identical on your screen and in this book:
set.seed(1)
Titanic.df$Filter = sample(c("TRAIN","TEST"),
    nrow(Titanic.df), replace = T)
TRAIN = subset(Titanic.df, Filter == "TRAIN")
TEST = subset(Titanic.df, Filter == "TEST")
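As a quick sanity check on the split, we can tabulate the Filter column. Note that the exact TRAIN/TEST counts depend on your R version, since the default sampling algorithm changed in R 3.6, so your numbers may differ from those shown in this book; only the total is guaranteed. A self-contained sketch (rebuilding the data frame with the rep() idiom for brevity):

```r
# Rebuild the unweighted data frame, then split and inspect the group sizes
Titanic.df_weighted <- data.frame(Titanic)
Titanic.df <- Titanic.df_weighted[rep(seq_len(nrow(Titanic.df_weighted)),
                                      Titanic.df_weighted$Freq), 1:4]
set.seed(1)
Titanic.df$Filter <- sample(c("TRAIN", "TEST"), nrow(Titanic.df), replace = TRUE)
table(Titanic.df$Filter)  # TEST and TRAIN counts; they sum to 2201
```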
Now let's build a classifier based on the data in the TRAIN
dataset, using the naiveBayes() function from the e1071 package, and have a look at the prior and conditional probabilities:
library(e1071)
Classify = naiveBayes(TRAIN[1:3], TRAIN[[4]])
Classify
The output, shown in the following screenshot, tells us that the prior probability of dying, for individuals in the training dataset, is 0.6873905
, whereas the prior probability of surviving is 0.3126095
. Looking at the conditional probabilities, we can see that people in first class had a higher probability of surviving; in second class, the relative probability of surviving was reduced; and third-class passengers and crew members had a markedly higher probability of dying than of surviving. Looking at the attribute Sex
, we can see that males had a higher probability of dying, which is the opposite of what is observed for females. Children had a higher probability of surviving, whereas for adults, the two probabilities were quite similar.
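These training priors should be close to the survival proportions in the full dataset, which we can compute directly from the Freq column of the weighted table:

```r
# Survival counts over the full (weighted) Titanic table
Titanic.df_weighted <- data.frame(Titanic)
counts <- tapply(Titanic.df_weighted$Freq, Titanic.df_weighted$Survived, sum)
counts                 # No = 1490, Yes = 711 (2201 people in total)
counts / sum(counts)   # about 0.677 and 0.323, close to the training priors
```

The small difference between these proportions and the priors reported by naiveBayes() is simply due to the random train/test split.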
We will now examine how well the classifier will be able to use this information to classify the other individuals—that is, those of the testing dataset:
TEST$Classified = predict(Classify, TEST[1:3])
table(TEST$Survived, TEST$Classified)
As shown on the diagonal of the output confusion table presented here, the classifier correctly predicted nonsurvival (true negatives) in 645
cases, and survival (true positives) in 168
cases. In addition, the classifier incorrectly predicted nonsurvival in 186
cases (false negatives), and survival in 60
cases (false positives):
| Actual \ Predicted | No  | Yes |
|--------------------|-----|-----|
| No                 | 645 | 60  |
| Yes                | 186 | 168 |
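From this confusion table we can derive the overall accuracy of the classifier. A minimal sketch, using the counts reported above:

```r
# Confusion matrix from the output above (rows = actual, columns = predicted)
conf <- matrix(c(645, 186, 60, 168), nrow = 2,
               dimnames = list(Actual = c("No", "Yes"),
                               Predicted = c("No", "Yes")))
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # (645 + 168) / 1059, about 0.77
```

In other words, the classifier gets roughly three out of four test cases right, with most of its errors being survivors misclassified as nonsurvivors.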