Working with Naïve Bayes in R

For this example of working with Naïve Bayes in R, we are going to use the Titanic dataset. The classification problem is to predict whether or not individuals survived the Titanic disaster. We will create a training dataset and a testing dataset (the latter in order to test how well the classifier performs).

The first thing we need to know is how to convert the Titanic dataset (of class table) to a data frame:

Titanic.df_weighted = data.frame(Titanic)

Let's have a look at the dataset:

 

   Class    Sex   Age Survived Freq
1    1st   Male Child       No    0
2    2nd   Male Child       No    0
3    3rd   Male Child       No   35
4   Crew   Male Child       No    0
5    1st Female Child       No    0
6    2nd Female Child       No    0
7    3rd Female Child       No   17
8   Crew Female Child       No    0
9    1st   Male Adult       No  118
10   2nd   Male Adult       No  154
11   3rd   Male Adult       No  387
12  Crew   Male Adult       No  670
13   1st Female Adult       No    4
14   2nd Female Adult       No   13
15   3rd Female Adult       No   89
16  Crew Female Adult       No    3
17   1st   Male Child      Yes    5
18   2nd   Male Child      Yes   11
19   3rd   Male Child      Yes   13
20  Crew   Male Child      Yes    0
21   1st Female Child      Yes    1
22   2nd Female Child      Yes   13
23   3rd Female Child      Yes   14
24  Crew Female Child      Yes    0
25   1st   Male Adult      Yes   57
26   2nd   Male Adult      Yes   14
27   3rd   Male Adult      Yes   75
28  Crew   Male Adult      Yes  192
29   1st Female Adult      Yes  140
30   2nd Female Adult      Yes   80
31   3rd Female Adult      Yes   76
32  Crew Female Adult      Yes   20

As can be seen, there are five attributes: the class of the individuals (first class, second class, third class, or crew), their sex and age, whether they survived, and the frequency of cases in each cell. In order to create our training and testing datasets, we first need to reconstruct the full dataset—that is, without the frequency weightings. We'll write some ad hoc code to do just this:

Titanic.df_weighted = data.frame(Titanic)
# creating an empty data frame to be populated
Titanic.df = Titanic.df_weighted[0, 1:4]

# populating the data frame: each row of the weighted data
# frame is repeated Freq times
k = 0
for (i in 1:nrow(Titanic.df_weighted)) {
   if (Titanic.df_weighted[i, 5] > 0) {
      n = Titanic.df_weighted[i, 5]
      for (j in 1:n) {
         k = k + 1
         Titanic.df[k, ] = unlist(Titanic.df_weighted[i, 1:4])
      }
   }
}

Let's check whether we obtained the right output. We'll create a table again, and compare it to the Titanic dataset:

table(Titanic.df) == Titanic

We omit the output here. Suffice to say that the entries in both tables (the one we just built, and the Titanic dataset) are identical. We now know we computed the data correctly.
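As an aside, the same expansion can be performed in a single vectorized step with rep(), which repeats each row index according to its Freq value. This is just a sketch for comparison with the loop above; Titanic.df2 is a name introduced here for illustration:

```r
# rebuild the individual-level data frame without an explicit loop
Titanic.df_weighted <- data.frame(Titanic)
# repeat each row index Freq times, then drop the Freq column;
# rows with Freq == 0 are dropped automatically
Titanic.df2 <- Titanic.df_weighted[rep(1:nrow(Titanic.df_weighted),
                                       Titanic.df_weighted$Freq), 1:4]
rownames(Titanic.df2) <- NULL
# the rebuilt table should match the original dataset exactly
all(table(Titanic.df2) == Titanic)   # TRUE
```

Either approach yields one row per passenger; the vectorized version is simply faster and shorter.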

What we want to do now is create the two datasets that we need—the training dataset and the test dataset. We will go for a simple way of doing this instead of using stratified sampling. Random sampling is enough for our demonstration purposes here. We will set the seed to 1 so that the samples are identical on your screen and in this book:

set.seed(1)
Titanic.df$Filter = sample(c("TRAIN","TEST"),
   nrow(Titanic.df), replace = TRUE)
TRAIN = subset(Titanic.df, Filter == "TRAIN")
TEST = subset(Titanic.df, Filter == "TEST")

Now let's build a classifier based on the data in the TRAIN dataset, and have a look at the prior and conditional probabilities:

library(e1071)   # provides the naiveBayes() function
Classify = naiveBayes(TRAIN[1:3], TRAIN[[4]])
Classify

The output, shown in the following screenshot, reports that the prior probability (hereafter, simply probability) of dying, for individuals in the training dataset, is 0.6873905, whereas the probability of surviving is 0.3126095. Looking at the conditional probabilities, we can see that people in first class had a higher probability of surviving; in second class, the probability of surviving is lower; and third-class passengers and crew members had a high probability of dying rather than surviving. Looking at the attribute Sex, males had a higher probability of dying, the opposite of what is observed for females. Children had a higher probability of surviving, whereas for adults the two probabilities were quite similar.


The classifier for our second example
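To see how the classifier combines these numbers, recall that naive Bayes scores each class as its prior probability times the product of the conditional probabilities of the observed attribute values, then normalizes the scores so they sum to one. The following sketch uses made-up round figures (not the exact values from the output above) for a hypothetical adult male first-class passenger:

```r
# illustrative values only -- not the exact output of naiveBayes()
prior   <- c(No = 0.69, Yes = 0.31)            # P(Survived)
p_class <- c(No = 0.08, Yes = 0.28)            # P(Class = 1st | Survived)
p_sex   <- c(No = 0.92, Yes = 0.52)            # P(Sex = Male | Survived)
p_age   <- c(No = 0.97, Yes = 0.92)            # P(Age = Adult | Survived)

score     <- prior * p_class * p_sex * p_age   # unnormalized class scores
posterior <- score / sum(score)                # normalize to sum to 1
round(posterior, 3)
```

With these toy figures the No class still wins, but only narrowly: a low conditional probability for one attribute (here, first class among nonsurvivors) can almost offset a large prior.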

We will now examine how well the classifier will be able to use this information to classify the other individuals—that is, those of the testing dataset:

TEST$Classified = predict(Classify,TEST[1:3])
table(TEST$Survived,TEST$Classified)

As shown on the diagonal of the confusion table presented here, the classifier correctly predicted nonsurvival (true negatives) in 645 cases, and survival (true positives) in 168 cases. In addition, the classifier incorrectly predicted nonsurvival in 186 cases (false negatives), and survival in 60 cases (false positives):

 

          Predicted
Actual     No  Yes
  No      645   60
  Yes     186  168
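Putting these four counts together gives the overall accuracy of the classifier on the testing dataset: (645 + 168) correct classifications out of 1,059 cases, roughly 76.8 percent. A quick computation from the table:

```r
# confusion matrix copied from the table above (rows = actual, cols = predicted)
confusion <- matrix(c(645, 186, 60, 168), nrow = 2,
                    dimnames = list(Actual = c("No", "Yes"),
                                    Predicted = c("No", "Yes")))
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 3)   # 0.768
```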
