For this example of working with Naïve Bayes in R, we are going to use the Titanic dataset. The classification problem is to predict whether or not individuals died in the Titanic accident. We will create a training dataset and a testing dataset (in order to test how well the classifier performs).
The first thing we need to know is how to convert the Titanic dataset (which is of class table) to a data frame:
Titanic.df_weighted = data.frame(Titanic)
Let's have a look at the dataset:
|    | Class | Sex    | Age   | Survived | Freq |
|----|-------|--------|-------|----------|------|
| 1  | 1st   | Male   | Child | No       | 0    |
| 2  | 2nd   | Male   | Child | No       | 0    |
| 3  | 3rd   | Male   | Child | No       | 35   |
| 4  | Crew  | Male   | Child | No       | 0    |
| 5  | 1st   | Female | Child | No       | 0    |
| 6  | 2nd   | Female | Child | No       | 0    |
| 7  | 3rd   | Female | Child | No       | 17   |
| 8  | Crew  | Female | Child | No       | 0    |
| 9  | 1st   | Male   | Adult | No       | 118  |
| 10 | 2nd   | Male   | Adult | No       | 154  |
| 11 | 3rd   | Male   | Adult | No       | 387  |
| 12 | Crew  | Male   | Adult | No       | 670  |
| 13 | 1st   | Female | Adult | No       | 4    |
| 14 | 2nd   | Female | Adult | No       | 13   |
| 15 | 3rd   | Female | Adult | No       | 89   |
| 16 | Crew  | Female | Adult | No       | 3    |
| 17 | 1st   | Male   | Child | Yes      | 5    |
| 18 | 2nd   | Male   | Child | Yes      | 11   |
| 19 | 3rd   | Male   | Child | Yes      | 13   |
| 20 | Crew  | Male   | Child | Yes      | 0    |
| 21 | 1st   | Female | Child | Yes      | 1    |
| 22 | 2nd   | Female | Child | Yes      | 13   |
| 23 | 3rd   | Female | Child | Yes      | 14   |
| 24 | Crew  | Female | Child | Yes      | 0    |
| 25 | 1st   | Male   | Adult | Yes      | 57   |
| 26 | 2nd   | Male   | Adult | Yes      | 14   |
| 27 | 3rd   | Male   | Adult | Yes      | 75   |
| 28 | Crew  | Male   | Adult | Yes      | 192  |
| 29 | 1st   | Female | Adult | Yes      | 140  |
| 30 | 2nd   | Female | Adult | Yes      | 80   |
| 31 | 3rd   | Female | Adult | Yes      | 76   |
| 32 | Crew  | Female | Adult | Yes      | 20   |
As can be seen, there are five attributes: the class of the individuals (first class, second class, third class, or crew), their sex and age, whether they survived, and the frequency of cases in each cell. In order to create our training and testing datasets, we first need to reconstruct the full dataset—that is, without the frequency weightings. We'll write some ad hoc code to do just this:
Titanic.df_weighted = data.frame(Titanic)

# creating an empty data frame to be populated
Titanic.df = Titanic.df_weighted[0, 1:4]

# populating the data frame: each row is repeated Freq times
k = 0
for (i in 1:nrow(Titanic.df_weighted)) {
  if (Titanic.df_weighted[i, 5] > 0) {
    n = Titanic.df_weighted[i, 5]
    for (j in 1:n) {
      k = k + 1
      Titanic.df[k, ] = unlist(Titanic.df_weighted[i, 1:4])
    }
  }
}
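The loop above works, but base R can do the same expansion in a single vectorized step with rep(). Here is a sketch; Titanic.df2 is a hypothetical name used so as not to overwrite the Titanic.df built above:

```r
# Expand the weighted table in one step: repeat each row index Freq times
Titanic.df_weighted <- data.frame(Titanic)
idx <- rep(seq_len(nrow(Titanic.df_weighted)), times = Titanic.df_weighted$Freq)
Titanic.df2 <- Titanic.df_weighted[idx, 1:4]
rownames(Titanic.df2) <- NULL  # tidy up the duplicated row names
```

This also preserves the factor levels of the four attribute columns directly, whereas row-by-row assignment relies on coercion.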
Let's check whether we obtained the right output. We'll create a table again, and compare it to the Titanic
dataset:
table(Titanic.df) == Titanic
We omit the output here. Suffice it to say that the entries in both tables (the one we just built, and the Titanic
dataset) are identical. We now know we reconstructed the data correctly.
What we want to do now is create the two datasets that we need—the training dataset and the test dataset. We will go for a simple way of doing this instead of using stratified sampling. Random sampling is enough for our demonstration purposes here. We will set the seed to 1 so that the samples are identical on your screen and in this book:
set.seed(1)
Titanic.df$Filter = sample(c("TRAIN","TEST"),
    nrow(Titanic.df), replace = T)
TRAIN = subset(Titanic.df, Filter == "TRAIN")
TEST = subset(Titanic.df, Filter == "TEST")
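As a quick sanity check on the split, we can tabulate the Filter column. Note that the exact TRAIN/TEST counts depend on your R version, since the default sampling algorithm changed in R 3.6, so your numbers may differ from those shown in this book; only the total is guaranteed. A self-contained sketch (rebuilding the data frame with the rep() idiom for brevity):

```r
# Rebuild the unweighted data frame, then split and inspect the group sizes
Titanic.df_weighted <- data.frame(Titanic)
Titanic.df <- Titanic.df_weighted[rep(seq_len(nrow(Titanic.df_weighted)),
                                      Titanic.df_weighted$Freq), 1:4]
set.seed(1)
Titanic.df$Filter <- sample(c("TRAIN", "TEST"), nrow(Titanic.df), replace = TRUE)
table(Titanic.df$Filter)  # TEST and TRAIN counts; they sum to 2201
```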
Now let's build a classifier based on the data in the TRAIN
dataset, using the naiveBayes() function from the e1071 package, and have a look at the prior and conditional probabilities:
library(e1071)
Classify = naiveBayes(TRAIN[1:3], TRAIN[[4]])
Classify
The output, shown in the following screenshot, tells us that the prior probability of dying, for individuals in the training dataset, is 0.6873905
, whereas the prior probability of surviving is 0.3126095
. Looking at the conditional probabilities, we can see that people in first class had a higher probability of surviving; in second class, the relative probability of surviving was reduced; and third-class passengers and crew members had a markedly higher probability of dying than of surviving. Looking at the attribute Sex
, we can see that males had a higher probability of dying, which is the opposite of what is observed for females. Children had a higher probability of surviving, whereas for adults, the two probabilities were quite similar.
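These training priors should be close to the survival proportions in the full dataset, which we can compute directly from the Freq column of the weighted table:

```r
# Survival counts over the full (weighted) Titanic table
Titanic.df_weighted <- data.frame(Titanic)
counts <- tapply(Titanic.df_weighted$Freq, Titanic.df_weighted$Survived, sum)
counts                 # No = 1490, Yes = 711 (2201 people in total)
counts / sum(counts)   # about 0.677 and 0.323, close to the training priors
```

The small difference between these proportions and the priors reported by naiveBayes() is simply due to the random train/test split.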
We will now examine how well the classifier will be able to use this information to classify the other individuals—that is, those of the testing dataset:
TEST$Classified = predict(Classify, TEST[1:3])
table(TEST$Survived, TEST$Classified)
As shown on the diagonal of the output confusion table presented here, the classifier correctly predicted nonsurvival (true negatives) in 645
cases, and survival (true positives) in 168
cases. In addition, the classifier incorrectly predicted nonsurvival in 186
cases (false negatives), and survival in 60
cases (false positives):
| Actual \ Predicted | No  | Yes |
|--------------------|-----|-----|
| No                 | 645 | 60  |
| Yes                | 186 | 168 |
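From this confusion table we can derive the overall accuracy of the classifier. A minimal sketch, using the counts reported above:

```r
# Confusion matrix from the output above (rows = actual, columns = predicted)
conf <- matrix(c(645, 186, 60, 168), nrow = 2,
               dimnames = list(Actual = c("No", "Yes"),
                               Predicted = c("No", "Yes")))
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # (645 + 168) / 1059, about 0.77
```

In other words, the classifier gets roughly three out of four test cases right, with most of its errors being survivors misclassified as nonsurvivors.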