Features

We've seen from the previous examples that we need features, such as whether a fruit can be green, yellow, or red, or whether it's tropical. Now we turn to the project at hand. What should the features be?

Class   ???   ???   ???
Spam
Ham

What makes up an email? Words do. So it is appropriate to treat the appearance of each word as a feature. We can take this further: using the intuition we developed previously with TF-IDF, we can use the frequency of each word within each document type instead. Rather than recording a 1 when a word exists, we count the total number of times the word occurs across all documents of that type.

The table would look something like this:

Class   Has XXX   Has Site   Has Free   Has Linguistics   ...
Spam    200       189        70         2                 ...
Ham     1         2          55         120               ...
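
The per-class word counts in a table like this can be sketched in a few lines of Python. This is a minimal illustration, not the approach the project itself takes: the function name count_words_by_class, the toy emails, and the whitespace-based tokenization are all assumptions made for the example.

```python
from collections import Counter


def count_words_by_class(emails):
    """Count how often each word occurs in each class.

    emails is an iterable of (label, text) pairs.
    Returns a dict mapping each label to a Counter of word frequencies.
    """
    counts = {}
    for label, text in emails:
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        tokens = text.lower().split()
        counts.setdefault(label, Counter()).update(tokens)
    return counts


# Made-up toy corpus, for illustration only.
emails = [
    ("spam", "free site free offer"),
    ("spam", "visit our site"),
    ("ham", "linguistics conference site"),
    ("ham", "free linguistics workshop"),
]

counts = count_words_by_class(emails)

# Each row of the table corresponds to one Counter; each column to one word.
for label in ("spam", "ham"):
    row = [counts[label][word] for word in ("site", "free", "linguistics")]
    print(label, row)
```

A Counter returns 0 for a word it has never seen, so a word that appears in only one class still yields a complete row for the other.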

 

That also means that there are many features. We could certainly try to enumerate all possible calculations, but doing so would be tedious and computationally intensive. Instead, we can be clever about it. Specifically, we will use another definition of conditional probability to reduce the number of computations that need to be done.
