TF-IDF 

TF-IDF, as its name suggests, is composed of two statistics: term frequency (TF) and inverse document frequency (IDF).

The central thesis of TF is that if a word (called a term) occurs many times in a document, the document revolves more around that word. This makes sense; look at your emails. The keywords typically revolve around a central topic. But TF is a lot simpler than that. There is no notion of topics. It's just a count of how many times a word occurs in a document.
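To make the "just a count" idea concrete, here is a minimal sketch of term frequency in plain Go. The function name termFrequency and the tokenization (lowercasing and splitting on whitespace) are assumptions made for illustration; they are not taken from the go-nlp library.

```go
package main

import (
	"fmt"
	"strings"
)

// termFrequency counts how many times each term occurs in a document.
// Assumption: a term is a lowercased, whitespace-separated token.
func termFrequency(doc string) map[string]int {
	tf := make(map[string]int)
	for _, term := range strings.Fields(strings.ToLower(doc)) {
		tf[term]++
	}
	return tf
}

func main() {
	tf := termFrequency("Buy cheap pills now buy now")
	fmt.Println(tf["buy"], tf["now"], tf["cheap"]) // prints "2 2 1"
}
```

Note that nothing in the count knows about topics; a term that occurs twice simply scores 2.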

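IDF, on the other hand, is a statistic that determines how important a term is to a document, relative to the rest of the documents in the collection. In the examples we've seen, do note that the word Subject, with a capital S, occurs once in both types of documents: spam and ham. A term that appears in every document does little to tell the documents apart, so it should count for less. In broad strokes, IDF is calculated by the following: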

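$$\mathrm{IDF}(t) = \frac{N}{\mathrm{df}(t)}$$

Here, N is the total number of documents and df(t) is the number of documents in which the term t occurs.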

The exact formula varies and there are subtleties to each variation, but all adhere to the notion of dividing the total number of documents by the frequency of the term.

For the purposes of our project, we will be using the tf-idf library from go-nlp, which is a repository of NLP-related libraries for Go. To install it, simply run the following command:

go get -u github.com/go-nlp/tfidf

It is an extremely well-tested library, with 100% test coverage.

When used together, the product TF × IDF represents a useful weighting scheme for calculating the importance of a word in a document. It may seem simple, but it is very powerful, especially when used in the context of probability.

Do note that TF-IDF cannot strictly be interpreted as a probability. Some theoretical nastiness presents itself when IDF is strictly interpreted as a probability. Hence, in the context of this project, we will be treating TF-IDF as a sort of weighting scheme applied to a probability.

Now we are ready to talk about the basics of the Naive Bayes algorithm. But first I'd like to further emphasize certain intuitions of Bayes' theorem.
