TF-IDF, as its name suggests, comprises two statistics: term frequency (TF) and inverse document frequency (IDF).
The central idea behind TF is that if a word (called a term) occurs many times in a document, then the document is more likely to be about that word. This makes intuitive sense; look at your emails: the keywords typically cluster around a central topic. But TF itself is far simpler than that. It has no notion of topics; it is just a count of how many times a term occurs in a document.
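To make this concrete, here is a minimal sketch of a term-frequency count in Go. It is purely illustrative, not the library we will use later; the tokenizer is a naive lowercased whitespace split, and a real pipeline would also strip punctuation:

```go
package main

import (
	"fmt"
	"strings"
)

// termFreq counts how many times each term occurs in a document.
// Tokenization is a naive whitespace split after lowercasing,
// which is enough to show the idea.
func termFreq(doc string) map[string]int {
	tf := make(map[string]int)
	for _, term := range strings.Fields(strings.ToLower(doc)) {
		tf[term]++
	}
	return tf
}

func main() {
	tf := termFreq("free money free offer")
	fmt.Println(tf["free"], tf["money"]) // 2 1
}
```

Note that the count alone says nothing about whether "free" is interesting; that is where IDF comes in.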
IDF, on the other hand, measures how much information a term carries across the corpus: a term that appears in nearly every document tells us little about any one of them. In the examples we've seen, note that the word Subject, with a capital S, occurs in both types of documents: spam and ham. In broad strokes, IDF is calculated as follows:
idf(t) = log(N / n_t)

where N is the total number of documents in the corpus and n_t is the number of documents in which the term t appears.
The exact formula varies and each variation has its subtleties, but all adhere to the notion of dividing the total number of documents by the number of documents that contain the term.
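As a sketch of one common variant (natural log over the raw document frequency; other variants add smoothing to avoid dividing by zero), IDF can be computed over a toy corpus like this:

```go
package main

import (
	"fmt"
	"math"
)

// idf computes log(N / n_t) for every term, where N is the total
// number of documents and n_t is the number of documents in which
// the term appears. This is one common variant; others add
// smoothing terms.
func idf(docs [][]string) map[string]float64 {
	df := make(map[string]int)
	for _, doc := range docs {
		seen := make(map[string]bool)
		for _, term := range doc {
			if !seen[term] {
				df[term]++
				seen[term] = true
			}
		}
	}
	out := make(map[string]float64)
	n := float64(len(docs))
	for term, count := range df {
		out[term] = math.Log(n / float64(count))
	}
	return out
}

func main() {
	docs := [][]string{
		{"subject", "free", "money"},
		{"subject", "meeting", "tomorrow"},
	}
	scores := idf(docs)
	// "subject" appears in both documents, so its IDF is
	// log(2/2) = 0: it carries no discriminating power.
	fmt.Println(scores["subject"], scores["free"] > 0)
}
```

This mirrors the Subject observation above: a header word present in every email, spam or ham, gets an IDF of zero.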
For the purposes of our project, we will be using the tf-idf library from go-nlp, which is a repository of NLP-related libraries for Go. To install it, simply run the following command:
go get -u github.com/go-nlp/tfidf
It is an extremely well-tested library, with 100% test coverage.
When used together, TF and IDF form a useful weighting scheme for calculating the importance of a word in a document. It may seem simple, but it is very powerful, especially when used in a probabilistic context.
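The combination is simply the product of the two statistics: a term scores highly only when it is frequent within the document and rare across the corpus. A minimal sketch, assuming the TF counts and IDF values have already been computed (the hard-coded values below are illustrative, not real corpus statistics):

```go
package main

import "fmt"

// tfidfScore combines precomputed term frequencies and IDF values:
// score(t) = tf(t) * idf(t).
func tfidfScore(tf map[string]int, idf map[string]float64) map[string]float64 {
	out := make(map[string]float64)
	for term, count := range tf {
		out[term] = float64(count) * idf[term]
	}
	return out
}

func main() {
	// Illustrative values: "subject" occurs everywhere (idf 0),
	// "free" is rarer (idf 0.69, roughly log 2).
	tf := map[string]int{"subject": 1, "free": 2}
	idf := map[string]float64{"subject": 0.0, "free": 0.69}
	scores := tfidfScore(tf, idf)
	fmt.Println(scores["subject"], scores["free"]) // 0 1.38
}
```

Even a term mentioned only once gets a higher weight than a ubiquitous term, because the ubiquitous term's IDF drags its score to zero.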
Now we are ready to talk about the basics of the Naive Bayes algorithm. But first I'd like to further emphasize certain intuitions of Bayes' theorem.