TF-IDF

As we saw, the limitation of count vectorization is that a highly frequent word can dominate the representation. The idea, therefore, is to penalize words that occur in most of the documents by assigning them a lower weight, while increasing the weight of words that appear in only a subset of documents. This is the principle upon which TF-IDF works.

TF-IDF is a measure of how important a term is with respect to a document and the entire corpus (collection of documents):

TF-IDF(term) = TF(term) * IDF(term)

Term frequency (TF) is the frequency of a word in a document relative to the total number of words in that document. For example, if a document contains 1,000 words and the word "NLP" appears 50 times in it, we compute the TF as follows:

TF("NLP") = 50/1000 = 0.05

Hence, we can conclude the following:

TF(term) = (Number of times the term appears in the document) / (Total number of terms in the document)

In the preceding example, comprising three documents, N1, N2, and N3, if the TF of the term "count" in document N1 needs to be found, it works out as follows:

TF("count") in N1 = 2/(2+1+1+1) = 2/5 = 0.4
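This calculation can be sketched in a few lines of Python. The token list for N1 below is a hypothetical reconstruction chosen only to match the counts in the example ("count" appears twice out of five tokens):

```python
# Term frequency: occurrences of a term divided by the total number
# of terms in the same document.
def term_frequency(term, tokens):
    return tokens.count(term) / len(tokens)

# Hypothetical tokenization of document N1: "count" appears 2 times
# out of 5 tokens, matching the worked example.
n1_tokens = ["count", "count", "vector", "is", "useful"]

print(term_frequency("count", n1_tokens))  # 2/5 = 0.4
```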

It indicates the contribution of a word to the document.

IDF, on the other hand, is an indicator of how significant the term is for the entire corpus:

IDF("count") = log(Total number of documents/Number of documents containing the term "count")

IDF("count") = log(3/2) ≈ 0.17 (using a base-10 logarithm)

Now, let's calculate the IDF for the term "vector":

IDF("vector") = log(3/3) = 0
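These IDF values can be reproduced with a short sketch, assuming a base-10 logarithm (which matches the 0.17 figure) and hypothetical token lists chosen so that "count" occurs in two of the three documents and "vector" in all three:

```python
import math

# Inverse document frequency: log of (total number of documents /
# number of documents containing the term), base 10 here.
def inverse_document_frequency(term, documents):
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / containing)

# Hypothetical corpus: "count" occurs in 2 of 3 documents,
# "vector" in all 3.
docs = [
    ["count", "count", "vector", "is", "useful"],  # N1
    ["count", "vector"],                           # N2
    ["vector"],                                    # N3
]

print(inverse_document_frequency("count", docs))   # log10(3/2) ≈ 0.176
print(inverse_document_frequency("vector", docs))  # log10(3/3) = 0.0
```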

How do we interpret this? It implies that if the same word has appeared in all of the documents, then it is not relevant to a particular document. But, if the word appears only in a subset of documents, this means that it holds some relevance to those documents in which it exists.

Let's calculate the TF-IDF for "count" and "vector", as follows:

TF-IDF("count") for document N1 = TF("count") * IDF("count") = 0.4 * 0.17 = 0.068

TF-IDF("vector") for document N1 = TF("vector") * IDF("vector") = (1/5) * 0 = 0
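Combining the two factors, here is a self-contained sketch of the full calculation. The token lists are a hypothetical reconstruction that matches the counts in the worked example, and a base-10 logarithm is assumed (the chapter truncates log(3/2) to 0.17, so the exact product differs slightly from 0.068):

```python
import math

def tf(term, tokens):
    # Fraction of the document's tokens that are this term.
    return tokens.count(term) / len(tokens)

def idf(term, documents):
    # Base-10 log of (total documents / documents containing the term).
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / containing)

def tf_idf(term, tokens, documents):
    return tf(term, tokens) * idf(term, documents)

# Hypothetical corpus matching the example: N1 has 5 tokens, with
# "count" twice and "vector" once; "count" appears in 2 of 3
# documents, "vector" in all 3.
docs = [
    ["count", "count", "vector", "is", "useful"],  # N1
    ["count", "vector"],                           # N2
    ["vector"],                                    # N3
]

print(tf_idf("count", docs[0], docs))   # 0.4 * log10(3/2) ≈ 0.07
print(tf_idf("vector", docs[0], docs))  # 0.2 * 0 = 0.0
```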

It is quite evident that, since a higher weight is assigned to "count" in N1, that term is more important to N1 than "vector". The higher the weight, the rarer the term; the smaller the weight, the more common the term. Search engines make use of TF-IDF to retrieve the documents relevant to a query.

Now, we will look at how to execute the count vectorizer and TF-IDF vectorizer in Python.
