TF-IDF

TF-IDF stands for term frequency-inverse document frequency, a measure of how important a word is to a document within a collection of documents. It is used extensively in information retrieval to weight the words in a document. The TF-IDF value increases in proportion to the number of times a word occurs in the document (its term frequency) and is offset by how common the word is across the collection. It consists of two key elements: the term frequency and the inverse document frequency.

TF is the term frequency, which is the frequency of a word/term in the document.
For a term t, TF measures the number of times t occurs in document d. In Spark, TF is implemented using hashing, where each term is mapped to an index by applying a hash function.

IDF is the inverse document frequency, which measures how much information a term provides, that is, whether the term is common or rare across the documents in the collection. IDF is a log-scaled inverse function of the number of documents containing the term:

IDF = log(Total Documents / Documents Containing the Term)

Once we have TF and IDF, we can compute the TF-IDF value by multiplying the TF and IDF:

TF-IDF = TF * IDF
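
As a quick worked example (the numbers are illustrative and a natural logarithm is assumed): if a term appears 3 times in a document, TF = 3; if the collection has 100 documents and 10 of them contain the term, IDF = log(100/10) ≈ 2.30, so TF-IDF ≈ 3 * 2.30 ≈ 6.9. A term that appears in every document gets IDF = log(100/100) = 0, so its TF-IDF is 0 no matter how often it occurs in the document.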

We will now look at how we can generate TF using the HashingTF Transformer in Spark ML.
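
The following is a minimal sketch of such a pipeline in Scala using the Spark ML feature-extraction API; the sample sentences, column names, and number of features are illustrative assumptions rather than values taken from this section.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val spark = SparkSession.builder.appName("TFIDFExample").getOrCreate()

// Illustrative input data: a label and a sentence per row
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Spark makes big data processing simple"),
  (1.0, "TF-IDF weighs terms by frequency and rarity")
)).toDF("label", "sentence")

// Split each sentence into words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// HashingTF maps each term to an index via a hash function and counts term frequencies
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1000)  // assumed vector size for the example
val featurizedData = hashingTF.transform(wordsData)

// IDF is an Estimator: fit it on the corpus to learn document frequencies,
// then rescale the raw term-frequency vectors into TF-IDF vectors
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show(truncate = false)

Note that HashingTF is a Transformer and can be applied directly, whereas IDF is an Estimator that must first be fit on the whole corpus before it can rescale the term-frequency vectors.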
