VSM with Lucene

The vector space model (VSM), also called the term vector model, is an algebraic model that represents text documents as vectors of identifiers such as index terms. It is used in information filtering, information retrieval, indexing, and relevance ranking.

In VSM, weights associated with the terms are calculated based on the following two numbers:

  • Term frequency (TF): How many times a particular term appears in the document
  • Inverse document frequency (IDF): How rare a term is across the documents in a collection; a term that appears in fewer documents receives a higher weight
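
To make the two numbers concrete, here is a minimal sketch of one common way to combine them into a single weight. It assumes raw counts for TF and the textbook IDF of log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name tf_idf and the sample documents are illustrative only, and Lucene's own scoring formulas differ in detail.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: TF is the raw term count and IDF is log(N / df).
    This is a textbook variant, not Lucene's exact formula.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical three-document collection, already split into terms.
docs = [
    "lucene is a search library".split(),
    "lucene builds an inverted index".split(),
    "a vector space model ranks documents".split(),
]
vectors = tf_idf(docs)
```

In this sample, "search" occurs in only one document while "lucene" occurs in two, so "search" gets the larger weight in the first document's vector even though both terms appear there once.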

VSM is implemented in many open source packages, including Apache Lucene, Elasticsearch, Gensim, NumPy, Weka, word2vec, and Konstanz Information Miner (KNIME).
