Identifying Semantic Similarities within Text

Semantic similarity refers to how closely related two or more texts are to each other, that is, how much two words, sentences, or other text entities are alike. Finding similarities is useful as a classification technique and is used by applications such as spelling and plagiarism checkers.

We can assess the similarity between two words using a number of techniques. At the simplest level, we can count how many insertion, deletion, and/or substitution operations are required to convert one word into another.
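The edit-counting idea described above can be sketched in plain Java. This is a minimal illustration of the classic dynamic-programming approach, not the Apache Commons Text implementation; the class name EditDistanceSketch and the method name levenshtein are hypothetical and were chosen only for this example.

```java
import java.util.Objects;

/** Illustrative helper (hypothetical name, not part of Apache Commons Text). */
final class EditDistanceSketch {

    private EditDistanceSketch() { }

    /**
     * Returns the minimum number of single-character insertions, deletions,
     * and substitutions needed to turn source into target.
     */
    static int levenshtein(String source, String target) {
        Objects.requireNonNull(source, "source");
        Objects.requireNonNull(target, "target");

        // previous[j] is the distance between the first i-1 characters of
        // source and the first j characters of target; current[j] is the
        // same quantity for the first i characters of source.
        int[] previous = new int[target.length() + 1];
        int[] current = new int[target.length() + 1];

        // Converting an empty prefix into j characters takes j insertions.
        for (int j = 0; j <= target.length(); j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= source.length(); i++) {
            // Converting i characters into an empty string takes i deletions.
            current[0] = i;
            for (int j = 1; j <= target.length(); j++) {
                int substitutionCost =
                        source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + substitutionCost;
                current[j] = Math.min(Math.min(deletion, insertion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[target.length()];
    }
}
```

For instance, EditDistanceSketch.levenshtein("kitten", "sitting") evaluates to 3: two substitutions and one insertion. Keeping only two rows of the table holds memory use proportional to the length of the target word.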

At a deeper level, we can examine the meaning of words to determine their similarity. For example, the words teaching and instructing are spelled very differently, but they convey the same basic concept. Stemming and lemmatization can be useful in making these types of comparisons.

The Apache Commons Text library provides a number of classes in the org.apache.commons.text.similarity package that measure the similarity between strings. In this chapter, we will demonstrate a number of these classes and methods.

The algorithms that will be illustrated include the following:

  • Cosine similarity
  • Hamming distance
  • Levenshtein distance
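To give a feel for the first two algorithms in the list before the recipes use the library classes, here is a minimal plain-Java sketch. It is not the Commons Text code. The class SimilaritySketch and its methods are hypothetical names, and the whitespace tokenizer is a simplifying assumption made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative helper (hypothetical name, not part of Apache Commons Text). */
final class SimilaritySketch {

    private SimilaritySketch() { }

    /** Counts the positions at which two equal-length strings differ. */
    static int hamming(String left, String right) {
        if (left.length() != right.length()) {
            throw new IllegalArgumentException("Strings must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < left.length(); i++) {
            if (left.charAt(i) != right.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Builds a word-count vector; splitting on whitespace is an assumption. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine of the angle between two word-count vectors (0.0 if either is empty). */
    static double cosine(Map<String, Integer> left, Map<String, Integer> right) {
        double dotProduct = 0.0;
        for (Map.Entry<String, Integer> entry : left.entrySet()) {
            Integer other = right.get(entry.getKey());
            if (other != null) {
                dotProduct += entry.getValue() * other;
            }
        }
        double magnitudes = magnitude(left) * magnitude(right);
        return magnitudes == 0.0 ? 0.0 : dotProduct / magnitudes;
    }

    /** Euclidean length of a word-count vector. */
    private static double magnitude(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

With this sketch, hamming("karolin", "kathrin") evaluates to 3, and cosine(termFrequencies("a b"), termFrequencies("a c")) evaluates to 0.5, because the two texts share one of their two words. Hamming distance is defined only for strings of equal length, which is why the sketch rejects other inputs.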

In this chapter, we will cover the following recipes:

  • Finding the cosine similarity of the text
  • Finding the distance between text
  • Finding differences between plaintext instances
  • Finding hyponyms and antonyms