Identifying Semantic Similarities within Text

Semantic similarity refers to how closely related two or more texts are to each other, that is, how much two words, sentences, or other text entities are alike. Finding similarities is useful as a classification technique and has been used by applications such as spelling and plagiarism checkers.

We can assess the similarity between two words using a number of techniques. At the simplest level, we can count how many insertion, deletion, and/or substitution operations are required to convert one word into another. The fewer operations needed, the more similar the words are; this count is commonly called the edit distance.
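As a rough illustration of counting those operations, the following is a minimal plain-Java sketch of the classic dynamic-programming edit distance. The class name EditDistanceSketch and the method name levenshtein are hypothetical, chosen only for this example; this is not the Apache Commons Text implementation, which is demonstrated in the recipes.

```java
public class EditDistanceSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn source into target.
    // (Hypothetical helper for illustration only.)
    public static int levenshtein(String source, String target) {
        int[] previous = new int[target.length() + 1];
        int[] current = new int[target.length() + 1];

        // Turning an empty string into the first j characters of target
        // takes j insertions.
        for (int j = 0; j <= target.length(); j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= source.length(); i++) {
            // Turning the first i characters of source into an empty
            // string takes i deletions.
            current[0] = i;
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[target.length()];
    }

    public static void main(String[] args) {
        // Substitute k with s, substitute e with i, insert g.
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

Keeping only two rows of the table keeps memory proportional to the length of the target word, which is usually sufficient when only the distance, and not the sequence of edits, is needed.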

At a deeper level, we can examine the meaning of words to determine their similarity. For example, the words "teaching" and "instructing" are spelled very differently, but they convey the same basic concept. Stemming and lemmatization, which reduce words to a common base form, can be useful in making these types of comparisons.

The Apache Commons Text library provides a number of classes in the org.apache.commons.text.similarity package for measuring the similarity between strings. In this chapter, we will demonstrate a number of these classes and their methods.

The algorithms that will be illustrated include the following:

Cosine similarity
Hamming distance
Levenshtein distance
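
To give a feel for the first two measures before the recipes, here is a minimal plain-Java sketch. It assumes that cosine similarity is computed over simple whitespace-separated word counts and that Hamming distance is only defined for strings of equal length. The class SimilaritySketch and its method names are hypothetical and are not the Apache Commons Text API.

```java
import java.util.HashMap;
import java.util.Map;

public class SimilaritySketch {

    // Counts how often each lower-cased, whitespace-separated token occurs.
    // (Assumption: this naive tokenization is enough for illustration.)
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-count vector.
    static double magnitude(Map<String, Integer> counts) {
        double sumOfSquares = 0.0;
        for (int count : counts.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity: dot product of the two term-count vectors
    // divided by the product of their magnitudes.
    public static double cosineSimilarity(String first, String second) {
        Map<String, Integer> a = termCounts(first);
        Map<String, Integer> b = termCounts(second);
        double dotProduct = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dotProduct += (double) entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double denominator = magnitude(a) * magnitude(b);
        // Treat a text with no tokens as having no similarity to anything.
        return denominator == 0.0 ? 0.0 : dotProduct / denominator;
    }

    // Hamming distance: number of positions at which two equal-length
    // strings differ.
    public static int hammingDistance(String first, String second) {
        if (first.length() != second.length()) {
            throw new IllegalArgumentException("Strings must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < first.length(); i++) {
            if (first.charAt(i) != second.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        System.out.println(cosineSimilarity("the cat sat", "the cat ran"));
        System.out.println(hammingDistance("karolin", "kathrin")); // prints 3
    }
}
```

Cosine similarity ranges from 0 (no shared words) to 1 (identical word distributions) for count vectors like these, whereas Hamming distance, like Levenshtein distance, grows as the strings become less alike.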

In this chapter, we will cover the following recipes:

Finding the cosine similarity of the text
Finding the distance between text
Finding differences between plaintext instances
Finding hyponyms and antonyms
