Topic coherence

Topic coherence measures the semantic consistency of the topic model results, that is, whether humans would perceive the words and their probabilities associated with topics as meaningful.

To this end, it scores each topic by measuring the degree of semantic similarity between the words most relevant to the topic. More specifically, coherence measures are based on the probability of jointly observing the words in the set W that defines a topic.

We use two measures of coherence that have been designed for LDA and shown to align with human judgment of topic quality, namely the UMass and the UCI measures.

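The UCI metric defines a word pair's score as the Pointwise Mutual Information (PMI) between the two words, computed with a smoothing factor ε, and sums this score over all distinct pairs of (top) topic words w_i, w_j in W:

$$\text{coherence}_{\text{UCI}} = \sum_{i<j} \log \frac{p(w_i, w_j) + \epsilon}{p(w_i)\, p(w_j)}$$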

The probabilities are computed from word co-occurrence frequencies in a sliding window over an external corpus such as Wikipedia, so that this metric can be thought of as an external comparison to a semantic ground truth.
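The sliding-window computation can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `uci_coherence`, the `window_size` default, and the `eps` value are assumptions chosen for the example, and the reference corpus is assumed to be a list of already tokenized documents.

```python
import math
from itertools import combinations


def uci_coherence(top_words, reference_docs, window_size=10, eps=1e-12):
    """Sum of smoothed PMI over all distinct pairs of top topic words.

    Hypothetical helper: probabilities are estimated as the share of
    sliding windows over the reference corpus that contain the word(s).
    """
    # Collect the set of words seen in each sliding window.
    windows = []
    for tokens in reference_docs:
        if len(tokens) <= window_size:
            # Short documents form a single window.
            windows.append(set(tokens))
        else:
            for start in range(len(tokens) - window_size + 1):
                windows.append(set(tokens[start:start + window_size]))
    if not windows:
        raise ValueError("reference corpus is empty")

    def prob(*words):
        # Fraction of windows in which all given words occur together.
        hits = sum(all(w in win for w in words) for win in windows)
        return hits / len(windows)

    score = 0.0
    for w_i, w_j in combinations(top_words, 2):
        p_i, p_j = prob(w_i), prob(w_j)
        if p_i == 0 or p_j == 0:
            # Assumption: skip pairs with a word unseen in the reference corpus.
            continue
        score += math.log((prob(w_i, w_j) + eps) / (p_i * p_j))
    return score
```

Words that frequently share a window yield a positive PMI contribution, whereas pairs that never co-occur contribute a large negative term whose size is governed by `eps`.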

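In contrast, the UMass metric uses document co-occurrence counts from the training corpus to compute a coherence score, where D(w_i, w_j) is the number of documents that contain both words and D(w_j) is the number of documents that contain w_j:

$$\text{coherence}_{\text{UMass}} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)}$$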

Rather than a comparison to an extrinsic ground truth, this measure reflects intrinsic coherence. Both measures have been evaluated to align well with human judgment. UMass scores are typically negative, so values closer to zero imply that a topic is more coherent; for the UCI measure, higher values indicate greater coherence.
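The UMass score, which relates the number of training documents containing both words of a pair to the number containing one of them, can be sketched as follows. This is an illustration under stated assumptions: the function name `umass_coherence` is hypothetical, `top_words` is assumed to be ordered from most to least relevant, the training corpus is assumed to be a list of tokenized documents, and `eps=1.0` is one common smoothing choice rather than a value taken from the text.

```python
import math


def umass_coherence(top_words, train_docs, eps=1.0):
    """Sum of log((D(w_i, w_j) + eps) / D(w_j)) over ordered word pairs.

    Hypothetical helper: D counts the training documents that contain
    the given word(s); w_j is always the higher-ranked word of the pair.
    """
    doc_sets = [set(tokens) for tokens in train_docs]

    def doc_freq(*words):
        # Number of documents in which all given words occur.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                # Assumption: skip pairs whose reference word never occurs.
                continue
            d_ij = doc_freq(top_words[i], top_words[j])
            score += math.log((d_ij + eps) / d_j)
    return score
```

Because the counts come from the corpus the model was trained on, no external data is required, which is what makes this an intrinsic measure.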
