Chapter 10. Topic Modeling

In the last chapter we covered some of the techniques used to extract information from text. These techniques can be complicated to implement and may also be slow. If the application requires the extracted information to be presented to users, these techniques are great. If we are looking to extract information as part of an intermediate processing step, for instance building features for a classifier, then we don’t need to extract readable information. Essentially, we want to reduce the dimensionality of our data. This is where distributional semantics comes in.

Distributional semantics is the study of using the statistical distributions of elements of language to characterize similarities between documents, speech acts, or elements. The idea for this field comes from John R. Firth, a linguist working in the first half of the 20th century. He noted how semantics depends on context, and coined the oft-repeated quote:

You shall know a word by the company it keeps.

The idea is that you can represent a word by a distribution over the contexts it appears in. This seems to make sense, but it is a bit nebulous. What is the distribution over the contexts for a given word? In text-based NLP, we generally look at co-occurrences with other words. We can represent the contexts a word appears in by looking at the words that appear in those contexts. The approaches to this are generally discussed in terms of linear algebra.
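
To make this concrete, here is a minimal sketch of building a word-by-word co-occurrence matrix with a fixed context window. The tiny corpus and the window size are hypothetical, chosen only to illustrate representing a word by the words around it.

from collections import defaultdict, Counter

# a hypothetical, tiny tokenized corpus
corpus = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'sat', 'on', 'the', 'rug'],
]

window = 2  # assumed context window size
cooccurrence = defaultdict(Counter)

for sentence in corpus:
    for i, word in enumerate(sentence):
        # every word within `window` positions of `word` counts as context
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooccurrence[word][sentence[j]] += 1

# "sat" is characterized by the distribution of its neighbors
print(cooccurrence['sat'])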

We have a document-term matrix where the rows are documents and the columns are terms. The values can be binary, representing word presence; counts of the number of times a term occurs; or TF.IDF values. We will be using TF.IDF values in this chapter. Once we have such a matrix, we will want to map our documents to a space with fewer dimensions. This clusters the documents into topics.

An important caveat is that, despite the name, distributional semantics does not actually look at the semantics of a word. The semantics of a word is a representational relationship between an element of language and the world. The distributions that we look at characterize a word or a document by the words that occur in its context. There is a connection between context and semantics, as Firth pointed out, but without access to the world we don’t actually get the semantics of the element.

K-Means

Let’s try a classic clustering technique first: K-Means. Say we have a number of data points in a vector space. We can pick K points in the vector space, called centroids, and assign each data point to the closest centroid. We want to find the K centroids that minimize the distance between the data points and their assigned centroids.
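
To make the algorithm concrete, here is a minimal NumPy sketch of Lloyd’s algorithm, the iterative procedure usually behind K-Means. The random initialization and the fixed number of iterations are simplifying assumptions; scikit-learn’s KMeans, which we use below, handles initialization and convergence more carefully.

import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    rng = np.random.RandomState(seed)
    # initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments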

In our situation, we have documents in a vector space defined by the TF.IDF values. When we find our K centroids, we can say that each centroid represents a topic. Let’s look at an example.

First, let’s build our data set.

from collections import defaultdict, Counter, OrderedDict

import numpy as np
import pandas as pd
import scipy.sparse as sparse

from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

from nltk.corpus import stopwords
from nltk.corpus import brown
en_stopwords = set(stopwords.words('english'))
def detokenize(sentence):
    text = ''
    for token in sentence:
        if text and any(c.isalnum() for c in token):
            text += ' '
        text += token
    return text

We will want to remove punctuation and stop words, so our algorithms just need to use the “useful” words.

def process(sentence):
    terms = []
    for term in sentence:
        term = term.lower()
        if term not in en_stopwords and term.isalnum():
            terms.append(term)
    return terms

Let’s gather our docs. Our documents will be lists of lists of terms.

docs = OrderedDict()

for fid in brown.fileids():
    docs[fid] = brown.sents(fid)

Now we will construct indexes.

ix2doc = list(docs)
doc2ix = {fid: i for i, fid in enumerate(ix2doc)}
vocabulary = set()

term_counts = defaultdict(Counter)
document_counts = Counter()

for fid, doc in docs.items():
    unique_terms = set()
    for sentence in doc:
        sentence = process(sentence)
        term_counts[fid].update(sentence)
        unique_terms.update(sentence)
    document_counts.update(unique_terms)
    vocabulary.update(unique_terms)

ix2term = sorted(list(vocabulary))
term2ix = OrderedDict()
for i, term in enumerate(ix2term):
    term2ix[term] = i

Now that we have our indexes, let’s construct a matrix for TF and IDF.

term_count_mat = sparse.dok_matrix((len(doc2ix), len(term2ix)))

for fid, i in doc2ix.items():
    for term, count in term_counts[fid].items():
        j = term2ix[term]
        term_count_mat[i, j] = count
term_count_mat = term_count_mat.todense()

doc_count_vec = np.array(
    [document_counts[term] for term in term2ix.keys()])
tf = np.log(term_count_mat + 1)           # log-scaled term frequencies
idf = len(doc2ix) / (1 + doc_count_vec)   # inverse document frequencies

tfidf = np.multiply(tf, idf)
tfidf.shape
(500, 40881)

Note the dimensions. This is a rather large matrix for such a small data set. Although we could represent it sparsely, many algorithms that we may use downstream require dense representations. Apart from space-efficiency concerns, having this many dimensions can worsen performance for some algorithms. This is what distributional semantics helps with.

Now, we can build our model.

from sklearn.cluster import KMeans
K = 6
clusters = ['cluster#{}'.format(k) for k in range(K)]
model = KMeans(n_clusters=K, random_state=314)
clustered = model.fit_transform(tfidf)
clustered.shape
(500, 6)

We can see that we have now clustered our documents using our 6 centroids. Each of these centroids is a vector over our vocabulary. We can look at which words are most influential for each centroid. We will use word clouds for this.

model.cluster_centers_.shape
(6, 40881)
cluster_term = pd.DataFrame(model.cluster_centers_.T, index=ix2term, columns=clusters)
cluster_term = np.round(cluster_term, decimals=4)
fig, axs = plt.subplots(K // 2, 2, figsize=(10, 8))

k = 0
for i in range(len(axs)):
    for j in range(len(axs[i])):
        wc = WordCloud(colormap='Greys', background_color='white')
        im = wc.generate_from_frequencies(cluster_term[clusters[k]])
        axs[i][j].imshow(im, interpolation='bilinear')
        axs[i][j].axis("off")
        axs[i][j].set_title(clusters[k])
        k += 1
        
plt.tight_layout()
plt.show()
Figure 10-1. word clouds built from the centroids

We can see some recognizable topics here. Cluster #5 seems to be about mathematical topics. Cluster #3 seems to be about food preparation, specifically pasteurization.

K-Means does not make many assumptions about our data; it just tries to find the K centroids that best fit the data. One drawback to K-Means is that it tends to create similarly sized clusters. This is an unrealistic expectation for a natural corpus. Additionally, we don’t get much in the way of characterizing the similarity between documents. Let’s try an algorithm that makes more assumptions, but gives us a way to see how similar documents are to each other.

Latent Semantic Indexing

Latent Semantic Indexing (LSI) is a technique for decomposing the document-term matrix using Singular Value Decomposition, or SVD. In SVD we decompose a matrix into three matrices.

M → UΣVᵀ

Σ is a diagonal matrix of the singular values in descending order. We can take the top K singular values, and this serves as an approximation of the original matrix. The first K columns of U are the representations of the documents in the K-dimensional space. The first K columns of V are the representations of the terms in the K-dimensional space. This lets us compare the similarity of documents and terms. It is common to choose a larger number of components for this, so we will set K to a higher number here.
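
Here is a minimal NumPy sketch of what the truncation does, using a small random matrix as a stand-in for the document-term matrix. TruncatedSVD, which we use below, computes this more efficiently, without materializing the full decomposition.

import numpy as np

rng = np.random.RandomState(0)
M = rng.rand(20, 50)   # stand-in for a small document-term matrix
K = 5

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# keep only the top K singular values and vectors
M_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

doc_vecs = U[:, :K] * s[:K]   # documents in the K-dimensional space
term_vecs = Vt[:K, :].T       # terms in the K-dimensional space

# the rank-K product approximates the original matrix
print(np.linalg.norm(M - M_k))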

from sklearn.decomposition import TruncatedSVD
K = 100
clusters = ['cluster#{}'.format(k) for k in range(K)]
model = TruncatedSVD(n_components=K)
clustered = model.fit_transform(tfidf)

Let’s look at the K singular values we are keeping.

model.singular_values_
array([3529.39905473, 3244.51395305, 3096.10335704, 3004.8882987 ,
       2814.77858204, 2778.96902533, 2754.2942512 , 2714.32865945,
       2652.4119094 , 2631.64362227, 2578.41230573, 2496.86392987,
       2478.31563312, 2466.82942537, 2465.83674175, 2450.22361278,
       2426.99364435, 2417.13989816, 2407.40572992, 2394.21460258,
       2379.89976747, 2369.78970648, 2344.36252585, 2337.77090924,
       2324.76055049, 2319.07434771, 2308.81232676, 2304.85707171,
       2300.6888689 , 2299.08592131, 2292.18931562, 2281.59638332,
       2280.80535179, 2276.55977269, 2265.29827699, 2264.49999278,
       2259.19162875, 2253.20088136, 2249.34547946, 2239.31921392,
       2232.24240145, 2221.95468155, 2217.95110287, 2208.94458997,
       2199.75216312, 2195.85509817, 2189.76213831, 2186.64540094,
       2178.92705724, 2170.98276352, 2164.19734464, 2159.85021389,
       2154.82652164, 2145.5169884 , 2142.3070779 , 2138.06410065,
       2132.8723198 , 2125.68060737, 2123.13051755, 2121.25651627,
       2119.0925646 , 2113.46585963, 2102.77888039, 2101.07116001,
       2094.0766712 , 2090.41516403, 2086.00515811, 2080.55424737,
       2075.54071367, 2070.03500007, 2066.78292077, 2064.93112208,
       2056.24857815, 2052.96995192, 2048.62550688, 2045.18447518,
       2038.27607405, 2032.74196113, 2026.9687453 , 2022.61629887,
       2018.05274649, 2011.24594096, 2009.64212589, 2004.15307518,
       2000.17006712, 1995.76552783, 1985.15438092, 1981.71380603,
       1977.60898352, 1973.78806955, 1966.68359784, 1962.29604116,
       1956.62028269, 1951.54510897, 1951.25219946, 1943.75611963,
       1939.85749385, 1933.30524488, 1928.57205628, 1919.57447254])

The components_ of the model are Vᵀ truncated to its first K rows (equivalently, the first K columns of V, transposed). So now, let’s look at the terms distributed over the components.

cluster_term = pd.DataFrame(model.components_.T, index=ix2term, columns=clusters)
cluster_term = np.round(cluster_term, decimals=4)
cluster_term.loc[['polynomial', 'cat', 'frankfurter']]
term         cluster#0  cluster#1  ...  cluster#98  cluster#99
polynomial      0.0003     0.0012  ...      0.0077     -0.0182
cat             0.0002     0.0018  ...      0.0056     -0.0026
frankfurter     0.0004     0.0018  ...     -0.0025     -0.0025

Since we did not stem our words, let’s see if we can find “polynomials” from the vector for “polynomial”. We will use cosine similarity for this. The dot product is one way of looking at the similarity between two vectors. The idea is that we want to look at the angle between two vectors. If they are parallel, the similarity should be 1; if they are orthogonal, the similarity should be 0; and if they point in opposite directions, the similarity should be -1. So we want to look at the cosine of the angle between them. The dot product of two vectors is equal to the product of the magnitudes of the two vectors times the cosine of the angle between them. So, we can take the dot product divided by the product of the magnitudes.
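
In other words, for vectors u and v, the similarity is (u · v) / (|u| |v|). A minimal NumPy sketch of that formula, just to make the computation explicit:

import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the magnitudes
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 2 * u))   # parallel vectors -> 1.0
print(cosine_similarity(u, -u))      # opposite vectors -> -1.0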

Scipy has a function for cosine distance, which is one minus cosine similarity. We want the similarity, though.

from scipy.spatial.distance import cosine

def cossim(u, v):
    return 1 - cosine(u, v)
polynomial_vec = cluster_term.iloc[term2ix['polynomial']]

similarities = cluster_term.apply(
    lambda r: cossim(polynomial_vec, r), axis=1)
similarities.sort_values(ascending=False)[:20]
polynomial        1.000000
nilpotent         0.999999
diagonalizable    0.999999
commute           0.999999
polynomials       0.999999
subspace          0.999999
divisible         0.999998
satisfies         0.999998
differentiable    0.999998
monic             0.999998
algebraically     0.999998
primes            0.999996
spanned           0.999996
decomposes        0.999996
scalar            0.999996
commutes          0.999996
algebra           0.999996
integers          0.999991
subspaces         0.999991
exponential       0.999991
dtype: float64

We see that “polynomials” is very close, as are a number of other mathematically themed words. This is the “semantics” that people often refer to distributional semantics capturing. We can use these representations as features. The larger and more diverse the corpus, the more generally applicable these representations will be.
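
The rows of clustered are the documents in the K-dimensional space, so the same cosine similarity lets us compare documents. As a quick sketch, reusing the cossim helper defined above, we can find the documents most similar to the first one.

doc_vecs = pd.DataFrame(clustered, index=ix2doc, columns=clusters)

query_vec = doc_vecs.iloc[0]
doc_similarities = doc_vecs.apply(
    lambda r: cossim(query_vec, r), axis=1)
doc_similarities.sort_values(ascending=False)[:5]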

chosen_ix = [0, 97, 1, 98, 2, 99]

fig, axs = plt.subplots(3, 2, figsize=(10, 8))

k = 0
for i in range(len(axs)):
    for j in range(len(axs[i])):
        wc = WordCloud(colormap='Greys', background_color='white')
        im = wc.generate_from_frequencies(cluster_term[clusters[chosen_ix[k]]])
        axs[i][j].imshow(im, interpolation='bilinear')
        axs[i][j].axis("off")
        axs[i][j].set_title(clusters[chosen_ix[k]])
        k += 1
        
plt.tight_layout()
plt.show()

Figure 10-2. word clouds built from the term distributions

Cluster #2 seems to be related to medicine. The others don’t appear very informative. This makes sense, since this is an approximation of the original matrix, not a clustering of it. It does reduce the dimensions, though, so we could still use it for downstream processing.

The idea that our data is composed of latent features relating terms to documents can also be addressed directly; Latent Dirichlet Allocation (LDA), which we will use in the exercises below, models documents in this way.

Exercises

Let’s see how these techniques work on our classification problem from chapter 9. We will be using Spark’s implementation of LDA for this.

First, let’s load the data.

import os
import re

import numpy as np
import pandas as pd

from pyspark.sql.types import *
from pyspark.sql.functions import expr
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import *
from pyspark.ml.clustering import LDA
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

import sparknlp
from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

%matplotlib inline

spark = sparknlp.start()
HEADER_PTN = re.compile(r'^[a-zA-Z-]+:.*')

def remove_header(path_text_pair):
    path, text = path_text_pair
    lines = text.split('\n')
    line_iterator = iter(lines)
    while HEADER_PTN.match(next(line_iterator)) is not None:
        pass
    return path, '\n'.join(line_iterator)
path = os.path.join('data', 'mini_newsgroups', '*')
texts = spark.sparkContext.wholeTextFiles(path).map(remove_header)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema) \
    .withColumn('newsgroup', expr('split(path, "/")[7]')) \
    .persist()

train, test = texts.randomSplit([0.8, 0.2], seed=123)

Now, let’s build our NLP pipeline.

assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
tokenizer = Tokenizer() \
    .setInputCols(['sentences']) \
    .setOutputCol('token')
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(['token']) \
    .setOutputCol('lemma')
normalizer = Normalizer() \
    .setCleanupPatterns([
        '[^a-zA-Z.-]+',
        '^[^a-zA-Z]+',
        '[^a-zA-Z]+$',
    ]) \
    .setInputCols(['lemma']) \
    .setOutputCol('normalized') \
    .setLowercase(True)
finisher = Finisher() \
    .setInputCols(['normalized']) \
    .setOutputCols(['normalized']) \
    .setOutputAsArray(True)

Let’s remove stop words and use TF.IDF vectors.

stopwords = set(StopWordsRemover.loadDefaultStopWords("english"))

sw_remover = StopWordsRemover() \
    .setInputCol("normalized") \
    .setOutputCol("filtered") \
    .setStopWords(list(stopwords))

count_vectorizer = CountVectorizer(
    inputCol='filtered', outputCol='tf', minDF=10)
idf = IDF(inputCol='tf', outputCol='tfidf')

Spark has an implementation of LDA. Let’s use that in combination with logistic regression as our classifier. We will be combining the output of the LDA model with the TF.IDF vectors using the VectorAssembler.

lda = LDA(
    featuresCol='tfidf',
    seed=123,
    maxIter=20,
    k=100,
    topicDistributionCol='topicDistribution',
)

vec_assembler = VectorAssembler(
    inputCols=['tfidf', 'topicDistribution'],
    outputCol='features')
logreg = LogisticRegression(
    featuresCol='features',
    maxIter=100,
    regParam=0.0,
    elasticNetParam=0.0,
)

Finally, we assemble our pipeline.

label_indexer = StringIndexer(
    inputCol='newsgroup', outputCol='label')

pipeline = Pipeline().setStages([
    assembler, sentence, tokenizer, 
    lemmatizer, normalizer, finisher,
    sw_remover, count_vectorizer, idf,
    lda, vec_assembler, 
    label_indexer, logreg
])
evaluator = MulticlassClassificationEvaluator(metricName='f1')
model = pipeline.fit(train)
train_predicted = model.transform(train)
test_predicted = model.transform(test)
print('f1', evaluator.evaluate(train_predicted))
f1 0.9956621119176594
print('f1', evaluator.evaluate(test_predicted))
f1 0.5957199376998746

This seems to overfit more than before. Try adding regularization, and try using only the topic distributions as features.
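
As a starting point for the exercise, one possible variation is sketched below; the regParam value is an arbitrary assumption, not a tuned setting.

# use only the LDA topic distributions, with some regularization
logreg_topics_only = LogisticRegression(
    featuresCol='topicDistribution',
    maxIter=100,
    regParam=0.1,        # assumed value; tune with CrossValidator
    elasticNetParam=0.0,
)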

Good luck!
