In the last chapter we covered some of the techniques used to extract information from text. These techniques can be complicated to implement and may also be slow. If the application requires the information extracted to be present to users, these techniques are great. If we are looking to extract information as part of an intermediate processing step, for instance building features for a classfier, then we don’t need to extract readable information. Essentially, we want to reduce the dimensionality of our data. This is where distributional semantics comes in.
Distributional semantics is the study using statistical distributions of elements of language to characterize similarities between documents, speech acts, or elements. The idea for this field comes from John R. Firth, a linguist in the first half of the 20th Century. He noted how semantics was dependent on context, and coined the oft repeated quote
You shall know a word by the company it keeps.
The idea is that you can represent a word by a distribution over the contexts it appears in. These seems to make sense, but is bit nebulous. What is the distribution over the contexts for a given word. In text-based NLP, we generally look at cooccurrences with other words. We can represent the contexts a word appears in by looking at words appear in those contexts. The approaches to this are generally discussed in terms of linear algebra.
We have a document-term matrix where the rows are documents, and columns are terms. The values can be binary representing word presence, counts of the number of times a term occurs, or can TF.IDF values. We will be using TF.IDF values in this chapter. Once we have such a matrix, we will want to map our documents to a space with fewer dimensions. This clusters the documents into topics.
An important caveat is that, despite the name, distrubutional semantics does not actually look at the semantics of a word. The semantics of a word is a representational relationship of an element of language to the world. The distributions that we look at characterize a word or a document by the words that occur in their context. There is a connection between context and semantics, as Firth pointed out, but without access to the world we don’t actually get the semantics of the element.
Let’s try a classic clustering technique first - K-Means. Let’s say we have a number of data points in a vector space. We can pick K
points in the vector space, called centroids, and assign each datapoint to the closest centroid. We want to find the K
points that minimize the distance between the datapoints and their centroid.
In our situation, we have documents in a vector space defined by the TF.IDF values. When we find our K
points, we can say that each K represents a topic. Let’s look at an example.
First, let’s build our data set.
from collections import defaultdict, Counter, OrderedDict import numpy as np import pandas as pd import scipy.sparse as sparse from wordcloud import WordCloud import matplotlib.pyplot as plt %matplotlib inline from nltk.corpus import stopwords from nltk.corpus import brown en_stopwords = set(stopwords.words('english'))
def detokenize(sentence): text = '' for token in sentence: if text and any(c.isalnum() for c in token): text += ' ' text += token return text
We will want to remove punctuation and stop words, so our algorithms just need to use the “useful” words.
def process(sentence): terms = [] for term in sentence: term = term.lower() if term not in en_stopwords and term.isalnum(): terms.append(term) return terms
Let’s gather our docs. Our documents will be lists of lists of terms.
docs = OrderedDict() for fid in brown.fileids(): docs[fid] = brown.sents(fid)
Now we will construct indexes.
ix2doc = list(docs) doc2ix = {fid: i for i, fid in enumerate(ix2doc)} vocabulary = set() term_counts = defaultdict(Counter) document_counts = Counter() for fid, doc in docs.items(): unique_terms = set() for sentence in doc: sentence = process(sentence) term_counts[fid].update(sentence) unique_terms.update(sentence) document_counts.update(unique_terms) vocabulary.update(unique_terms) ix2term = sorted(list(vocabulary)) term2ix = OrderedDict() for i, term in enumerate(ix2term): term2ix[term] = i
Now that we have our indexes, let’s construct a matrix for TF and IDF.
term_count_mat = sparse.dok_matrix((len(doc2ix), len(term2ix))) for fid, i in doc2ix.items(): for term, count in term_counts[fid].items(): j = term2ix[term] term_count_mat[i, j] = count term_count_mat = term_count_mat.todense() doc_count_vec = np.array( [document_counts[term] for term in term2ix.keys()])
tf = np.log(term_count_mat + 1) idf = len(doc2ix) / (1 + doc_count_vec) tfidf = np.multiply(tf, idf)
tfidf.shape
(500, 40881)
Note the dimensions. This is a rather large matrix for such a small dataset. Although we could represent it sparsely, many algorithms that we may use downstream require dense representations. Appart from space efficiency concerns, having this many dimensions can worsen performence for some algorithms. This is what distributional semantics looks to help with.
Now, we can build our model.
from sklearn.cluster import KMeans
K = 6 clusters = ['cluster#{}'.format(k) for k in range(K)] model = KMeans(n_clusters=K, random_state=314)
clustered.shape
We can see that we have now clustered our documents using our 6 centroids. Each of these centroids is a vector over our vocabulary. We can look at which words are most influential on our centroids. We will use wordclouds for this.
model.cluster_centers_.shape
(6, 40881)
cluster_term = pd.DataFrame(model.cluster_centers_.T, index=ix2term, columns=clusters) cluster_term = np.round(cluster_term, decimals=4)
fig, axs = plt.subplots(K // 2, 2, figsize=(10, 8)) k = 0 for i in range(len(axs)): for j in range(len(axs[i])): wc = WordCloud(colormap='Greys', background_color='white') im = wc.generate_from_frequencies(cluster_term[clusters[k]]) axs[i][j].imshow(im, interpolation='bilinear') axs[i][j].axis("off") axs[i][j].set_title(clusters[k]) k += 1 plt.tight_layout() plt.show()
We can see some recognizable topics here. Cluster #5 seems to be about mathematical topics. Cluster #3 seems to be about food preparation, specifically pasteurization.
K-Means does not make many assumptions about our data, it just tries to find the K
. One drawback to K-Means, is that it tends to create equally sized clusters. This is an unrealistic expectation for a natural corpus. Additionally, we don’t get much in the way if characterizing the similarity between documents. Let’s try an algorithm that makes some more assumptions, but will give us a way to see how similar documents are to each other.
Latent Semantic Indexing is technique for decomposing the document-term matrix using Singular Value Decomposition or SVD. In SVD we decompose a matrix into three matrices.
Σ
is a diagonal matrix of the singular values in descending order. We can take the top K
, and this serves as an approximation of the original matrix. The first K
columns of U
are the representation of the documents in the K
dimensional space. The first K
columns of V
are the representation of the terms in the K
dimensional space. This let’s us compare the similarity of documents and terms. It is common to choose a larger number for the components, so we will set K
to be higher number here.
from sklearn.decomposition import TruncatedSVD
K = 100 clusters = ['cluster#{}'.format(k) for k in range(K)] model = TruncatedSVD(n_components=K)
clustered = model.fit_transform(tfidf)
Let’s look at the K
singular values we are keeping.
model.singular_values_
array([3529.39905473, 3244.51395305, 3096.10335704, 3004.8882987 , 2814.77858204, 2778.96902533, 2754.2942512 , 2714.32865945, 2652.4119094 , 2631.64362227, 2578.41230573, 2496.86392987, 2478.31563312, 2466.82942537, 2465.83674175, 2450.22361278, 2426.99364435, 2417.13989816, 2407.40572992, 2394.21460258, 2379.89976747, 2369.78970648, 2344.36252585, 2337.77090924, 2324.76055049, 2319.07434771, 2308.81232676, 2304.85707171, 2300.6888689 , 2299.08592131, 2292.18931562, 2281.59638332, 2280.80535179, 2276.55977269, 2265.29827699, 2264.49999278, 2259.19162875, 2253.20088136, 2249.34547946, 2239.31921392, 2232.24240145, 2221.95468155, 2217.95110287, 2208.94458997, 2199.75216312, 2195.85509817, 2189.76213831, 2186.64540094, 2178.92705724, 2170.98276352, 2164.19734464, 2159.85021389, 2154.82652164, 2145.5169884 , 2142.3070779 , 2138.06410065, 2132.8723198 , 2125.68060737, 2123.13051755, 2121.25651627, 2119.0925646 , 2113.46585963, 2102.77888039, 2101.07116001, 2094.0766712 , 2090.41516403, 2086.00515811, 2080.55424737, 2075.54071367, 2070.03500007, 2066.78292077, 2064.93112208, 2056.24857815, 2052.96995192, 2048.62550688, 2045.18447518, 2038.27607405, 2032.74196113, 2026.9687453 , 2022.61629887, 2018.05274649, 2011.24594096, 2009.64212589, 2004.15307518, 2000.17006712, 1995.76552783, 1985.15438092, 1981.71380603, 1977.60898352, 1973.78806955, 1966.68359784, 1962.29604116, 1956.62028269, 1951.54510897, 1951.25219946, 1943.75611963, 1939.85749385, 1933.30524488, 1928.57205628, 1919.57447254])
The components_
of the model are the rows of V
trucated to the first K
columns. So now, lets look at the terms distributed over the components.
cluster_term = pd.DataFrame(model.components_.T, index=ix2term, columns=clusters) cluster_term = np.round(cluster_term, decimals=4)
cluster_term.loc[['polynomial', 'cat', 'frankfurter']]
term | cluster#0 | cluster#1 | ... | cluster#98 | cluster#99 |
---|---|---|---|---|---|
polynomial | 0.0003 | 0.0012 | ... | 0.0077 | -0.0182 |
cat | 0.0002 | 0.0018 | ... | 0.0056 | -0.0026 |
frankfurter | 0.0004 | 0.0018 | ... | -0.0025 | -0.0025 |
Since we did not stem our words, let’s see if we can find “polynomials, from the vector for “polynomial”. We will use cosine similarity for this. The dot product is a technique for looking at the similarity between two vectors. The idea is that we want to look at the angle between two vectors. If they are parallel, the similarity should be 1, if they are orthogonal, the similarity should be 0, and if they are going in opposite directions the similarity should be -1. So we want to look at the cosine of the angle between of them. The dot product of two vectors is equal to the product the magnitudes of the two vectors times the cosine of the angle between them. So, we can take the dot product divided by the product of the magnitudes.
Scipy has a function for cosine distance, which is one minues cosine similarity. We want the similarity, though.
from scipy.spatial.distance import cosine def cossim(u, v): return 1 - cosine(u, v)
polynomial_vec = cluster_term.iloc[term2ix['polynomial']] similarities = cluster_term.apply( lambda r: cossim(polynomial_vec, r), axis=1)
similarities.sort_values(ascending=False)[:20]
polynomial 1.000000 nilpotent 0.999999 diagonalizable 0.999999 commute 0.999999 polynomials 0.999999 subspace 0.999999 divisible 0.999998 satisfies 0.999998 differentiable 0.999998 monic 0.999998 algebraically 0.999998 primes 0.999996 spanned 0.999996 decomposes 0.999996 scalar 0.999996 commutes 0.999996 algebra 0.999996 integers 0.999991 subspaces 0.999991 exponential 0.999991 dtype: float64
We see that “polynomials” is very close, as are a number of other mathematically themed words. This is the “semantics” that people often refer to distributional semantics capturing. We can use these representations as features. The larger, and more diverse, the corpus, the more generally applicable these representations will be.
chosen_ix = [0, 97, 1, 98, 2, 99] fig, axs = plt.subplots(3, 2, figsize=(10, 8)) k = 0 for i in range(len(axs)): for j in range(len(axs[i])): wc = WordCloud(colormap='Greys', background_color='white') im = wc.generate_from_frequencies(cluster_term[clusters[chosen_ix[k]]]) axs[i][j].imshow(im, interpolation='bilinear') axs[i][j].axis("off") axs[i][j].set_title(clusters[chosen_ix[k]]) k += 1 plt.tight_layout() plt.show() 2.10_LSI_wordcloud.png
Cluster #2 seems to be related medical. The others don’t appear very informative. This makes sense since this is an approximation of the original matrix, not a clustering of the matrix. This does reduce the dimensions though, so we could still use this for downstream processing.
The idea that our data is composed of some features relating terms to documents can be directly addressed.
Let’s see how these techniques work on our classification problem from chapter 9. We will be using Spark’s implementation of LDA for this.
First, let’s load the data.
import os import re import numpy as np import pandas as pd from pyspark.sql.types import * from pyspark.sql.functions import expr from pyspark.sql import Row from pyspark.ml import Pipeline from pyspark.ml.feature import * from pyspark.ml.clustering import LDA from pyspark.ml.classification import LogisticRegression from pyspark.ml.tuning import CrossValidator, ParamGridBuilder from pyspark.ml.evaluation import MulticlassClassificationEvaluator import sparknlp from sparknlp import DocumentAssembler, Finisher from sparknlp.annotator import * %matplotlib inline spark = sparknlp.start()
HEADER_PTN = re.compile(r'^[a-zA-Z-]+:.*') def remove_header(path_text_pair): path, text = path_text_pair lines = text.split(' ') line_iterator = iter(lines) while HEADER_PTN.match(next(line_iterator)) is not None: pass return path, ' '.join(line_iterator)
path = os.path.join('data', 'mini_newsgroups', '*') texts = spark.sparkContext.wholeTextFiles(path).map(remove_header) schema = StructType([ StructField('path', StringType()), StructField('text', StringType()), ]) texts = spark.createDataFrame(texts, schema=schema) .withColumn('newsgroup', expr('split(path, "/")[7]')) .persist() train, test = texts.randomSplit([0.8, 0.2], seed=123)
Now, let’s build our NLP pipeline.
assembler = DocumentAssembler() .setInputCol('text') .setOutputCol('document') sentence = SentenceDetector() .setInputCols(["document"]) .setOutputCol("sentences") tokenizer = Tokenizer() .setInputCols(['sentences']) .setOutputCol('token') lemmatizer = LemmatizerModel.pretrained() .setInputCols(['token']) .setOutputCol('lemma') normalizer = Normalizer() .setCleanupPatterns([ '[^a-zA-Z.-]+', '^[^a-zA-Z]+', '[^a-zA-Z]+$', ]) .setInputCols(['lemma']) .setOutputCol('normalized') .setLowercase(True) finisher = Finisher() .setInputCols(['normalized']) .setOutputCols(['normalized']) .setOutputAsArray(True)
Let’s remove stop words and use TF.IDF vectors.
stopwords = set(StopWordsRemover.loadDefaultStopWords("english")) sw_remover = StopWordsRemover() .setInputCol("normalized") .setOutputCol("filtered") .setStopWords(list(stopwords)) count_vectorizer = CountVectorizer( inputCol='filtered', outputCol='tf', minDF=10) idf = IDF(inputCol='tf', outputCol='tfidf')
Spark has an implementation of LDA. Let’s use that in combination with logistic regression as our classifier. We will be combining the output of the LDA model with the TF.IDF vectors using the VectorAssembler
.
lda = LDA( featuresCol='tfidf', seed=123, maxIter=20, k=100, topicDistributionCol='topicDistribution', ) vec_assembler = VectorAssembler( inputCols=['tfidf', 'topicDistribution'])
logreg = LogisticRegression( featuresCol='topicDistribution', maxIter=100, regParam=0.0, elasticNetParam=0.0, )
Finally, we assemble our pipeline.
label_indexer = StringIndexer( inputCol='newsgroup', outputCol='label') pipeline = Pipeline().setStages([ assembler, sentence, tokenizer, lemmatizer, normalizer, finisher, sw_remover, count_vectorizer, idf, lda, vec_assembler, label_indexer, logreg ])
evaluator = MulticlassClassificationEvaluator(metricName='f1')
model = pipeline.fit(train)
train_predicted = model.transform(train) test_predicted = model.transform(test)
print('f1', evaluator.evaluate(train_predicted))
f1 0.9956621119176594
print('f1', evaluator.evaluate(test_predicted))
f1 0.5957199376998746
This seems to overfit more than before. Try regularization, try only using the topics.
Good luck!
18.189.170.206