Topic modeling

Topics in natural language processing don't exactly match the dictionary definition; they correspond to a more nebulous statistical concept. We speak of topic models, in which each topic is linked to a probability distribution over words. When we read a text, we expect certain words appearing in the title or the body of the text to capture the semantic context of the document. An article about Python programming will contain words such as class and function, while a story about snakes will contain words such as eggs and afraid. Documents usually have multiple topics; for instance, this recipe is about topic models and non-negative matrix factorization, which we will discuss shortly. We can, therefore, define an additive model for topics by assigning different weights to topics.
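The additive model can be illustrated with a small sketch. The vocabulary, the two topic distributions, and the helper mix_topics are all made up for illustration; the point is only that a document's word distribution can be written as a weighted sum of topic distributions.

```python
# Hypothetical toy vocabulary and topic-word distributions (illustrative values only).
vocabulary = ["class", "function", "eggs", "afraid"]
topics = {
    "programming": [0.5, 0.5, 0.0, 0.0],
    "snakes": [0.0, 0.0, 0.6, 0.4],
}


def mix_topics(topic_weights, topics):
    """Return a document's word distribution as a weighted sum of topics."""
    size = len(next(iter(topics.values())))
    mixed = [0.0] * size
    for name, weight in topic_weights.items():
        for i, prob in enumerate(topics[name]):
            mixed[i] += weight * prob
    return mixed


# A document assumed to be 75% about programming and 25% about snakes.
document = mix_topics({"programming": 0.75, "snakes": 0.25}, topics)
for word, prob in zip(vocabulary, document):
    print(f"{word}: {prob:.3f}")
```

Because the topic weights sum to one and each topic is itself a probability distribution, the mixed result is again a probability distribution over the vocabulary.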

One of the topic modeling algorithms is non-negative matrix factorization (NMF). This algorithm factorizes a matrix into a product of two smaller matrices in such a way that none of the three matrices contains negative values. Usually, we are only able to numerically approximate the solution of the factorization, and the time complexity is polynomial. The scikit-learn NMF class implements this algorithm. Its main constructor parameters are listed in the following table:

| Constructor parameter | Default | Example values | Description |
|---|---|---|---|
| n_components | - | 5, None | Number of components. In this example, this corresponds to the number of topics. |
| max_iter | 200 | 300, 10 | Number of iterations. |
| alpha | 0 | 10, 2.85 | Multiplication factor for regularization terms. |
| tol | 1e-4 | 1e-3, 1e-2 | Value regulating stopping conditions. |
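To make the factorization concrete, here is a minimal plain-Python sketch that approximates a small non-negative matrix V by a product W H using the classic multiplicative update rule. This is one well-known way to compute NMF and is not necessarily the solver the scikit-learn class uses by default. The helper names, the toy matrix, and the stopping rule (stop when the error improves by less than tol) are assumptions made for illustration; the arguments n_components, max_iter, and tol only mirror the roles of the constructor parameters with the same names.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, used as the reconstruction error."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf_multiplicative(v, n_components, max_iter=200, tol=1e-4, seed=0):
    """Approximate V by W H with non-negative factors (illustrative sketch)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    # Strictly positive random start keeps every later value non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(n_components)]
    eps = 1e-12  # guards against division by zero
    previous = frobenius_error(v, w, h)
    for _ in range(max_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(n_components)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_components)]
             for i in range(m)]
        error = frobenius_error(v, w, h)
        if previous - error < tol:  # assumed stopping rule
            break
        previous = error
    return w, h


# Toy document-term counts: rows are documents, columns are words.
V = [
    [3, 2, 0, 0],
    [6, 4, 0, 0],
    [0, 0, 2, 4],
    [3, 2, 1, 2],
]
W, H = nmf_multiplicative(V, n_components=2, max_iter=500, tol=1e-6)
print("reconstruction error:", round(frobenius_error(V, W, H), 4))
```

In this reading, each row of H plays the role of a topic (a set of non-negative word weights) and each row of W holds the weights of those topics in one document, which matches the additive view of topics.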

NMF can also be applied to document clustering and signal processing. The following code uses it to extract topics from the 20 newsgroups dataset:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> from nltk.corpus import names
>>> from nltk.stem import WordNetLemmatizer
>>> from sklearn.decomposition import NMF
>>> # The NLTK 'names' and 'wordnet' corpora must be downloaded beforehand
>>> def letters_only(astr):
...     return astr.isalpha()
...
>>> cv = CountVectorizer(stop_words="english", max_features=500)
>>> groups = fetch_20newsgroups()
>>> cleaned = []
>>> all_names = set(names.words())
>>> lemmatizer = WordNetLemmatizer()
>>> # Keep alphabetic words that are not names, lowercased and lemmatized
>>> for post in groups.data:
...     cleaned.append(' '.join([
...         lemmatizer.lemmatize(word.lower())
...         for word in post.split()
...         if letters_only(word)
...         and word not in all_names]))
...
>>> transformed = cv.fit_transform(cleaned)
>>> nmf = NMF(n_components=100, random_state=43).fit(transformed)
>>> # In scikit-learn 1.2 and later, use cv.get_feature_names_out() instead
>>> feature_names = cv.get_feature_names()
>>> # Print the eight highest-weighted words of each topic
>>> for topic_idx, topic in enumerate(nmf.components_):
...     label = '{}: '.format(topic_idx)
...     print(label, " ".join([feature_names[i]
...                            for i in topic.argsort()[:-9:-1]]))

We get the following 100 topics:

0:  wa went came told said started took saw
1: db bit data stuff place add time line
2: file change source information ftp section entry server
3: output line write open entry return read file
4: disk drive controller hard support card board head
5: entry program file rule source info number need
6: hockey league team division game player san final
7: image software user package include support display color
8: window manager application using user server work screen
9: united house control second american national issue period
10: internet anonymous email address user information mail network
11: use using note similar work usually provide case
12: turkish jew jewish war did world sent book
13: space national international technology earth office news technical
14: anonymous posting service server user group message post
15: science evidence study model computer come method result
16: widget application value set type return function display
17: work job young school lot need create private
18: available version server widget includes support source sun
19: center research medical institute national test study north
20: armenian turkish russian muslim world road city today
21: computer information internet network email issue policy communication
22: ground box need usually power code house current
23: russian president american support food money important private
24: ibm color week memory hardware standard monitor software
25: la win san went list radio year near
26: child case le report area group research national
27: key message bit security algorithm attack encryption standard
28: encryption technology access device policy security need government
29: god bible shall man come life hell love
30: atheist religious religion belief god sort feel idea
31: drive head single scsi mode set model type
32: war military world attack russian united force day
33: section military shall weapon person division application mean
34: water city division similar north list today high
35: think lot try trying talk agree kind saying
36: data information available user model set based national
37: good cover better great pretty player probably best
38: tape scsi driver drive work need memory following
39: dod bike member started computer mean live message
40: car speed driver change high better buy different
41: just maybe start thought big probably getting guy
42: right second free shall security mean left individual
43: problem work having help using apple error running
44: greek turkish killed act western muslim word talk
45: israeli arab jew attack policy true jewish fact
46: argument form true event evidence truth particular known
47: president said did group tax press working package
48: time long having lot order able different better
49: rate city difference crime control le white study
50: new york change old lost study early care
51: power period second san special le play result
52: wa did thought later left order seen man
53: state united political national federal local member le
54: doe mean anybody different actually help common reading
55: list post offer group information course manager open
56: ftp available anonymous package general list ibm version
57: nasa center space cost available contact information faq
58: ha able called taken given exactly past real
59: san police information said group league political including
60: drug group war information study usa reason taken
61: point line different algorithm exactly better mean issue
62: image color version free available display better current
63: got shot play went took goal hit lead
64: people country live doing tell killed saying lot
65: run running home start hit version win speed
66: day come word christian jewish said tell little
67: want need help let life reason trying copy
68: used using product function note version single standard
69: game win sound play left second great lead
70: know tell need let sure understand come far
71: believe belief christian truth claim evidence mean different
72: public private message security issue standard mail user
73: church christian member group true bible view different
74: question answer ask asked did reason true claim
75: like look sound long guy little having pretty
76: human life person moral kill claim world reason
77: thing saw got sure trying seen asked kind
78: health medical national care study user person public
79: make sure sense little difference end try probably
80: law federal act specific issue order moral clear
81: unit disk size serial total got bit national
82: chip clipper serial algorithm need phone communication encryption
83: going come mean kind working look sure looking
84: university general thanks department engineering texas world computing
85: way set best love long value actually white
86: card driver video support mode mouse board memory
87: gun crime weapon death study control police used
88: service communication sale small technology current cost site
89: graphic send mail message server package various computer
90: team player win play better best bad look
91: really better lot probably sure little player best
92: say did mean act word said clear read
93: program change display lot try using technology application
94: number serial following large men le million report
95: book read reference copy speed history original according
96: year ago old did best hit long couple
97: woman men muslim religion world man great life
98: government political federal sure free local country reason
99: article read world usa post opinion discussion bike
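
Each line of this listing comes from sorting one row of nmf.components_ and keeping the eight largest weights, which is what the argsort()[:-9:-1] slice does. The same selection can be sketched in plain Python as follows; the helper top_words, the six-word vocabulary, and the weights are hypothetical and chosen only to show the mechanics (the order of exactly tied weights may differ from the NumPy version).

```python
def top_words(topic_weights, feature_names, n_top=8):
    """Return the n_top feature names with the largest weights, heaviest first."""
    order = sorted(range(len(topic_weights)),
                   key=lambda i: topic_weights[i], reverse=True)
    return [feature_names[i] for i in order[:n_top]]


# Hypothetical weights of one topic over a made-up vocabulary.
feature_names = ["disk", "drive", "god", "hockey", "scsi", "team"]
topic = [0.9, 1.4, 0.0, 0.1, 0.7, 0.05]
print(" ".join(top_words(topic, feature_names, n_top=3)))  # prints "drive disk scsi"
```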