Index

Symbols

!python -m spacy download en_core_web_lg command 132, 209, 390

!python -m spacy download en_core_web_md command 132, 209, 390

!python -m spacy download en_core_web_sm command 132, 209, 390

A

accuracy 45, 196

additive smoothing 318

add-one smoothing 318

adjectives lexicons 253

adposition 130

ADP POS tag 144

ADP tag 130

aggregating sentiment scores

with help of lexicon 235–237

with sentiment lexicon 251–259

collecting sentiment scores from lexicon 252–255

detecting review polarity 255–259

algorithms

topic clustering 338–341

topic modeling 360–378

applying LDA model 371–374

exploring results 375–378

loading data 361–363

preprocessing data 363–371

alpha parameter 317

argmax 401

arrays 7

author profiling 151–193

authorship attribution

overview 154

practical use of 226–227

Decision Trees 175

classifier basics 175–177

evaluating which tree is better using node impurity 178–184

on language data 185–191

selection of best split in 184–185

linguistic feature engineering for 194–228

feature engineering for authorship attribution 200–226

machine-learning pipeline 196–200

practical use of authorship attribution and user profiling 226–227

machine-learning pipeline 157–175

original data 157–163

setting up benchmark 169–175

testing generalization behavior 163–168

user profiling 155–157

B

bag-of-words models 397

base form 91

baseline 46

benchmark machine learning model 157

bigram modeling 22

Binarizer tool 289

binary classification 33, 159

binary method 258

BIO scheme 393–394

Blei, David 381

Boolean search algorithm 83–86

C

centroid 327

character bigrams 22

character unigram 21

chatbots 20

chunk.root.dep function 141

chunk.root.head.text function 141

chunk.root.text function 141

chunk.text function 141

CISI (Centre for Inventions and Scientific Information) dataset 75

classes

defining 37

implementing spam filter 46–49

classification 32

classifiers

Decision Trees 175–177

evaluating

implementing spam filter 62–65

overview 45–46

evaluating performance of 196–197

Naïve Bayes 207, 312–320

training

implementing spam filter 53–61

overview 43–44

class label 33

clustering

evaluation of topic clustering algorithm 338–341

for topic discovery 330–337

codecs.open function 47

completeness 339

concordance function 64

conditional probability 54

confusion matrices 196

content words 18

convergence 329

conversational agents 18, 20

cosine similarity 13, 104

Counter functionality 216

CountVectorizer tool 285

cross-validation 292–295

cross_val_predict functionality 293

cross_val_score functionality 293

CSV (comma-separated values) 404

.csv format 404

D

data

data structures 75–83

defining 37

for sentiment analysis

analyzing 243–251

loading and preprocessing 240–242

implementing spam filter 46–49

NER (named-entity recognition) 403–406

processing 87–95

morphological processing 90–95

removing stopwords 87–90

supervised ML (machine learning) 308–312

topic modeling algorithm

applying LDA model 371–374

exploring results 375–378

loading data 361–363

preprocessing data 363–371

DATE type 410

decision rule 176

Decision Trees 175

classifier basics 175–177

evaluating which tree is better using node impurity 178–184

on language data 185–191

selection of best split in 184–185

deep learning 4

def keyword 15

dependence on context 277–295

evaluating with cross-validation 292–295

extracting features from text 284–289

preparing data 278–284

Scikit-learn machine-learning pipeline 289–292

dependency parsing 139–144

dependents 140

df.shape function 405

displaCy visualization tool 142, 416

dobj (direct object) 142

dobj relation 145

documents

document similarity retrieval 104–105

inverse document frequency 100–103

dot product 14

downstream tasks 386

E

Enron dataset 47

Euclidean distance 11

Euclidean space 11

evaluating classifiers

implementing spam filter 62–65

overview 45–46

evidence 58

extracting features

implementing spam filter 50–52

overview 42–43

F

F1 measure 199

F1 score 199

feature engineering for authorship attribution 200–226

counts of stopwords and proportion of stopwords as features 207–211

distribution of word suffixes as features 219–222

distributions of parts of speech as features 212–218

unique words as features 223–226

word and sentence length statistics as features 201–207

features 25, 33, 152, 195

feature selection 187

feature sparsity 187

feature vector 25, 201

fit method 204, 288

fit-predict routine 289

fit_transform method 286

format functionality 133

formatted string literals 49

frequent words lexicons 253

functions 25, 43

function words 18, 88

G

generalization behavior, testing 163–168

generative models 357

gensim functionality 368

get_feature_names() function 287

get_feature_names_out() function 287

Gibbs Sampling for the Uninitiated (Resnik and Hardisty) 355

Gini impurity 182

gold standard labels 75

GPE (geopolitical entity) 385

GPE type 410

grammar checking 28–29

ground truth labels 75

Gutenberg Project data 159

H

ham class 43

Hardisty, Eric 355

heads 140

homogeneity 339

I

IDE (integrated development environment), Python 9

idf (inverse document frequency) 100–103

if statement 34

information extraction 114–150

building information extraction algorithm 144–148

part-of-speech tagging 124–137

with spaCy 128–137

word types 124–128

syntactic parsing 137–144

dependency parsing with spaCy 139–144

sentence structure 137–139

task 120–124

use cases 116–120

with NER (named-entity recognition) 410–415

information retrieval 5

information search 5, 71–113

advanced 16–18

overview 5–16

processing data 87–95

morphological processing 90–95

removing stopwords 87–90

search algorithm 103–111

deploying 111

document similarity retrieval 104–105

evaluation of results 106–111

tasks 72–86

Boolean search algorithm 83–86

data and data structures 75–83

weighing words 96–103

with inverse document frequency 100–103

with term frequency 97–100

input functionality 66

installation instructions 422

integrated development environment (IDE) 9

inverse document frequency (idf) 100–103

J

Jurafsky, Dan 2, 392

K

keywords 96

k-fold cross-validation 293

K-means clustering 337

KMeans clustering algorithm 337

L

language data, on Decision Trees 185–191

language generation 19–25

language modeling 24

Laplace smoothing 318

latent factors 336

LDA (latent Dirichlet allocation) 307, 348–360

as generative model 356–360

estimating parameters for 352–356

lemmas base forms 130

lemmatization 130

lemmatizer tool 18, 130

length-normalized vectors 13

length of sentiment-bearing features 295–297

lexicons, sentiment 251–259

aggregating sentiment scores with 235–237

collecting sentiment scores from 252–255

detecting review polarity 255–259

linguistic feature engineering 194–228

feature engineering for authorship attribution 200–226

counts of stopwords and proportion of stopwords as features 207–211

distribution of word suffixes as features 219–222

distributions of parts of speech as features 212–218

unique words as features 223–226

word and sentence length statistics as features 201–207

machine-learning pipeline 196–200

evaluating performance of classifier 196–197

further evaluation measures 197–200

practical use of authorship attribution and user profiling 226–227

list comprehensions 48

LOC (location) 385

lower bound on algorithm’s performance 173

M

machine learning. See ML

machine translation 26–28

majority class baseline 46

Markov models 396

Martin, James H. 2, 392

math functionality 12

mean precision 107

mean precision @k 107

mean reciprocal rank (MRR) 109

metrics functionality 204, 320

ML (machine learning)

addressing dependence on context with 277–295

evaluating with cross-validation 292–295

extracting features from text 284–289

preparing data 278–284

Scikit-learn machine-learning pipeline 289–292

author profiling 151–193

authorship attribution 154

Decision Trees 175

machine-learning pipeline 157–175

user profiling 155–157

linguistic feature engineering 196–200

evaluating performance of classifier 196–197

further evaluation measures 197–200

machine-learning pipeline, Scikit-learn 289–292

topic classification as supervised task 307–325

data 308–312

evaluation of results 320–325

with Naïve Bayes 312–320

topic discovery as unsupervised task 325–341

clustering 330–337

evaluation of topic clustering algorithm 338–341

unsupervised ML (machine-learning) approaches 325–329

morphological forms 90

morphological processing 90–95

morphology 90

MRR (mean reciprocal rank) 109

multiclass classification 33, 307–325

N

Naïve Bayes 207, 312–320

Natural Language Processing Toolkit (NLTK) 49

negation 298–301

NEG marker 298

NER (named-entity recognition) 384–421

20 Newsgroups dataset 308

as sequence labeling task 392–403

BIO scheme 393–394

sequential solution for NER 397–403

sequential tasks 395–397

BIOES scheme 393

challenges in 390–392

IO scheme 393

named entity (NE) types 388–390

practical applications of 403–418

data loading and exploration 403–406

information extraction 410–415

named entities visualization 416–418

named entity types exploration with spaCy 406–410

neural-based language modeling 24

neural machine translation (NMT) 28

n-grams 22, 24, 280

NLP (natural language processing) 1–30

history of 2–4

spam filtering 31–70

deploying spam filter in practice 65–66

implementing spam filter 46–65

overview 31–35

tasks 36–46

tasks 5–29

advanced information search 16–18

conversational agents and intelligent virtual assistants 18–20

information search 5–16

machine translation 26–28

spam filtering 25

spell- and grammar checking 28–29

text prediction and language generation 20–25

nlp pipeline 130

nltk.download() command 50, 84, 89, 159, 269

NLTK (Natural Language Processing Toolkit) 49

NMT (neural machine translation) 28

node impurity 178–184

normalizing features

implementing spam filter 50–52

overview 42–43

noun phrases 140

NP (noun phrase) 145

nsubj relation 145

O

operator’s itemgetter functionality 104

operator functionality 220

ORDINAL type 410

ORG (organization) type 385, 410

os functionality 240

os module 47

P

pandas 404

pandas read_csv functionality 404

parsers 18, 140

part-of-speech taggers 18, 128

part-of-speech tagging 124–137

with spaCy 128–137

word types 124–128

parts of speech 212–218

PART tag 134

PERSON type 410

Pipeline functionality 289

pipelines

author profiling 157–175

original data 157–163

setting up benchmark 169–175

testing generalization behavior 163–168

linguistic feature engineering 196–200

evaluating performance of classifier 196–197

further evaluation measures 197–200

sentiment analysis 239–251

analyzing data 243–251

data loading and preprocessing 240–242

plot_confusion_matrix functionality 323

pobj (prepositional object) 142

polarity, sentiment 255–259

POS taggers 128

precision 106, 198

predict method 204, 289

prepositional object (pobj) 142

prior probability 58

probabilistic classifier 53

probability estimation 21, 136–137

processing data 87–95

morphological processing 90–95

removing stopwords 87–90

proper nouns 130

PROPN tag 130, 134

PUNCT (punctuation marks) 134

pyLDAvis 377

Pythagorean theorem 11

Q

question answering 116

R

random functionality 330

random_state parameter 204

rank 110

recall 198

reciprocal rank 109

re module 39

Resnik, Philip 355

S

Scikit-learn machine-learning pipeline 289–292

search algorithm 73, 103–111

deploying 111

document similarity retrieval 104–105

evaluation of results 106–111

sentences, length statistics 201–207

sentiment analysis 229–303

addressing dependence on context with machine learning 277–295

evaluating with cross-validation 292–295

extracting features from text 284–289

preparing data 278–284

Scikit-learn machine-learning pipeline 289–292

aggregating sentiment scores with sentiment lexicon 251–259

collecting sentiment scores from lexicon 252–255

detecting review polarity 255–259

negation handling for 298–301

setting up pipeline 239–251

analyzing data 243–251

data loading and preprocessing 240–242

task 234–238

aggregating sentiment score with help of lexicon 235–237

learning to detect sentiment in data-driven way 237–238

use cases 231–234

varying length of sentiment-bearing features 295–297

with SentiWordNet 266–276

sentiment lexicon-based approach 235

sentiment lexicons 264

SentiWordNet 266–276

sequence labeling 392–403

BIO scheme 393–394

sequential solution for NER 397–403

sequential tasks 395–397

show_topic functionality 375

simple heuristic algorithm 252

simple_preprocess functionality 365

singular value decomposition (SVD) 332

sklearn’s function 167

smoothing 318

SMT (statistical machine translation) 28

SnowballStemmer algorithm 364

sorting algorithm 73

spaCy

dependency parsing with 139–144

named entity types exploration with 406–410

part-of-speech tagging with 128–137

spaCy’s functionality 209

spacy.load command 131

spam 37

spam class 43

spam filtering 25, 31–70

deploying spam filter in practice 65–66

implementing spam filter 46–65

defining data and classes 46–49

evaluating classifier 62–65

extracting and normalizing features 50–52

splitting text into words 49–50

training classifier 53–61

overview 31–35

tasks 36–46

defining data and classes 37

evaluating classifier 45–46

extracting and normalizing features 42–43

splitting text into words 37–42

training classifier 43–44

spam filters 25

Speech and Language Processing (Jurafsky and Martin) 2, 392

spell-checking 28–29

splitting text

implementing spam filter 49–50

overview 37–42

statistical machine translation (SMT) 28

stemmer tools 18, 93

stemming 92

stopping criteria 329

stopwords 18, 88, 151

count of and proportion of as features 207–211

removing 87–90

stratified data split 166

StratifiedShuffleSplit function 167

stratified shuffling split 166

string module 89

suffixes 219–222

supervised ML (machine learning) 34, 153, 307–325

data 308–312

evaluation of results 320–325

with Naïve Bayes 312–320

SVD (singular value decomposition) 332

synset 267

syntactic parsing 137–144

dependency parsing with spaCy 139–144

sentence structure 137–139

T

term frequency 97–100

terminal (lower) leaves 177

test set 44, 160, 231

text, extracting features from 284–289

text classification 25

text prediction 20–25

TF-IDF (term frequency—inverse document frequency) 314

TfidfVectorizer 314

tf (term frequency) 97

token.i attribute 131

tokenization 42

tokenizer tool 9, 41, 84, 129

token.lemma attribute 131

token.lower attribute 131

token object 131

token.pos attribute 131

token.text attribute 131

topic analysis 304–345

topic classification as supervised ML task 307–325

data 308–312

evaluation of results 320–325

with Naïve Bayes 312–320

topic discovery as unsupervised ML task 325–341

clustering 330–337

evaluation of topic clustering algorithm 338–341

unsupervised approaches 325–329

topic modeling 346–383

implementation of topic modeling algorithm 360–378

applying LDA model 371–374

exploring results 375–378

loading data 361–363

preprocessing data 363–371

with LDA (latent Dirichlet allocation) 349–360

as generative model 356–360

estimating parameters for 352–356

training classifiers

implementing spam filter 53–61

overview 43–44

training set 44, 231

transform function 333

transform method 316

trigram modeling 22

true positives 106

U

unique() function 406

unsupervised ML (machine learning) 325–341

approaches to 325–329

clustering 330–337

evaluation of topic clustering algorithm 338–341

upper bound on algorithm’s performance 173

user profiling

overview 155–157

practical use of 226–227

V

validation set 231

vector array 9

vectors 7

virtual assistants 18–20

visualization, named entities 416–418

V-measure 340

W

WordNet 267

words 103

distribution of suffixes as features 219–222

length statistics as features 201–207

types 124–128

unique words as features 223–226

weighing 96

with inverse document frequency 100–103

with term frequency 97–100

word unigram 21

Z

Zipf’s law 189

zip function 133

 
