Parsing, tokenizing, and annotating a sentence

The parsed Doc object is iterable, and each token carries numerous attributes produced by the processing pipeline. The following sample illustrates how to access these attributes:

  • .text: Original word text
  • .lemma_: Word root
  • .pos_: Basic POS tag
  • .tag_: Detailed POS tag
  • .dep_: Syntactic relationship or dependency between tokens
  • .shape_: The shape of the word with regard to capitalization, punctuation, and digits
  • .is_alpha: Checks whether the token consists of alphabetic characters
  • .is_stop: Checks whether the token is on a list of common words for the given language

We iterate over each token and collect its attributes in a pd.DataFrame:

pd.DataFrame([[t.text, t.lemma_, t.pos_, t.tag_, t.dep_, t.shape_, t.is_alpha, t.is_stop]
              for t in doc],
             columns=['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop'])

This produces the following output:

text     lemma    pos    tag  dep       shape  is_alpha  is_stop
Apple    apple    PROPN  NNP  nsubj     Xxxxx  TRUE      FALSE
is       be       VERB   VBZ  aux       xx     TRUE      TRUE
looking  look     VERB   VBG  ROOT      xxxx   TRUE      FALSE
at       at       ADP    IN   prep      xx     TRUE      TRUE
buying   buy      VERB   VBG  pcomp     xxxx   TRUE      FALSE
U.K.     u.k.     PROPN  NNP  compound  X.X.   FALSE     FALSE
startup  startup  NOUN   NN   dobj      xxxx   TRUE      FALSE
for      for      ADP    IN   prep      xxx    TRUE      TRUE
$        $        SYM    $    quantmod  $      FALSE     FALSE
1        1        NUM    CD   compound  d      FALSE     FALSE
billion  billion  NUM    CD   pobj      xxxx   TRUE      FALSE

We can visualize the syntactic dependencies in a browser or Jupyter notebook using the following:

displacy.render(doc, style='dep', options=options, jupyter=True)

The result is a dependency tree:

Dependency tree

We can get additional insights into the meaning of attributes using spacy.explain(), as here:

spacy.explain("VBZ")
'verb, 3rd person singular present'
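The same lookup works for POS tags, detailed tags, and dependency labels alike; it is a plain glossary lookup and needs no loaded model. A quick sketch over a few labels from the table above:

```python
import spacy

# spacy.explain returns a human-readable gloss for an annotation label,
# or None if the label is unknown.
for label in ['PROPN', 'NNP', 'nsubj', 'pcomp']:
    print(label, '->', spacy.explain(label))
```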