Multi-language NLP

spaCy includes trained language models for English, German, Spanish, Portuguese, French, Italian, and Dutch, as well as a multi-language model for NER. Cross-language usage is straightforward since the API does not change.
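
For instance, the multi-language NER model can be applied to text in several languages through the same API. The following is a minimal sketch, assuming the small multi-language model xx_ent_wiki_sm has been installed (for example, via python -m spacy download xx_ent_wiki_sm):

import spacy

# Load spaCy's multi-language NER model (assumes it has been downloaded beforehand)
nlp_multi = spacy.load('xx_ent_wiki_sm')

# The same .ents API works regardless of the input language
# ('The European Commission is headquartered in Brussels.')
doc = nlp_multi('La Comisión Europea tiene su sede en Bruselas.')
print([(ent.text, ent.label_) for ent in doc.ents])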

We will illustrate the Spanish language model using a parallel corpus of TED Talk subtitles (see the GitHub repo for data source references). For this purpose, we instantiate both language models:

import spacy

model = {}
for language in ['en', 'es']:
    model[language] = spacy.load(language)

We then read small corresponding text samples for each language:

from pathlib import Path

text = {}
path = Path('../data/TED')
for language in ['en', 'es']:
    file_name = path / 'TED2013_sample.{}'.format(language)
    text[language] = file_name.read_text()

Sentence boundary detection uses the same logic but finds a different breakdown:

parsed, sentences = {}, {}
for language in ['en', 'es']:
    parsed[language] = model[language](text[language])
    sentences[language] = list(parsed[language].sents)
    print('Sentences:', language, len(sentences[language]))
Sentences: en 19
Sentences: es 22
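
To see where the two segmentations diverge, we can print the start of the first sentence detected for each language; this is just an optional check using the variables defined above:

for language in ['en', 'es']:
    # Show the beginning of the first detected sentence per language
    print(language, sentences[language][0].text[:60])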

POS tagging also works in the same way:

import pandas as pd

pos = {}
for language in ['en', 'es']:
    pos[language] = pd.DataFrame([[t.text, t.pos_, spacy.explain(t.pos_)]
                                  for t in sentences[language][0]],
                                 columns=['Token', 'POS Tag', 'Meaning'])
pd.concat([pos['en'], pos['es']], axis=1).head()

The result shows the token annotations for the English and Spanish documents side by side:

Token   POS Tag   Meaning                     Token          POS Tag   Meaning
There   ADV       adverb                      Existe         VERB      verb
's      VERB      verb                        una            DET       determiner
a       DET       determiner                  estrecha       ADJ       adjective
tight   ADJ       adjective                   y              CONJ      conjunction
and     CCONJ     coordinating conjunction    sorprendente   ADJ       adjective

The next section illustrates how to use parsed and annotated tokens to build a document-term matrix that can be used for text classification.
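
As a brief preview, one common way to do this (a sketch only, not necessarily the approach taken in the next section) is to join the lemmas of the filtered tokens and pass them to scikit-learn's CountVectorizer; the to_lemmas helper below is hypothetical:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessing: keep lemmas of alphabetic, non-stop-word tokens
def to_lemmas(span):
    return ' '.join(t.lemma_ for t in span if t.is_alpha and not t.is_stop)

# Treat each English sentence as a document and count term frequencies
docs = [to_lemmas(sent) for sent in sentences['en']]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)
print(dtm.shape)  # (number of sentences, vocabulary size)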
