The NLP workflow

A key goal in applying ML to text data for algorithmic trading is to extract signals from documents. A document is an individual sample from a relevant text data source, such as a company report, a headline or news article, or a tweet. A corpus (plural: corpora), in turn, is a collection of documents.

The following diagram lays out the key steps to convert documents into a dataset that can be used to train a supervised ML algorithm capable of making actionable predictions:

Fundamental techniques extract text features as semantic units called tokens, and use linguistic rules and dictionaries to enrich these tokens with linguistic and semantic annotations. The bag-of-words (BoW) model uses token frequency to represent each document as a vector of token counts. Stacking these vectors yields the document-term matrix, which is frequently used for text classification.
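
As a minimal sketch of the bag-of-words idea, the following builds a document-term matrix using only the Python standard library. The two-document corpus and the whitespace `tokenize` helper are illustrative assumptions, not part of any library; a real pipeline would use a proper tokenizer and typically a sparse matrix.

```python
from collections import Counter

# Hypothetical toy corpus; each string stands in for one document.
corpus = [
    "profits rise as sales rise",
    "sales fall and profits fall",
]


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


# One Counter of token frequencies per document.
counts = [Counter(tokenize(doc)) for doc in corpus]

# The sorted vocabulary fixes the column order of the matrix.
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are tokens, entries are token counts.
dtm = [[doc_counts[token] for token in vocabulary] for doc_counts in counts]

for row in dtm:
    print(row)
```

Each row is one document's token vector. Word order is discarded, which is the defining simplification of the BoW model.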

Advanced approaches use ML to refine features extracted by these fundamental techniques and produce more informative document models. These include topic models that reflect the joint usage of tokens across documents and word-vector models that capture the context of token usage.
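
To make "context of token usage" concrete, the sketch below counts which tokens appear within a fixed window around each token. This is not a word-vector model; it only illustrates the kind of context-window statistics that such models learn from. The function name `cooccurrence_counts`, the `window` parameter, and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered (token, context token) pairs within `window` positions."""
    pairs = Counter()
    for i, token in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs[(token, tokens[j])] += 1
    return pairs


# Hypothetical sample sentence, split on whitespace.
tokens = "the bank raised rates while the bank cut costs".split()
pairs = cooccurrence_counts(tokens, window=1)

# Context tokens observed directly next to "bank".
print({ctx: n for (tok, ctx), n in pairs.items() if tok == "bank"})
```

Tokens that share similar context counts tend to be used in similar ways, which is the intuition that word-vector models exploit.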

We will review key decisions made at each step and related trade-offs in more detail before illustrating their implementation using the spaCy library in the next section. The following table summarizes the key tasks of an NLP pipeline:

| Feature | Description |
| --- | --- |
| Tokenization | Segments text into words, punctuation marks, and so on. |
| POS tagging | Assigns word types to tokens, such as a verb or noun. |
| Dependency parsing | Labels syntactic token dependencies, such as subject <=> object. |
| Stemming and lemmatization | Assigns the base forms of words: was => be, rats => rat. |
| Sentence boundary detection | Finds and segments individual sentences. |
| Named entity recognition | Labels real-world objects, such as people, companies, and locations. |
| Similarity | Evaluates the similarity of words, text spans, and documents. |
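
Three of the tasks above (sentence boundary detection, tokenization, and similarity) can be approximated in a few lines of standard-library Python. The sketch below is a deliberately naive stand-in rather than how spaCy or any other NLP library implements them. The helper names and the sample text are assumptions, and the regex-based splitter would fail on abbreviations such as "U.S.".

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence boundary detection: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Naive tokenization: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the token-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Hypothetical sample text.
text = "Apple beat estimates. Shares rose 3% after the report!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

print(sentences)
print(tokens[0])
print(cosine_similarity(tokens[0], tokens[1]))
```

The remaining tasks in the table (POS tagging, dependency parsing, lemmatization, and named entity recognition) generally rely on trained models or curated dictionaries, so they are not sketched here.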
