Working with Text Data

This is the first of three chapters dedicated to extracting signals for algorithmic trading strategies from text data using natural language processing (NLP) and machine learning (ML).

Text data is rich in content yet unstructured in format, so it requires more preprocessing than structured data before an ML algorithm can extract the potential signal. The key challenge lies in converting text into a numerical format that an algorithm can use while preserving the semantics, or meaning, of the content. We will cover several techniques that capture nuances of language that humans readily understand, so that these nuances can become an input for ML algorithms.

In this chapter, we introduce fundamental feature extraction techniques that focus on individual semantic units; that is, words or short groups of words called tokens. We will show how to represent documents as vectors of token counts by creating a document-term matrix that, in turn, serves as input for text classification and sentiment analysis. We will also introduce the Naive Bayes algorithm, which is popular for this purpose.
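
To make the idea of a document-term matrix concrete, the following is a minimal sketch that uses only the Python standard library. The toy corpus and the helper names `tokenize` and `document_term_matrix` are assumptions made for illustration and do not come from the chapter's repository. Real pipelines typically use a library tokenizer and a sparse matrix rather than whitespace splitting and nested lists.

```python
from collections import Counter

# Hypothetical toy corpus; the headlines are invented for illustration only.
docs = [
    "profits rise as sales rise",
    "sales fall and profits fall",
    "shares rise",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing keys, so absent terms get a zero count.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, dtm = document_term_matrix(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row is one document and each column is one token, so every document becomes a vector of token counts that can be fed to a classifier. Word order is discarded, which is why this representation is often called a bag of words.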

In the following two chapters, we build on these techniques and use ML algorithms such as topic modeling and word-vector embedding to capture information contained in a broader context.

In particular, in this chapter, we will cover the following:

  • What the fundamental NLP workflow looks like
  • How to build a multilingual feature extraction pipeline using spaCy and TextBlob
  • How to perform NLP tasks such as part-of-speech (POS) tagging or named entity recognition
  • How to convert tokens to numbers using the document-term matrix
  • How to classify text using the Naive Bayes model
  • How to perform sentiment analysis

The code samples for the following sections are in the GitHub repository for this chapter, and references are listed in the main README file.