
6. Text Data


The economics and finance disciplines have generally been reluctant to integrate unstructured forms of data. One exception is text, which has been applied to a wide variety of empirical problems. This may have arisen, in part, as a consequence of early successful applications in economics, such as Romer and Romer (2004), which demonstrated the empirical value of measuring internal central bank narratives.

The more widespread adoption of text may also be attributable to its many natural applications within economics and finance. It can, for instance, be used to extract latent variables, such as economic policy uncertainty from newspapers,1 consumer inflation expectations from social media content (Angelico et al. 2018), and central bank and private firm sentiment from announcements and filings.2 It can also be used to predict bank distress (Cerchiello et al. 2017), measure the impact of news media on the business cycle (Chahrour et al. 2019), identify descriptions of fraud in consumer financial complaints (Bertsch et al. 2020), analyze financial stability (Born et al. 2013; Correa et al. 2020), forecast economic variables (Hollrah et al. 2018; Kalamara et al. 2020), and study central bank decision-making.3

The focus on textual data in economics gained renewed emphasis when Robert Shiller gave a presidential address to the American Economic Association entitled “Narrative Economics” (Shiller 2017). He argued that academic work in economics and finance has failed to account for the rise and decline of popular narratives, which have the capacity to drive macroeconomic and financial fluctuations, even if the narratives themselves are wrong. He then suggested that the discipline should begin the long project of correcting this deficiency through the exploration of text-based datasets and methods.

This chapter will discuss how text can be prepared and applied in the context of economics and finance. Throughout, we’ll use TensorFlow for modeling purposes, but will also make use of the Natural Language Toolkit (NLTK) to pre-process the data. We will also frequently refer to and use conventions from Gentzkow et al. (2019), which provides a comprehensive overview of many text analysis topics in economics and finance.

Data Cleaning and Preparation

The first step in any text analysis project is to clean and prepare the data. If, for instance, we want to use newspaper articles about a company to forecast its stock market performance, we’ll need to start by assembling a collection or “corpus” of newspaper articles and then converting the text in those articles to a numerical format.

The way in which we convert from text to numbers will determine what types of analysis we can perform. For this reason, the data cleaning and preparation step will be an important part of the pipeline for any such project. We will cover it in this subsection, focusing on its implementation using the Natural Language Toolkit (NLTK).

We’ll start by installing NLTK. We’ll then import it and download its models and datasets. You can use nltk.download('book') to download book-related data, nltk.download('popular') to download the most popular packages, or nltk.download('all') to download all available datasets and models, which is what we do in Listing 6-1.
# Install nltk.
!pip install nltk
# Import nltk.
import nltk
# Download all datasets and models
nltk.download('all')
Listing 6-1

Install, import, and prepare NLTK

Now that we’ve installed NLTK and have downloaded all of the datasets and models, we can make use of its basic data cleaning and preparation tools. Before we can do that, though, we’ll need to prepare a dataset and introduce some notation.

Collecting the Data

The data we’ll use comes from US Securities and Exchange Commission (SEC) filings, which are available through its online system, EDGAR.4 The EDGAR interface, shown in Figure 6-1, allows users to perform a variety of queries. We’ll first pull up the interface for company filings. Here, we can search for documents by company name or specify search parameters that will return documents for all companies that fit those criteria. Let’s assume that we want to create a project to monitor SEC filings about the metal mining industry. In that case, we’ll search by standard industrial classification (SIC) code.
Figure 6-1. The EDGAR search interface for company filings. Source: SEC.gov

Pulling up the SEC’s list of SIC codes, we can see that metal mining has been assigned the code 1000 and falls under the responsibility of the Office of Energy and Transportation, as is shown in Figure 6-2. We can now search for all filings by companies with the 1000 SIC code, yielding the results given in Figure 6-3. Each page lists companies, the state or country associated with the filing, and the Central Index Key (CIK), which can be used to identify a filing individual or corporation.

In our case, we’ll select the filings for “Americas Gold and Silver Corp,” which you can locate by searching for 0001286973 in the CIK field. From there, we’ll look at the text of Exhibit 99.1 from the 6-K financial filing on 2020-05-15. We show the title and some text from this filing in Figure 6-4.
Figure 6-2. A partial list of SIC classification codes. Source: SEC.gov

As we can see in Figure 6-4, the filing corresponds to the first quarter of 2020 and appears to contain information about the company that could be useful for assessing its value. We can see, for instance, that there is information about the firm’s acquisitions. It also discusses mining production plans at specific sites. Now that we know how to retrieve filing information from the EDGAR system and have identified a specific filing of interest, we’ll introduce notation to describe such textual information. We’ll then return to the cleaning and preparation tasks in NLTK.
Figure 6-3. A partial list of metal mining company search results. Source: SEC.gov

Figure 6-4. A partial 6-K financial filing for a metal mining company. Source: SEC.gov

Text Data Notation

The notation we’ll use follows Gentzkow et al. (2019). We’ll let D denote a collection of N documents, or a “corpus.” C will denote a numerical array, which contains observations on K features for each document, Dj ∈ D. In some cases, we’ll predict outcomes, V, using C, or we’ll use fitted values, $$ \hat{V} $$, in a two-step causal inference problem.

Before we can apply NLTK to clean and prepare the data, we have to answer the following two questions:
  1. What is D?

  2. What features of D should be embodied in C?

If we’re working with only one 6-K filing, then Dj might be a paragraph or sentence in that filing. Alternatively, if we have many 6-K filings, then Dj is likely to represent a single filing. To fix an example, we’ll assume that D is the collection of sentences in a single 6-K filing – namely, the one we discussed earlier.

What, then, is C? It depends on the features or “tokens” we wish to extract from each sentence of the filing. In many cases, we’ll use word counts as features; and we’ll do that in this example too. The expression for C, which is commonly referred to as the “document-feature” or “document-term” matrix, is given in Equation 6-1.

Equation 6-1. Document-feature matrix.
$$ C = \left(\begin{array}{ccc} c_{11} & \cdots & c_{1k} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nk} \end{array}\right) $$

Each element, cij, is the frequency with which word j appears in sentence i. A natural question we might ask is: which words should be included in the matrix? Should we include all words in a given dictionary? Or should we restrict it to words that appear at least once in the corpus?
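To make Equation 6-1 concrete, the short sketch below constructs word counts for a toy two-sentence corpus using Python’s collections.Counter; the sentences are invented purely for illustration.
from collections import Counter
# Define a toy corpus of two cleaned "documents."
docs = ['gold production rose', 'silver production fell sharply']
# Construct the vocabulary (the K features).
vocab = sorted(set(" ".join(docs).split()))
# Count word frequencies within each document.
doc_counts = [Counter(d.split()) for d in docs]
# Assemble the document-term matrix.
C_toy = [[dc[w] for w in vocab] for dc in doc_counts]
# Print the vocabulary.
print(vocab)
['fell', 'gold', 'production', 'rose', 'sharply', 'silver']
# Print the document-term matrix.
print(C_toy)
[[0, 1, 1, 1, 0, 0], [1, 0, 1, 0, 1, 1]]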

Data Preparation

In practice, we’ll select a maximum number of words, K, based on some filtering criteria. In addition to this, we’ll also usually remove all non-word symbols, such as numbers and punctuation, during the cleaning and data preparation process. This will typically consist of four steps, which we outline as follows and then implement in an example using NLTK:
  1. Convert to lowercase: Text data is inherently high dimensional, which will force us to use dimensionality reduction strategies wherever possible. One simple way in which we can do this is to ignore capitalization. Instead of treating “gold” and “Gold” as separate features, we’ll convert all characters to lowercase and treat them as the same word.

  2. Remove stop words and rare words: Many words do not contain meaningful content, such as articles, conjunctions, and prepositions. For this reason, we will often compile a list of “stop words,” which will be removed from texts during the cleaning process. If our C matrix consists of word counts, knowing how many times the words “the” and “and” were used will not tell us much about our topic of interest. Similarly, when we construct the document-term matrix, we will often exclude rare words, which do not appear frequently enough to allow a model to discern their meaning.

  3. Stem or lemmatize: The need to reduce data dimensionality further will often lead us to perform “stemming” or “lemmatization.” Stemming entails converting a word to its stem. That is, we might map the verb “running” to “run.” Since many words will map to the same stem, this will reduce the dimensionality of the problem, just as converting to lowercase letters did. Reducing a word to its stem may result in a non-word, which could be undesirable when the objective of a project is to yield interpretable outputs. In this case, we will want to consider using lemmatization instead, which also maps many words to one, but uses the “base” or “dictionary” form of the word, rather than a stem.

  4. Remove non-word elements: In most problems we’ll encounter in economics and finance, it will not be possible to make use of punctuation, numbers, and special characters and symbols. For this reason, we will discard them, rather than including them in the document-term matrix.
We’ll now step through these cleaning and preparation steps in NLTK. For the sake of completeness, we’ll start by downloading the 6-K filing from SEC’s website using urllib and BeautifulSoup in Listing 6-2. Understanding these libraries will not be necessary for understanding the remainder of the chapter.
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Define url string.
url = 'https://www.sec.gov/Archives/edgar/data/1286973/000156459020025868/d934487dex991.htm'
# Send GET request.
html = urlopen(url)
# Parse HTML tree.
soup = BeautifulSoup(html.read())
# Identify all paragraphs.
paragraphs = soup.findAll('p')
# Create list of the text attributes of paragraphs.
paragraphs = [p.text for p in paragraphs]
Listing 6-2

Download HTML and extract text

To briefly explain the content of Listing 6-2, we first imported two objects: urlopen from urllib.request and BeautifulSoup from bs4. The urlopen function allowed us to send GET requests, which is a way of requesting a file from a server. In this case, we requested the HTML document located at the specified url. We then used BeautifulSoup to create a parse tree from the HTML, so that we could make use of its structure, searching it by tag. Next, we searched for all instances of the “p” or paragraph tag. Finally, using a list comprehension, we stepped through each instance, returning its text attribute, and collected the results in a list of strings.

Recall that we decided to use sentences, rather than paragraphs, as our units of analysis. This means we’ll need to join the paragraphs together into a single string and then determine how to identify sentences within that string. We’ll start by merging and printing the paragraphs in Listing 6-3.
# Join paragraphs into single string.
corpus = " ".join(paragraphs)
# Print contents.
print(corpus)
Darren Blasutti VP, Corporate Development & Communications President and CEO Americas Gold and Silver Corporation Americas Gold and Silver Corporation 416-874-1708 Cautionary Statement on Forward-Looking Information: This news release contains "forward-looking information" within       the meaning of applicable securities laws. Forward-looking information includes,   ...
Listing 6-3

Join paragraphs into single string

Upon printing the corpus, we can see that it requires cleaning. It contains punctuation, stop words, line breaks, and special characters, all of which will need to be removed before computing the document-feature matrix. Now, we might be tempted to start with the cleaning step, but doing so would remove indicators of what constitutes a sentence in the text. For this reason, we’ll first split the text into sentences.

While we could write a function to perform the splitting based on the location of punctuation, this is a solved problem in natural language processing and is implemented in the NLTK toolbox. In Listing 6-4, we import NLTK, instantiate a “sentence tokenizer,” which splits a text into individual sentences, and then apply it to the corpus we constructed in the previous step.
import nltk
# Instantiate sentence tokenizer.
sentTokenizer = nltk.sent_tokenize
# Identify sentences.
sentences = sentTokenizer(corpus)
# Print the number of sentences.
print(len(sentences))
50
# Print a sentence.
print(sentences[7])
The Company continues to target commercial production by late Q2-2020 or early Q3-2020 and will be providing more regular updates regarding the operation between now and then.
Listing 6-4

Tokenize text into sentences using NLTK

The next step is to perform the previously discussed cleaning tasks. While it will generally make sense to define a single function for this purpose, we’ll divide it into three steps for the sake of clarity. We’ll start by converting all characters to lowercase and removing stop words in Listing 6-5. For now, we will leave rare words in the corpus.
from nltk.corpus import stopwords
# Convert all characters to lowercase.
sentences = [s.lower() for s in sentences]
# Define stop words as a set.
stops = set(stopwords.words('english'))
# Instantiate word tokenizer.
wordTokenizer = nltk.word_tokenize
# Divide corpus into list of lists.
words = [wordTokenizer(s) for s in sentences]
# Remove stop words.
for j in range(len(words)):
        words[j] = [w for w in words[j] if
        w not in stops]
# Print first five words in first sentence.
print(words[0][:5])
['americas', 'gold', 'silver', 'corporation', 'reports']
Listing 6-5

Convert characters to lowercase and remove stop words

In the next step, we’ll apply a stemmer to reduce the dimensionality of the dataset by collapsing each word into its stem. In Listing 6-6, we import the Porter stemmer (Porter 1980), instantiate it, and then apply it to each word in each sentence. We again print the first five words in the first sentence. We can see that the stemmer mapped “corporation” to “corpor” and “reports” to “report.” Recall that a word stem will not always be a word.
from nltk.stem.porter import PorterStemmer
# Instantiate Porter stemmer.
stemmer = PorterStemmer()
# Apply Porter stemmer.
for j in range(len(words)):
        words[j] = [stemmer.stem(w) for w in words[j]]
# Print first five words in first sentence.
print(words[0][:5])
['america', 'gold', 'silver', 'corpor', 'report']
Listing 6-6

Replace words with their stems

The last step in the cleaning process is to remove special characters, punctuation, and numbers. We’ll do this using regular expressions, which are commonly referred to as “regexes.” A regular expression is a short string that encodes a pattern that can be identified in texts. In our case, the string is [^a-z]+. The brackets enclose a set of characters – here, the range a-z, which covers all lowercase letters. We use the caret symbol, ^, to negate this set, indicating that the regex should only match characters not contained in it. This, of course, includes special symbols, punctuation, and numbers. Finally, the + symbol indicates that the pattern matches one or more such characters in a sequence.
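To see what the pattern matches in isolation, consider the following minimal check, which applies the regex to a made-up string.
import re
# Substitute an empty string for any run of characters outside a-z.
print(re.sub('[^a-z]+', '', 'q2-2020 results!'))
qresults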

Listing 6-7 implements this final step in the cleaning process. We first import the library, re, which is used to implement regular expressions. Next, we iterate through each word in each sentence and substitute an empty string for any pattern matches. This leaves us with a list of sentences, each broken down into a list of words. Since the process will have left some empty strings, we’ll rejoin the words in each sentence. We’ll also remove any white space at the start and end of the sentence.
import re
# Remove special characters, punctuation, and numbers.
for j in range(len(words)):
        words[j] = [re.sub('[^a-z]+', '', w)
        for w in words[j]]
# Rejoin words into sentences.
for j in range(len(words)):
        words[j] = " ".join(words[j]).strip()
# Print sentence.
print(words[7])
compani continu target commerci product late q earli q provid regular updat regard oper
Listing 6-7

Remove special characters and join words into sentences

Printing the same sentence once again, we can see that it now looks quite different from its original form. Rather than a sentence, it looks like a collection of word stems. Indeed, in the following section, we will apply a form of text analysis that treats documents as a collection of words and ignores the order in which they appear. This is often referred to as the “bag-of-words” model.

The Bag-of-Words Model

In the previous section, we suggested that one possible construction of the document-term (DT) matrix, C, would use word counts as features. This representation would not allow us to account for grammar or word order, but it would permit us to capture word frequency. There are many problems in economics and finance in which we will be able to achieve our objective under such constraints.

The model we’ve described is called the “bag-of-words” (BoW) model, which was introduced in the information retrieval literature by Salton and McGill (1983). The term bag-of-words appears to have originated in a linguistic context in Harris (1954):

we build a stock of utterances each of which is a particular combination of particular elements. And this stock of combinations of elements becomes a factor … for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use.

In this section, we’ll see how to construct a BoW model, starting with the cleaned and prepared data from the previous section. In addition to NLTK, we’ll also use submodules from sklearn to construct the DT matrix. While there are routines to perform such tasks in NLTK, they are not part of the core module and are generally less efficient.

Recall that words contains the 50 sentences we extracted from a 6-K filing for a metal mining company. We’ll use this list of lists to construct the document-term matrix in Listing 6-8, where we start by importing text from sklearn.feature_extraction. We’ll then instantiate a CountVectorizer(), which will compute the frequency of words in each sentence and then construct the C matrix based on some constraints, which can be supplied as parameters. For the sake of illustration, we’ll set max_features to 10. This will constrain the maximum number of columns in the document-term matrix to be no higher than 10.

Next, we’ll apply fit_transform() to words, transforming it into a document-term matrix, C. Since C will be large for many problems, sklearn saves it as a sparse matrix. You can convert it to an array using the toarray() method. We can also apply the get_feature_names() method of vectorizer to recover the terms that correspond to each of the columns.
from sklearn.feature_extraction import text
# Instantiate vectorizer.
vectorizer = text.CountVectorizer(max_features = 10)
# Construct C matrix.
C = vectorizer.fit_transform(words)
# Print document-term matrix.
print(C.toarray())
[[3 1 0 2 0 0 1 0 2 2]
 [1 2 0 1 0 0 0 0 0 1]
        ...
        ...
        ...
 [0 1 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 1 1 0]]
# Print feature names.
print(vectorizer.get_feature_names())
['america', 'compani', 'cost', 'gold', 'includ',
'inform', 'oper', 'product', 'result', 'silver']
Listing 6-8

Construct the document-term matrix

Printing the document-term matrix and feature names, we can see that we recovered counts for ten different features. While this was useful for the sake of illustration, we will typically want to use considerably more features in actual applications; however, allowing more features may result in the inclusion of less useful features, which will necessitate the use of filtering.

Sklearn provides us with two additional parameters we can use to perform filtering: max_df and min_df. The max_df parameter determines the maximum number or proportion of documents that a term may appear in before it is removed from the document-term matrix. Similarly, the minimum threshold is given by min_df. In both cases, specifying an integer value, such as 3, indicates a document count, whereas specifying a float, such as 0.25, indicates a proportion of documents.
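For instance, the following sketch (with illustrative thresholds) mixes the two conventions, dropping terms that appear in fewer than three documents or in more than 25% of them.
# Instantiate vectorizer with an integer minimum document count
# and a float maximum document share (illustrative thresholds).
vectorizer = text.CountVectorizer(
        max_features = 1000,
        max_df = 0.25,
        min_df = 3
)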

The value of specifying a maximum threshold is that it will remove all terms that appear too frequently to provide meaningful variation. If, for instance, a term appears in more than 50% of documents, we may want to remove it by specifying a max_df of 0.50. In Listing 6-9, we compute the document-term matrix again, but this time allow for up to 1000 terms and also apply filtering to remove terms that appear in either more than 50% or fewer than 5% of documents.

If we print the shape of the C matrix, we can see that it does not appear that the document-term matrix was constrained by the maximum feature limit of 1000, since only 109 feature columns were returned. This may have been a consequence of our selection of maximum document frequency and minimum document frequency parameters, which eliminated terms that were unlikely to be useful for our purposes.

Another way in which we can perform filtering is to use the term-frequency inverse-document frequency (tf-idf) metric, which is shown in Equation 6-2.

Equation 6-2. Computing the term-frequency inverse-document frequency for column j.
$$ tfidf_j = \sum_i c_{ij} \ast \log\left(\frac{N}{\sum_i 1_{\left[c_{ij} > 0\right]}}\right) $$
The tf-idf is computed for each feature, j, in the document-term matrix, C. It consists of the product of two components: (1) the frequency with which term j appears across all documents in the corpus, $$ \sum_i c_{ij} $$, and (2) the natural logarithm of the document count divided by the number of documents in which term j appears at least once, $$ N / \sum_i 1_{\left[c_{ij} > 0\right]} $$. The tf-idf metric is increasing in the number of times j appears across the entire corpus and decreasing in the share of documents in which j appears. If j isn’t used frequently or is used in too many documents, the tf-idf score will be low.
# Instantiate vectorizer.
vectorizer = text.CountVectorizer(
        max_features = 1000,
        max_df = 0.50,
        min_df = 0.05
)
# Construct C matrix.
C = vectorizer.fit_transform(words)
# Print shape of C matrix.
print(C.toarray().shape)
(50, 109)
# Print terms.
print(vectorizer.get_feature_names()[:10])
['abil', 'activ', 'actual', 'affect', 'allin', 'also', 'america', 'anticip', 'approxim', 'avail']
Listing 6-9

Adjust the parameters of CountVectorizer()
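Before turning to TfidfVectorizer(), note that we can also compute the raw tf-idf scores in Equation 6-2 directly from the count matrix we just constructed. The following is a minimal sketch, assuming C is the sparse matrix from Listing 6-9.
import numpy as np
# Convert the sparse document-term matrix to a dense array.
count_array = C.toarray()
# Compute the number of documents (sentences).
nDocs = count_array.shape[0]
# Compute term frequencies: total counts across all documents.
tf = count_array.sum(axis=0)
# Compute document frequencies: documents in which each term appears.
df = (count_array > 0).sum(axis=0)
# Compute tf-idf scores as in Equation 6-2.
tfidf = tf * np.log(nDocs / df)
# Print the scores for the first ten terms.
print(tfidf[:10])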

In Listing 6-10, we repeat the same steps as in Listing 6-8, but we use a TfidfVectorizer(), rather than CountVectorizer(). This allows us to access the idf_ attribute, which contains the inverse document frequency scores. We can then optionally perform filtering by dropping columns with a tf-idf score below a certain threshold.
# Instantiate vectorizer.
vectorizer = text.TfidfVectorizer(max_features = 10)
# Construct C matrix.
C = vectorizer.fit_transform(words)
# Print inverse document frequencies.
print(vectorizer.idf_)
[2.36687628 1.84078318 3.14006616 2.2927683  2.44691898 2.22377543 1.8873032  2.22377543 2.22377543 2.2927683 ]
Listing 6-10

Compute inverse document frequencies for all columns

In some applications, we may want to use several words in a sequence (n-grams) – rather than individual words (unigrams) – as our features. We can do this by setting the ngram_range parameter of TfidfVectorizer() or CountVectorizer(). In Listing 6-11, we set the parameter to (2, 2), which means we only permit two-word sequences (bigrams). Note that the first value in the tuple is the minimum n-gram length and the second value is the maximum. We can see that the set of feature names returned is now different from the unigrams we generated in Listing 6-9.

In general, applying the bag-of-words model and computing a document-term matrix will be only the first step in a natural language processing project; however, it should be straightforward to see how such a matrix could be combined with standard tools from econometrics to perform analysis. If, for instance, we had a dependent variable associated with each document, such as stock returns for a firm on the same days as SEC filings, we could combine the two to train a predictive model or to test a hypothesis.
# Instantiate vectorizer.
vectorizer = text.TfidfVectorizer(
        max_features = 10,
        ngram_range = (2,2)
)
# Construct C matrix.
C = vectorizer.fit_transform(words)
# Print feature names.
print(vectorizer.get_feature_names())
['america gold', 'cosal oper', 'forwardlook inform', 'galena complex', 'gold silver', 'illeg blockad', 'oper result', 'recapit plan', 'relief canyon', 'silver corpor']
Listing 6-11

Compute the document-term matrix for unigrams and bigrams

Dictionary-Based Methods

In the previous sections, we cleaned and prepared data and then explored it using the bag-of-words model. This yielded an N x K document-term matrix, C, which consisted of word counts for each document. We filtered certain words out of the document-term matrix, but otherwise remained agnostic about what features we wished to find in the text.

An alternative to this approach is to use a pre-selected “dictionary” of words, which is constructed to capture some latent feature in the text. Such approaches are often referred to as “dictionary-based methods” and are the most commonly used form of text analysis in economics.

An early application of dictionary-based methods in economics made use of latent “sentiment” in Wall Street Journal articles to study the relationship between news and stock market performance (Tetlock 2007). Later work, such as Loughran and McDonald (2011) and Apel and Blix Grimaldi (2014), introduced dictionaries that were designed to measure specific latent variables, which led to their widespread use in the literature. Loughran and McDonald (2011) introduced a dictionary for 10-K financial filings, which was ultimately used to measure negative and positive sentiment in many contexts. Apel and Blix Grimaldi (2014) introduced a dictionary that measured “hawkishness” and “dovishness” in central bank communication.

Gentzkow et al. (2019) argue that economics and the social sciences should expand the set of tools they use for performing text analysis. Rather than using dictionary-based methods as a default choice, they should instead only be considered when the following two criteria are satisfied:
  1. The prior information you have about the latent variable and how it is represented in text is strong and reliable.

  2. The information about the latent variable in the text is weak and diffuse.

An ideal example of this is the Economic Policy Uncertainty (EPU) index introduced by Baker et al. (2016). The latent variable they wanted to measure was a theoretical object, which they captured in text by identifying the joint use of words that referred to the economy, policy, and uncertainty. Without specifying a dictionary for such an object, it is unlikely that it would emerge from a model as a common feature or topic. Additionally, having specified a dictionary, they demonstrated that it captured the underlying theoretical object by comparing EPU index scores with human ratings of the same newspaper articles.

Since dictionary-based methods are simple to implement and do not require the use of TensorFlow, we’ll demonstrate how they work with a single example involving the Loughran-McDonald (LM) dictionary. We’ll start by using pandas to load the LM dictionary in Listing 6-12.5 We’ll use the read_excel function from pandas and will specify the file path and the sheet name. Note that we’ve specified the “Positive” sheet, since we will exclusively make use of the dictionary of positive words in this example.
import pandas as pd
# Define data directory path.
data_path = '../data/chapter6/'
# Load the Loughran-McDonald dictionary.
lm = pd.read_excel(data_path+'LM2018.xlsx',
        sheet_name = 'Positive',
        header = None)
# Convert series to DataFrame.
lm = pd.DataFrame(lm.values, columns = ['Positive'])
# Convert to lower case.
lm = lm['Positive'].apply(lambda x: x.lower())
# Convert DataFrame to list.
lm = lm.tolist()
# Print list of positive words.
print(lm)
['able',
 'abundance',
 'abundant',
         ...
 'innovator',
        ...
 'winners',
 'winning',
 'worthy']
Listing 6-12

Load the Loughran-McDonald dictionary of positive words

Next, we’ll convert the pandas Series into a DataFrame and use the column header “Positive” for the dictionary. We’ll then use a lambda function to convert all of the words to lowercase, since they are uppercase in the LM dictionary. Finally, we’ll convert the DataFrame to a list object and then print. Looking at the last three terms, we can see that two of them – winners and winning – are likely to have the same word stem.

In general, we will typically want to either stem both the dictionary and the corpus or stem neither. Since we have already stemmed the corpus – namely, the sentences from a 6-K filing – we’ll stem the dictionary too, dropping duplicate stems in the process. We do this in Listing 6-13.
from nltk.stem.porter import PorterStemmer
# Instantiate Porter stemmer.
stemmer = PorterStemmer()
# Apply Porter stemmer.
slm = [stemmer.stem(word) for word in lm]
# Print length of list.
print(len(slm))
354
# Drop duplicates by converting to set.
slm = list(set(slm))
# Print length of list.
print(len(slm))
151
Listing 6-13

Stem the LM dictionary

Following the steps we took earlier in the chapter, we’ll first instantiate a Porter stemmer and then apply it to each word in the dictionary using a list comprehension. The original list contains 354 words. If we then convert that list to a set, this will drop duplicate stems, reducing the number of dictionary terms to 151.

The next step is to take the words list, which contains the 50 sentences we extracted from a document, and count the instances of positive word stems. Recall that we cleaned and stemmed each of the words in a sentence – and then stored them as strings. We’ll need to iterate through each string, counting the number of times each of the positive words appears. We’ll do this in Listing 6-14.
# Define empty array to hold counts.
counts = []
# Iterate through all sentences.
for w in words:
        # Set initial count to 0.
        count = 0
        # Iterate over all dictionary words.
        for i in slm:
                count += w.count(i)
        # Append counts.
        counts.append(count)
Listing 6-14

Count positive words

In Listing 6-14, we started by defining an empty list to hold the counts. We then iterated over all strings that are contained in the words list in the outer loop. Whenever we started a new sentence, we set the positive word count to 0. We then stepped through the inner loop, which iterates over all words in the stemmed LM dictionary, counting the number of times they appear in the string and adding that to the total. We appended the total for each sentence to counts.

Figure 6-5 shows a histogram of the positive word counts. We can see that most sentences have none, whereas one sentence has more than ten. If we were to perform this analysis at the document level, as we typically will, we would most likely find a non-zero value for most 6-K filings.
Figure 6-5. The distribution of positive word counts across sentences in a 6-K filing

In principle, we could take our positivity counts and include them as a feature in a regression. In practice, however, we will typically use a transformation of the count variable that has a more natural interpretation. If we did not have zero counts, we might use the natural logarithm of the count, allowing us to interpret the estimated effect as the impact on the percentage change in positivity. Alternatively, we could use the ratio of positive words to all words.

Finally, in economics and finance applications, it is common to use a net index, combining both positivity and negativity or “hawkishness” and “dovishness,” as is shown in Equation 6-3. Often, we will take the difference between the positive and negative word counts and then divide by a normalization factor. This factor may be the total word count for the document or the sum of the positive and negative terms.

Equation 6-3. Net positivity index.
$$ net\ positivity = \frac{positivity - negativity}{normalization\ factor} $$
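As a minimal sketch of Equation 6-3, suppose we had also counted negative words for each sentence. We only computed positive counts in Listing 6-14, so negCounts below is a hypothetical placeholder; in practice, it would be computed in the same way using the “Negative” sheet of the LM dictionary.
import numpy as np
# Positive counts from Listing 6-14.
posCounts = np.array(counts)
# Hypothetical negative counts (placeholder values).
negCounts = np.zeros_like(posCounts)
# Compute net positivity, normalizing by the total count of positive
# and negative words; a small constant avoids division by zero.
netPositivity = (posCounts - negCounts) / (posCounts + negCounts + 1e-8)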

Word Embeddings

So far, we have used one-hot encoding (dummy variables) to construct numerical representations of words. One potential downside to this approach is that we implicitly assume that each pair of words is orthogonal. The words “inflation” and “prices,” for instance, are assumed to have no relationship to each other.

An alternative to using words as features is to instead use embeddings. In contrast to word vectors, which have a high-dimensional, sparse representation, word embeddings use a low-dimensional, dense representation. This dense representation allows us to identify the degree to which words are related.

Figure 6-6 provides a simple comparison of one-hot encoded words and dense word embeddings. The statement “…inflation rose sharply…” – which might appear in a central bank announcement – could be encoded using either approach. If we use the one-hot encoded approach, shown on the left of the diagram, each word will be translated into a sparse, high-dimensional vector. And each such vector will be orthogonal to all others. If, on the other hand, we use embeddings, each word will be associated with a lower-dimensional, dense representation, shown on the right of Figure 6-6. The relationship between two such vectors is measurable and can be captured, for instance, using their inner product. The formula for the inner product of two vectors of dimension n – x and z – is given in Equation 6-4.

Equation 6-4. The inner product of two vectors, x and z.
$$ x^T z = x_0 z_0 + \dots + x_n z_n $$
Figure 6-6. Comparison of one-hot encoding and word embeddings

While the inner product may give us a compact summary of the relationship between the two words, it does not provide more granular information about how two embedding vectors are related. For this, we can directly compare elements in the same position for a pair of vectors. Such elements provide a measurement for the same feature. While we might not be able to identify what the underlying feature is, we know that having similar values in the same position indicates two words are related along that dimension.
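The following minimal sketch illustrates this difference using made-up vectors: one-hot vectors for two distinct words always have an inner product of zero, whereas dense embeddings for related words do not.
import numpy as np
# Define one-hot vectors for "inflation" and "prices" using
# made-up positions in a 10,000-word vocabulary.
inflation_oh = np.zeros(10000)
inflation_oh[42] = 1
prices_oh = np.zeros(10000)
prices_oh[137] = 1
# The inner product of two distinct one-hot vectors is zero.
print(inflation_oh @ prices_oh)
0.0
# Define hypothetical 4-dimensional embedding vectors.
inflation_emb = np.array([0.9, 0.1, -0.3, 0.5])
prices_emb = np.array([0.8, 0.2, -0.1, 0.4])
# The inner product is non-zero (approximately 0.97) and
# reflects the degree to which the two words are related.
print(inflation_emb @ prices_emb)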

In contrast to one-hot encoding, we will need to use some supervised or unsupervised method to train embeddings. Since embeddings need to capture meaning in words and the relationships between words, it will often not make sense to do the training ourselves. Among other things, the embedding layer will need to learn the language in which you are performing your analysis, and the corpus you provide will almost certainly be insufficient for that task.

For this reason, you will often instead use pretrained word embeddings. Common choices include Word2Vec (Mikolov et al. 2013) and Global Vectors for Word Representation (GloVe) (Pennington et al. 2014).

Notice that there is a strong analogy between word embeddings and convolutional layers. With convolutional layers, we said that they included general vision filters. For this reason, it often made sense to use convolutional layers from a model pretrained on millions of images. Additionally, we said that it was possible to “fine-tune” the training of such models to improve local performance on your particular image classification task. The same is also true with word embeddings.
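As a minimal sketch of how this looks in TensorFlow, the snippet below builds a Keras Embedding layer. The pretrained_weights array is a random placeholder standing in for downloaded vectors, such as GloVe, and the trainable argument controls whether the embeddings are frozen or fine-tuned on your corpus.
import numpy as np
import tensorflow as tf
# Placeholder for pretrained vectors: a 10,000-word vocabulary,
# each word mapped to a 100-dimensional embedding.
pretrained_weights = np.random.normal(size = (10000, 100))
# Define an embedding layer initialized with the pretrained
# weights. Set trainable to True to fine-tune the embeddings.
embedding = tf.keras.layers.Embedding(
        input_dim = 10000,
        output_dim = 100,
        embeddings_initializer =
        tf.keras.initializers.Constant(pretrained_weights),
        trainable = False
)
# Map a batch of word index sequences to embedding vectors.
word_ids = tf.constant([[12, 7, 256]])
print(embedding(word_ids).shape)
(1, 3, 100)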

Topic Modeling

The purpose of a topic model is to uncover a latent set of topics in a corpus and to determine the extent to which those topics are present in individual documents. One of the earliest and most influential topic models, latent Dirichlet allocation (LDA), was introduced to the machine learning literature in Blei et al. (2003) and has since found applications in many areas, including economics and finance.

While TensorFlow does not provide an implementation for standard workhorse topic models, it is the framework of choice for many sophisticated topic models. In general, a topic model will be more likely to be implemented in TensorFlow if it makes use of deep learning.

Since topic modeling is seeing increased use in economics, we will provide a brief introduction in this section, even though we will not make use of TensorFlow. We’ll start with a theoretical overview of the static LDA model (Blei et al. 2003), followed by a description of how to implement and tune it using sklearn. We’ll close the section by discussing recently introduced variants of the model.

In Blei et al. (2003), the LDA model is described as follows:

a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

There are a few concepts worth explaining, since they will reappear throughout this chapter and text. First, the model is “generative” because it generates a novel output – the topic distribution – rather than performing a discriminative task, such as learning a classification for a document. Second, it is “probabilistic” because the model is explicitly grounded in probability theory and yields probabilities. And third, we say that topics are “latent” in that they are not explicitly measured or labeled, but are assumed to be an underlying feature of documents.

While we won’t discuss the details of solving an LDA model, we’ll briefly summarize the assumptions underlying the model in Blei et al. (2003), starting with notation. First, they assume that words are drawn from a fixed vocabulary of length V and represent them using one-hot encoded vectors. Next, they define a document as a sequence of N words, w = (w1, w2, …, wN). Finally, they define a corpus as a collection of documents, D = {w1, w2, …, wM}.

The model makes three assumptions about the underlying process that generates a document, w, in a corpus, D:
  1. The number of words, N, in each document, w, is drawn from a Poisson distribution.

  2. The latent topics are drawn from a k-dimensional random variable, θ, which has a Dirichlet distribution: θ ~ Dir(α).

  3. For each word, n, a topic, zn, is drawn from a multinomial distribution that is conditional on θ. The word itself is then drawn from a multinomial distribution, conditional on the topic, zn.

The authors argue that the Poisson distribution of word counts is not an important assumption and that it would be better to use a more realistic assumption. The choice of the Dirichlet distribution constrains θ to a (k-1)-dimensional simplex. It also provides a multivariate generalization of the beta distribution and is parameterized by a k-vector of positive-valued weights, α. Blei et al. (2003) choose the Dirichlet distribution for three reasons: “…it is in the exponential family, has finite dimensional sufficient statistics, and is conjugate to the multinomial distribution.” They argued that this would ensure its suitability in estimation and inference algorithms.

The probability density of the topic distribution, θ, is given in Equation 6-5.

Equation 6-5. The distribution of topics.
$$ p\left(\theta \mid \alpha\right) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma\left(\alpha_i\right)} \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1} $$
In Figure 6-7, we provide a visual illustration of 100 random draws from the Dirichlet distribution in the case where k = 2. In the left panel of Figure 6-7, we set α = [0.9, 0.1], and in the right panel, we set α = [0.5, 0.5]. In both cases, all points are located on the simplex. That is, summing the coordinates associated with any point will yield 1. Additionally, we can see that choosing identical values of α0 and α1 yields evenly distributed points along the simplex, whereas increasing the relative value of α0 results in a skew toward the horizontal axis (i.e., topic θk).
Figure 6-7. Plot of random draws from the Dirichlet distribution with k=2 and parameter vectors [0.9, 0.1] (left) and [0.5, 0.5] (right)
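A minimal sketch of generating draws like those in Figure 6-7 using NumPy is given below; the parameter vectors correspond to the two panels of the figure.
import numpy as np
# Draw 100 points from a Dirichlet distribution with k = 2 for
# each of the two parameter vectors used in Figure 6-7.
skewed = np.random.dirichlet([0.9, 0.1], size = 100)
symmetric = np.random.dirichlet([0.5, 0.5], size = 100)
# Each draw lies on the simplex: its coordinates sum to one.
print(skewed.sum(axis = 1)[:5])
[1. 1. 1. 1. 1.]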

We’ll next implement an LDA model, making use of the document corpus we constructed earlier by dividing a 6-K filing into sentences. Recall that we defined a document-term matrix, C, using CountVectorizer(). We’ll make use of both of these in Listing 6-15, where we start by importing LatentDirichletAllocation from sklearn.decomposition. Next, we instantiate a model with our preferred parameter values. In this case, we will only set the number of topics, n_components. This corresponds to the k parameter in the theoretical model.

We can now train the model on the document-term matrix and recover the output, wordDist, using lda.components_. Note that wordDist has shape (3, 109). The rows correspond to latent topics, and the columns correspond to weights. The higher the weight, the more important a word is for defining a topic.6

We’ll next make use of the output, wordDist, to identify the words with the highest weights for each topic. We’ll define an empty list, topics, to hold the topics. Within a list comprehension, we’ll step through each topic array and apply argsort() to recover the indices that would sort the array. We’ll then recover the last five indices and reverse their order.

For each index, we’ll identify the associated term by making use of feature_names, which we recovered from vectorizer. We’ll then print the list of topics.

A complete description of a topic consists of a vector of weights over the vocabulary. We can choose how such a topic is described by determining which words have weights that are sufficiently high to justify their inclusion in the topic’s description. In this case, we have simply used the five words with the highest weights; however, in principle, we could have used a threshold value or some other criterion.
from sklearn.decomposition import LatentDirichletAllocation
# Set number of topics.
k = 3
# Instantiate LDA model.
lda = LatentDirichletAllocation(n_components = k)
# Recover feature names from vectorizer.
feature_names = vectorizer.get_feature_names()
# Train model on document-term matrix.
lda.fit(C)
# Recover word distribution for each topic.
wordDist = lda.components_
# Define empty topic list.
topics = []
# Recover topics.
for i in range(k):
        topics.append([feature_names[name] for
        name in wordDist[i].argsort()[-5:][::-1]])
# Print list of topics.
print(topics)
[['inform', 'america', 'gold', 'forwardlook', 'result'],
 ['oper', 'compani', 'product', 'includ', 'relief'],
 ['silver', 'lead', 'cost', 'ounc', 'galena']]
Listing 6-15

Perform LDA on 6-K filing text data

Now that we have identified topics, the next step is to determine what those topics describe. In our simple example, we recovered three topics. The first appears to reference forward-looking information related to gold. The second appears to involve company operations and production. And the third topic is concerned with the cost of metals.

Finally, we complete the exercise by using the transform() method of our model to assign topic probabilities to sentences in Listing 6-16.
# Transform C matrix into topic probabilities.
topic_probs = lda.transform(C)
# Print topic probabilities.
print(topic_probs)
array([[0.0150523 , 0.97172408, 0.01322362],
       [0.02115127, 0.599319  , 0.37952974],
       [0.33333333, 0.33333333, 0.33333333],
                                ...
       [0.93766165, 0.03140632, 0.03093203],
       [0.08632993, 0.82749933, 0.08617074],
       [0.95509882, 0.02178363, 0.02311755]])
Listing 6-16

Assign topic probabilities to sentences

The output, as we can see in Listing 6-16, is a matrix of shape (50, 3), which contains topic probabilities that sum to one for each sentence. If, for instance, we had collected separate 6-K filings for each date, rather than looking at sentences within a filing, we’d now have the time series of topic proportions.

We’ve also plotted the topic proportions in Figure 6-8. We can see that there appears to be persistence in topics across sentences in the document. For instance, topic 1 is dominant at the start and end of the document, and topic 3 rises in importance in the middle.
Figure 6-8. Topic proportions by sentence
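A plot along the lines of Figure 6-8 can be produced directly from topic_probs; the following is a minimal sketch using matplotlib.
import matplotlib.pyplot as plt
# Plot the proportion of each topic across the 50 sentences.
for topic in range(topic_probs.shape[1]):
        plt.plot(topic_probs[:, topic],
        label = 'Topic ' + str(topic + 1))
# Label the axes and add a legend.
plt.xlabel('Sentence')
plt.ylabel('Topic proportion')
plt.legend()
plt.show()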

While we considered a simple example that did not require a careful choice of model or training parameters, the LDA implementation in sklearn does, in fact, permit the choice of a variety of different parameters. We consider five of those parameters in the following list and show, in a short sketch after the list, how several of them can be set:

  1. Topic prior: By default, the LDA model will use 1/n_components as the prior for all elements in α. You can, however, supply a different prior by explicitly providing a topic distribution for the parameter doc_topic_prior.

  2. Learning method: By default, the LDA model in sklearn will use variational Bayes to train the model and will make use of the full sample to perform each update. It is, however, possible to train in mini-batches by setting the learning_method parameter to 'online'.

  3. Batch size: Conditional on using online training, you will also have the option to change the mini-batch size from its default value of 128. You can do this using the batch_size parameter.

  4. Learning decay: When using the online learning method, the learning_decay parameter can be used to adjust the learning rate. A higher value of decay lowers the information we retain from previous iterations. The default value is 0.7, and the documentation recommends selecting a decay in the (0.5, 1] interval.

  5. Maximum number of iterations: Setting a maximum number of iterations will terminate the training process after that threshold has been reached. By default, the max_iter parameter is set to 10. If the model does not appear to converge within that number of iterations, you may want to set a higher value for this parameter.
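The sketch below shows how several of these parameters could be set when instantiating the model; the specific values are illustrative rather than recommendations.
from sklearn.decomposition import LatentDirichletAllocation
# Instantiate an LDA model with online (mini-batch) training
# and illustrative parameter values.
lda = LatentDirichletAllocation(
        n_components = 3,
        learning_method = 'online',
        batch_size = 64,
        learning_decay = 0.7,
        max_iter = 25
)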

Finally, two limitations of the standard LDA model introduced by Blei et al. (2003) merit discussion. First, neither the number nor content of topics may vary over the corpus. For many problems, this is not an issue; however, for applications in economics and finance that involve a time series dimension, this can be quite problematic, as we will expand on in the following paragraph. And second, the LDA model does not provide any meaningful control over the topics extracted. If we wish to track specific types of events in the data, we may not be able to do that using an LDA model, since there is no guarantee that it will identify those events.

With respect to the first problem – namely, using an LDA in time series contexts – two issues may arise. First, the model will censor topics that appear only briefly, such as financial crises, even if they are quite important during the period in which they appear. And second, it will introduce a “look-ahead” bias in the topic distribution by forcing topics that emerge in the future to also be topics in the entire sample. This can create the impression that the LDA model would have predicted events that it would not have if the sample were truncated at the date of the event.

With respect to the second problem, LDA presents two issues. The first is that we do not have the possibility to guide the model toward topics of interest. We cannot, for instance, submit topic queries to the LDA model. The second issue is that the topics the model does generate are often challenging to interpret. This is because a topic is simply a distribution over all words in the vocabulary. We will often be unable to determine what exactly a topic is without studying the distribution and examining the documents in which it is determined to be dominant.

There are, however, more recently developed models that attempt to overcome the limitations of the static LDA model. Blei and Lafferty (2006), for instance, introduce a dynamic version of the topic model. Additionally, Dieng et al. (2019) extend this further by introducing a dynamic embedded topic model (D-ETM). This model is dynamic, permits the use of a large vocabulary, and tends to generate more interpretable topics. This solves both of the issues related to the original static LDA model.

Text Regression

As Gentzkow et al. (2019) discuss, most text analysis within economics and finance centers around the bag-of-words model and dictionary-based methods. While these techniques are useful under certain circumstances, they are not the best tool for all research questions. Consequently, many projects that involve text analysis in economics could likely be improved by making use of different methods from natural language processing.

One option is to use a text regression, which is simply a regression model that includes text features, such as columns from the term-document matrix, as regressors. Gentzkow et al. (2019) argue that text regression is a good candidate method for economists to adopt. This is because economists primarily use linear regression for empirical work and often have familiarity with penalized linear regression. Thus, learning how to perform a text regression is mostly about constructing the document-term matrix, not learning how to estimate a regression.

We’ll start this section by performing a simple text regression in TensorFlow. To do this, we’ll need to construct the document-term matrix and a continuous dependent variable. Rather than using sentences within a 6-K filing, we’ll use all 8-K filings for Apple in the SEC’s system to construct the document-term matrix.7 We’ll then use the daily percentage change in Apple’s stock price on the day of the filing as the dependent variable.

For the sake of brevity, we’ll omit the details of the data collection process other than to say that we performed the same steps discussed earlier in the chapter to produce a document-term matrix, x_train, and stored the stock returns data as y_train. In total, we made use of 144 filings and extracted 25 unigram counts to construct x_train.

Recall from Chapter 3 that a linear model with k regressors has the form given in Equation 6-6. In this case, the k regressors are the feature counts from the document-term matrix. Note that we index documents using t, since we are using a time series of filings and returns.

Equation 6-6. A linear model.
$$ Y_t = \alpha + \beta_0 X_{t0} + \dots + \beta_{k-1} X_{t,k-1} $$

We could, of course, make use of OLS and solve for the parameter vector with an analytical expression. However, for the sake of building toward models that are not analytically tractable, we’ll instead make use of a LAD regression. In Listing 6-17, we import tensorflow and numpy, initialize a constant term (alpha) and the vector of coefficients (beta), transform x_train and y_train into numpy arrays, and then define a function (LAD), which transforms the parameters and data into predictions.

Recall that we must define the parameters we wish to train using tf.Variable() and can use either np.array() or tf.constant() to define data.
import tensorflow as tf
import numpy as np
# Draw initial values randomly.
alpha = tf.random.normal([1], stddev=1.0)
beta = tf.random.normal([25,1], stddev=1.0)
# Define variables.
alpha = tf.Variable(alpha, tf.float32)
beta = tf.Variable(beta, tf.float32)
# Convert data to numpy arrays.
x_train = np.array(x_train, np.float32)
y_train = np.array(y_train, np.float32)
# Define LAD model.
def LAD(alpha, beta, x_train):
        prediction = alpha + tf.matmul(x_train, beta)
        return prediction
Listing 6-17

Prepare the data and model for a LAD regression in TensorFlow

The next steps are to define a loss function and perform minimization, which we do in Listing 6-18. We will use a mean absolute error (MAE) loss, since we’re performing a LAD regression. We’ll then instantiate an Adam() optimizer with default parameter values. Finally, we’ll perform 1000 training iterations.
# Define number of observations.
N = len(x_train)
# Define function to compute MAE loss.
def maeLoss(alpha, beta, x_train, y_train):
        y_hat = LAD(alpha, beta, x_train)
        y_hat = tf.reshape(y_hat, (N,))
        return tf.losses.mae(y_train, y_hat)
# Instantiate optimizer.
opt = tf.optimizers.Adam()
# Perform optimization.
for i in range(1000):
        opt.minimize(lambda: maeLoss(alpha, beta,
        x_train, y_train),
        var_list = [alpha, beta])
Listing 6-18

Define an MAE loss function and perform optimization

Now that we’ve trained a model, we can feed arbitrary inputs into the LAD function, which will yield predicted values. We’ll do that using x_train to generate predictions, y_pred, for y_train in Listing 6-19.
# Generate predicted values.
y_pred = LAD(alpha, beta, x_train)
Listing 6-19

Generate predicted values from model

We plot the predicted values against the true values in Figure 6-9. The constant term matches the mean return, and the predictions appear to capture the direction of most changes correctly; however, the model generally fails to explain much of the variation in the data.
Figure 6-9. True and predicted values of Apple stock returns

There are several reasons, unrelated to natural language processing, why the model likely explains so little of the variation in the data. First, the 1-day time window could be too large and may capture developments unrelated to the announcement effect. Indeed, much of the literature in economics on the subject has moved to concentrating on narrower windows around announcements, such as 30 minutes. Second, we didn’t include any non-text features in the regression, such as lagged returns, returns from the entire tech sector, or data releases from statistical agencies. And third, predicting surprise returns is challenging, and even good models will typically fail to explain most of the variation in the data.

For the sake of this exercise, however, let’s put all of that aside and consider how we might improve prediction purely from NLP on announcements. A good place to start might be to question whether we selected unigrams that contained meaningful content for explaining returns. Given that we uncritically accepted the 25 features selected by the CountVectorizer(), it is possible that a more thoughtful selection of features could lead to an improvement. Recall that we can extract the features from vectorizer using the get_feature_names() method. In Listing 6-20, we do this and then print the unigrams extracted from texts.
# Get feature names from vectorizer.
feature_names = vectorizer.get_feature_names()
# Print feature names.
print(feature_names)
['act', 'action', 'amend', 'amount', 'board',
'date', 'director', 'incom', 'law', 'made',
'make', 'net', 'note', 'offic', 'order',
'parti', 'price', 'product', 'quarter', 'refer',
'requir', 'respect', 'section', 'state', 'term']
Listing 6-20

Recover and print feature names from the vectorizer

Many terms, as we can see in Listing 6-20, appear to be neutral. Depending on how they are modified in the text, they could predict either a positive or a negative return. If the model were able to account for the contexts in which these terms are used, it might assign a large magnitude to a correctly signed feature.

We might try to fix this by expanding the set of features, performing more extensive filtering to determine the features we include, or changing the model specification to allow for non-linearities, such as feature interactions. Since we have already covered cleaning and filtering, we’ll focus on the expansion of features and the inclusion of non-linearities.

Given that the training set contains only 144 observations, we might be concerned that including more features will improve in-sample fit only through overfitting. This is a valid concern, and we will address it by using a penalized regression model. The penalty is constructed so that each additional parameter with a non-zero value increases the value of the loss function. Thus, parameters that do not provide considerable predictive value will be zeroed out or assigned low magnitudes.

Gentzkow et al. (2019) define a general penalized estimator as the solution to the minimization problem in Equation 6-7.

Equation 6-7. The minimization problem for a penalized estimator .
$$ \min \left\{ l\left(\alpha, \beta\right) + \lambda \sum_j \kappa_j\left(\left|\beta_j\right|\right) \right\} $$

Note that l(α, β) is a loss function, such as the MAE loss for a linear regression, λ scales the magnitude of the penalty, and κj(·) is an increasing penalty function that could, in principle, differ by parameter; however, in practice, we will often assume it is identical for all regressors.

There are three types of penalized regression we will often encounter, each of which is defined by the associated choice of κ(·):
  1.
     LASSO regression: The least absolute shrinkage and selection operator (LASSO) model uses the L1 norm of β, reducing κ to an absolute value, ∣βj∣, for all j. The functional form of the penalty in a LASSO regression will force certain parameter values to 0, yielding a sparse parameter vector.

  2.
     Ridge regression: A ridge regression uses the L2 norm of β, yielding $$ \kappa\left(\beta_j\right) = \beta_j^2 $$. Unlike a LASSO regression, a ridge regression will yield a dense representation of β with coefficients not set precisely to zero. Since the penalty term of a ridge regression is a convex function, it will yield a unique minimum.

  3.
     Elastic net regression: An elastic net regression combines both the LASSO and ridge regression penalties. That is, $$ \kappa\left(\beta_j\right) = \kappa_1\left|\beta_j\right| + \kappa_2\beta_j^2 $$ for all j.


The minimization problems for LASSO, ridge, and elastic net regressions are given in Equations 6-8, 6-9, and 6-10, respectively.

Equation 6-8. The minimization problem for a LASSO regression.
$$ \min \left\{ l\left(\alpha, \beta\right) + \lambda \sum_j \left|\beta_j\right| \right\} $$
Equation 6-9. The minimization problem for a ridge regression.
$$ \min \left\{ l\left(\alpha, \beta\right) + \lambda \sum_j \beta_j^2 \right\} $$
Equation 6-10. The minimization problem for an elastic net regression.
$$ \min \left\{ l\left(\alpha, \beta\right) + \lambda \sum_j \left[ \kappa_1\left|\beta_j\right| + \kappa_2\beta_j^2 \right] \right\} $$
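
To make these penalties concrete, the snippet below sketches how an elastic net penalty could be added to the MAE loss defined earlier. This is a minimal illustration rather than code from the chapter's listings: the LAD() function, alpha, beta, x_train, y_train, and N are assumed to be defined as in the earlier listings, and the kappa1 and kappa2 weights are arbitrary illustrative values with λ folded into them.
# A sketch of an elastic net loss; the penalty weights are illustrative.
kappa1 = tf.constant(0.05, tf.float32)
kappa2 = tf.constant(0.05, tf.float32)
def elasticNetLoss(alpha, beta, x_train, y_train):
        y_hat = LAD(alpha, beta, x_train)
        y_hat = tf.reshape(y_hat, (N,))
        # Combine the L1 and squared-L2 penalties on beta.
        penalty = (kappa1 * tf.norm(beta, 1) +
                kappa2 * tf.reduce_sum(tf.square(beta)))
        return tf.losses.mae(y_train, y_hat) + penalty
Swapping this function handle into the training loop from Listing 6-18 would yield elastic net estimates.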

We will return to the Apple stock returns prediction problem, but will now make use of a LASSO regression, which will yield a sparse coefficient vector. In our case, there were many neutral terms that likely added minimal value in a linear model, where they couldn’t be modified by adjectives. By using a LASSO regression, we’ll allow the model to decide whether to ignore them entirely by assigning a zero weight.

Before we modify the model, we’ll first apply CountVectorizer() again, but this time, we’ll construct a document-term matrix for 1000 terms, rather than 25. For the sake of brevity, we’ll omit the details and will instead start at the end of the process, where feature_names contains 1000 elements and x_train has the shape (144, 1000).
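
For readers who want to reproduce this step, the snippet below is a rough sketch of how the larger document-term matrix could be constructed. The corpus name texts matches the earlier listings, but the max_features argument and the cast to a float32 tensor are assumptions for this sketch, not a verbatim reproduction of the omitted code.
import tensorflow as tf
from sklearn.feature_extraction.text import CountVectorizer
# Instantiate a vectorizer that retains the 1000 most frequent terms.
vectorizer = CountVectorizer(max_features=1000)
# Construct the document-term matrix from the cleaned corpus, texts.
dtm = vectorizer.fit_transform(texts)
# Recover the feature names and convert the counts to a float32 tensor.
feature_names = vectorizer.get_feature_names()
x_train = tf.cast(dtm.toarray(), tf.float32)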

Next, in Listing 6-21, we'll re-define beta; set the magnitude of the penalty, lam; and re-define the loss function, which we'll now call lassoLoss(). Notice that the only difference is that we've added a term that consists of lam multiplied by the L1 norm of beta. Beyond that, nothing else has changed. We still use the LAD function to make predictions, just as we did with the linear regression model.
# Re-define coefficient vector as a trainable variable.
beta = tf.Variable(tf.random.normal([1000, 1], stddev=1.0))
# Set value of lambda parameter.
lam = tf.constant(0.10, tf.float32)
# Modify the loss function to include the L1 penalty.
def lassoLoss(alpha, beta, x_train, y_train, lam=lam):
        y_hat = LAD(alpha, beta, x_train)
        y_hat = tf.reshape(y_hat, (N,))
        loss = (tf.losses.mae(y_train, y_hat) +
                lam * tf.norm(beta, 1))
        return loss
Listing 6-21

Convert a LAD regression into a LASSO regression

In Listing 6-22, we’ll repeat the steps to train the model using the modified loss function and generate predictions on the training set.
# Perform optimization.
for i in range(1000):
        opt.minimize(lambda: lassoLoss(alpha, beta,
                x_train, y_train),
                var_list=[alpha, beta])
# Generate predicted values.
y_pred = LAD(alpha, beta, x_train)
Listing 6-22

Train a LASSO model

Now that we have the predicted values from the LASSO model, we can perform a comparison with the true returns. Figure 6-10 depicts this comparison, providing an update to Figure 6-9, which conducted the same exercise, but for the LAD model without a penalty term and with only 25 features.

We can see that performance has improved substantially under the LASSO model with 1000 features; however, we might worry that the penalty magnitude we selected wasn't sufficiently severe and that the model is overfitting. To evaluate this, we can re-train the model with higher values of lam and check its performance. We could also perform cross-validation or evaluate the model on a held-out test set; however, this is somewhat more challenging in a time series context with only 144 observations.
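
As a first pass, we could simply re-train the model with a more severe penalty and compare the resulting fit. The snippet below sketches this, using an illustrative value of 0.50 for lam; lassoLoss(), LAD(), alpha, x_train, and y_train are assumed to be defined as in the earlier listings.
# Re-train with a more severe penalty (the value 0.50 is illustrative).
lam = tf.constant(0.50, tf.float32)
beta = tf.Variable(tf.random.normal([1000, 1], stddev=1.0))
opt = tf.optimizers.Adam()
for i in range(1000):
        opt.minimize(lambda: lassoLoss(alpha, beta,
                x_train, y_train, lam),
                var_list=[alpha, beta])
# Re-generate predictions for comparison with Figure 6-10.
y_pred = LAD(alpha, beta, x_train)
For the discussion that follows, we'll continue to use the estimates from the original penalty of 0.10.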

For now, recall that a LASSO regression returns a sparse coefficient vector; let's examine how many coefficients take non-zero values. Figure 6-11 plots the histogram of the coefficient magnitudes. From this, we can see that over 800 of the features were assigned values of approximately zero. While the remaining features are still numerous enough to warrant some concern about overfitting, the problem is less severe, given that most of the 1000 features were effectively ignored by the model as a consequence of the penalty function.
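
The count of near-zero coefficients can be checked with a few lines of code. The snippet below is a sketch, assuming beta holds the trained LASSO coefficients; the 1e-3 threshold used to classify a coefficient as "approximately zero" is an arbitrary choice for illustration.
import numpy as np
import matplotlib.pyplot as plt
# Extract the trained coefficients as a flat NumPy array.
beta_hat = beta.numpy().flatten()
# Count coefficients with negligible magnitudes.
print('Near-zero coefficients:', np.sum(np.abs(beta_hat) < 1e-3))
# Plot the histogram of coefficient magnitudes, as in Figure 6-11.
plt.hist(np.abs(beta_hat), bins=50)
plt.xlabel('Coefficient magnitude')
plt.ylabel('Count')
plt.show()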
Figure 6-10

True and predicted values of Apple stock returns using a LASSO model

We’ve now seen that we can make use of additional features in a regression by employing a form of regularization (i.e., a penalty function). The penalty function prevents us from simply adding more parameters to improve fit: each additional parameter increases the penalty and must therefore justify its inclusion by substantially improving fit. This also means that we can include many more features and allow the model to sort out which should be assigned non-zero magnitudes.
Figure 6-11

Histogram of estimated coefficient magnitudes from the LASSO model

We mentioned earlier that using a LASSO model allowed us to expand the feature set, which was one way to improve performance. Another option we mentioned was to allow for dependence between words. We can do this by permitting non-linearities in the model. In principle, we could engineer these features. We could also make use of any non-linear model to perform such a task. Furthermore, we could couple this with a penalty term, just as we did with the LASSO model, to avoid overfitting.

While these are viable strategies and can be implemented with relative ease in TensorFlow, we’ll instead make use of a more general option: deep learning. We have already discussed deep learning in the context of images in Chapter 5, but we return to it here because it provides a flexible and potent modeling strategy for most text regression problems.

The distinction between “deep learning” (e.g., neural networks) and “shallow learning” (e.g., linear regression) is that shallow learning models require us to perform feature engineering. For instance, in a linear text regression, we must decide which features are in the document-term matrix (e.g., unigrams or bigrams). We must also decide how many features to allow in the model. The model will determine which are most important for explaining variation in the data, but we must choose which to include.

Recall, again, that this was not the case with images. We input pixel values into convolutional neural networks and those networks identified successive layers of increasingly complex features. First, the networks identified edges. In the next layer, they identified corners. Each successive layer built on the previous one to identify new features that were useful for the classification task.

Deep learning can also be used in the same way for text. Rather than deciding how terms relate to each other through the use of feature engineering, we can allow a neural network to uncover these relationships for us. Just as we did in Chapter 5, we’ll make use of the high-level Keras API in TensorFlow.

In Listing 6-23, we define a neural network with dense layers that we’ll use to predict stock returns for Apple. There is only one substantive difference between this network and the dense layer-based image networks we defined in Chapter 5: the use of dropout layers. Here, we have included two such layers, each of which has a rate of 0.20. During the training phase, this will randomly drop 20% of the nodes, forcing the model to learn robust relationships, rather than using the high number of model parameters to memorize output values.8

In addition to this, notice that we’ve defined the model to accept an input with 1000 feature columns, which is the number we’ve included in our document-term matrix. We also use relu activation functions for all hidden layers. Additionally, we use a linear activation function in the outputs layer, since we have a continuous target (stock returns).
import tensorflow as tf
# Define input layer.
inputs = tf.keras.Input(shape=(1000,))
# Define dense layer.
dense0 = tf.keras.layers.Dense(64,
        activation="relu")(inputs)
# Define dropout layer.
dropout0 = tf.keras.layers.Dropout(0.20)(dense0)
# Define dense layer.
dense1 = tf.keras.layers.Dense(32,
        activation="relu")(dropout0)
# Define dropout layer.
dropout1 = tf.keras.layers.Dropout(0.20)(dense1)
# Define output layer.
outputs = tf.keras.layers.Dense(1,
        activation="linear")(dropout1)
# Define model using inputs and outputs.
model = tf.keras.Model(inputs=inputs,
        outputs=outputs)
Listing 6-23

Define a deep learning model for text using the Keras API

The architecture we’ve selected will require us to train many parameters. Recall that we can check this using the summary() method of a keras model, which we do in Listing 6-24. In total, the model has 66,177 trainable parameters.

With the LASSO model, we were already concerned about overfitting, even though the model only had 1001 parameters and the penalty function effectively forced 850 of them to be zero. We now have a model with 66,177 parameters, which should make us even more concerned about overfitting. This is why we’ve used a form of regularization (dropout) and why we’ll also use a training and validation sample.
# Print model architecture.
print(model.summary())
_____________________________________________________
Layer (type)           Output Shape          Param #
=====================================================
input_3 (InputLayer)  [(None, 1000)]            0
_____________________________________________________
dense_5 (Dense)         (None, 64)            64064
_____________________________________________________
dropout_1 (Dropout)     (None, 64)              0
_____________________________________________________
dense_6 (Dense)         (None, 32)             2080
_____________________________________________________
dropout_2 (Dropout)     (None, 32)              0
_____________________________________________________
dense_7 (Dense)         (None, 1)               33
=====================================================
Total params: 66,177
Trainable params: 66,177
Non-trainable params: 0
_____________________________________________________
Listing 6-24

Summarize the architecture of a Keras model

Recall that, in addition to defining a model, we’ll need to compile it. We’ll do that and train the model in Listing 6-25. Notice that we use the Adam optimizer, the mean absolute error (MAE) loss, and a validation split of 30% of the sample. We’ll also use 20 epochs.
# Compile the model.
model.compile(loss="mae", optimizer="adam")
# Train the model.
model.fit(x_train, y_train, epochs=20,
batch_size=32, validation_split = 0.30)
Epoch 1/20
100/100 [==============================] - 0s 5ms/sample - loss: 2.6408 - val_loss: 2.5870
...
Epoch 10/20
100/100 [==============================] - 0s 117us/sample - loss: 1.7183 - val_loss: 1.3514
...
Epoch 15/20
100/100 [==============================] - 0s 110us/sample - loss: 1.6641 - val_loss: 1.2014
...
Epoch 20/20
100/100 [==============================] - 0s 113us/sample - loss: 1.5932 - val_loss: 1.2536
Listing 6-25

Compile and train the Keras model

As we can see in Listing 6-25, training initially reduces the loss for both the training and validation splits; however, after roughly the 15th epoch, the loss on the training split continues to decline while the loss on the validation split begins to increase slightly. This suggests that we might be starting to overfit.
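
If we want to examine these dynamics more closely, we can retain the history object returned by fit() and plot the two loss series. The snippet below is a sketch that assumes we re-run the training step from Listing 6-25 and assign its return value to a variable called history.
import matplotlib.pyplot as plt
# Re-train, retaining the history of training and validation losses.
history = model.fit(x_train, y_train, epochs=20,
        batch_size=32, validation_split=0.30)
# Plot the two loss series by epoch.
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()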

Figure 6-12 repeats the prediction exercise for the returns, using the predict() method of model. While the predictions appear to be an improvement over the linear and LASSO regressions, it is likely that part of this gain is attributable to overfitting.
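
Generating those predictions requires only a single call; note that predict() returns an array of shape (144, 1), which can be flattened before plotting against the true returns.
# Generate predicted returns from the trained network.
y_pred = model.predict(x_train).flatten()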
Figure 6-12

True and predicted values of Apple stock returns using a neural network

If we want to reduce the risk of overfitting even further, we could increase the rates in our two dropout layers or decrease the number of nodes in the hidden layers.
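
As an illustration, the network below raises both dropout rates to 0.50 and shrinks the hidden layers to 32 and 16 nodes; these particular values are arbitrary choices for the sketch, not recommendations from the chapter.
# Define a more aggressively regularized version of the network.
inputs = tf.keras.Input(shape=(1000,))
dense0 = tf.keras.layers.Dense(32, activation="relu")(inputs)
dropout0 = tf.keras.layers.Dropout(0.50)(dense0)
dense1 = tf.keras.layers.Dense(16, activation="relu")(dropout0)
dropout1 = tf.keras.layers.Dropout(0.50)(dense1)
outputs = tf.keras.layers.Dense(1, activation="linear")(dropout1)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss="mae", optimizer="adam")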

Finally, note that making use of word sequences, rather than ignoring the order in which words appear, can lead to substantial improvements in model performance. This will require the use of recurrent neural networks and their variants, including the long short-term memory (LSTM) model. Since we will use the same family of models to perform time series analysis, we’ll delay their introduction to Chapter 7.

Text Classification

In the previous section, we discussed how TensorFlow could be used to perform text regression. Once we had constructed the document-term matrix, we saw that it was relatively simple to perform a LAD regression and a LASSO regression and to train a neural network. In some cases, however, we will have a discrete target and will want to perform classification instead. Fortunately, TensorFlow provides us with the flexibility to perform classification tasks by making minor adjustments to models we have already defined.

Listing 6-26, for instance, shows how we can define a logistic model to perform classification. We’ll assume we’re using the same document-term matrix, x_train, but have now replaced y_train with hand-classified labels, produced by reading the individual 8-K filings and classifying them as “positive” or “negative” based on our perception of the content. A positive filing is indicated by a 0 and a negative filing by a 1.
# Define a logistic model.
def logitModel(alpha, beta, x_train):
        # Use a sigmoid, since the model outputs a single probability.
        prediction = tf.nn.sigmoid(tf.matmul(
                x_train, beta) + alpha)
        return prediction
Listing 6-26

Define a logistic model to perform classification in TensorFlow

In addition to the changes to model definition, we’ll also need to modify the loss function to use the binary cross-entropy loss, which we do in Listing 6-27. After that, we’ll only need to change the function handle when we perform optimization. Everything else will work as it did for the linear regression example.
# Define number of observations.
N = len(x_train)
# Define function to compute binary cross-entropy loss.
def logisticLoss(alpha, beta, x_train, y_train):
        y_hat = logitModel(alpha, beta, x_train)
        y_hat = tf.reshape(y_hat, (N,))
        loss = tf.losses.binary_crossentropy(
                y_train, y_hat)
        return loss
Listing 6-27

Define a loss function for the logistic model to perform classification in TensorFlow
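
A sketch of that optimization step is given below; alpha and beta are assumed to be freshly initialized tf.Variable objects sized for the 1000-column document-term matrix, and y_train is assumed to hold the hand-classified 0/1 labels cast to float32.
# Initialize trainable parameters for the logistic model.
alpha = tf.Variable(tf.random.normal([1, 1]))
beta = tf.Variable(tf.random.normal([1000, 1]))
# Instantiate optimizer and minimize the logistic loss.
opt = tf.optimizers.Adam()
for i in range(1000):
        opt.minimize(lambda: logisticLoss(alpha, beta,
                x_train, y_train),
                var_list=[alpha, beta])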

Similarly, if we want to perform classification with the neural network we defined in Listing 6-23, we’ll only need to modify two lines of code, as is shown in Listing 6-28.
# Change output layer to use sigmoid activation.
outputs = tf.keras.layers.Dense(1,
        activation="sigmoid")(dropout1)
# Use binary cross-entropy loss in compilation.
model.compile(loss="binary_crossentropy", optimizer="adam")
Listing 6-28

Modify a neural network to perform classification

We changed two things: the activation function used in the outputs layer and the loss function. First, we needed to use a sigmoid activation function, since we’re performing classification with two classes. And second, we used the binary_crossentropy loss, which is standard for classification problems with two classes. If we instead had a problem with multiple classes, we’d use a softmax activation function and a categorical_crossentropy loss.
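
For concreteness, a multiclass variant of Listing 6-28 might look like the following, where K stands in for the number of classes and the labels are assumed to be one-hot encoded; both are placeholders for this sketch.
# Number of classes (placeholder value for illustration).
K = 3
# Use a softmax output layer with one node per class.
outputs = tf.keras.layers.Dense(K,
        activation="softmax")(dropout1)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
# Use categorical cross-entropy loss in compilation.
model.compile(loss="categorical_crossentropy", optimizer="adam")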

For an extended overview of classification with neural networks, see Chapter 5, which covers similar material, but in the context of image classification problems. Additionally, for information about sequential models, which are commonly used for text classification problems, see Chapter 7, which makes use of the same models for time series analysis.

Summary

This chapter provided an extended overview of how text analysis is currently used in economics and finance and how it might be used in the future. The part of the process that is likely to be least familiar for economists is the data cleaning and preparation step, which transforms text into numerical data. The simplest version of this was the bag-of-words model, which stripped words from their context and summarized the content of a document using word counts alone. While this method is relatively simple to implement, it is powerful and remains one of the more commonly used methods in economics.

Dictionary-based methods also build on the bag-of-words model. However, rather than counting all terms in a document, we instead construct a dictionary that measures a latent variable. Such methods are frequently used in text analysis in economics, but are not always the best tool for a given research application, as Gentzkow et al. (2019) discuss. The EPU index (Baker, Bloom, and Davis 2016) is arguably an ideal use case for dictionary-based methods in economics, since the measure is interesting for theoretical purposes, but is unlikely to emerge as a dominant topic from a corpus.

We also discussed word embeddings and saw how to implement topic models, text regression models, and text classification models. This included an overview of using deep learning models for text. We did, however, defer the discussion of sequential models to Chapter 7, which uses them for time series analysis.

Bibliography

Acosta, M. 2019. “A New Measure of Central Bank Transparency and Implications for the Effectiveness of Monetary Policy.” Working Paper.

Angelico, C., J. Marcucci, M. Miccoli, and F. Quarta. 2018. “Can We Measure Inflation Expectations Using Twitter?” Working Paper.

Apel, M., and M. Blix Grimaldi. 2014. “How Informative Are Central Bank Minutes?” Review of Economics 65 (1): 53–76.

Ardizzi, G., S. Emiliozzi, J. Marcucci, and L. Monteforte. 2020. “News and Consumer Card Payments.” Bank of Italy Economic Working Paper.

Armelius, H., C. Bertsch, I. Hull, and X. Zhang. 2020. “Spread the Word: International Spillovers from Central Bank Communication.” Journal of International Money and Finance (103).

Athey, S., and G.W. Imbens. 2019. “Machine Learning Methods that Economists Should Know About.” Annual Review of Economics 11: 685–725.

Baker, S.R., N. Bloom, and S.J. Davis. 2016. “Measuring Economic Policy Uncertainty.” Quarterly Journal of Economics 131 (4): 1593–1636.

Bertsch, C., I. Hull, Y. Qi, and X. Zhang. 2020. “Bank Misconduct and Online Lending.” Journal of Banking and Finance 116.

Blei, D.M., A.Y. Ng, and M.I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.

Blei, D.M., and J.D. Lafferty. 2006. “Dynamic Topic Models.” ICML '06: Proceedings of the 23rd international conference on Machine Learning. 113–120.

Bloom, N., S.R. Baker, S.J. Davis, and K. Kost. 2019. “Policy News and Stock Market Volatility.” Mimeo.

Born, B., M. Ehrmann, and M. Fratzscher. 2013. “Central Bank Communication on Financial Stability.” The Economic Journal 124.

Cerchiello, P., G. Nicola, S. Ronnqvist, and P. Sarlin. 2017. “Deep Learning Bank Distress from News and Numerical Financial Data.” arXiv.

Chahrour, R., K. Nimark, and S. Pitschner. 2019. “Sectoral Media Focus and Aggregate Fluctuations.” Working Paper.

Correa, R., K. Garud, J.M. Londono, and N. Mislang. 2020. “Sentiment in Central Banks’ Financial Stability Reports.” International Finance Discussion Papers 1203, Board of Governors of the Federal Reserve System (U.S.).

Dieng, A.B., F.J.R. Ruiz, and D.M. Blei. 2019. “The Dynamic Embedded Topic Model.” arXiv preprint.

Gentzkow, M., B. Kelly, and M. Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–574.

Hansen, S., and M. McMahon. 2016. “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” Journal of International Economics 99.

Hansen, S., M. McMahon, and A. Prat. 2018. “Transparency and Deliberation within the FOMC: A Computational Linguistics Approach.” Quarterly Journal of Economics 133: 801–870.

Harris, Z. 1954. “Distributional Structure.” Word 10 (2/3): 146–162.

Hollrah, C.A., S.A. Sharpe, and N.R. Sinha. 2018. “What’s the Story? A New Perspective on the Value of Economic Forecasts.” Finance and Economics Discussion Series 2017-107, Board of Governors of the Federal Reserve System (U.S.).

Kalamara, E., A. Turrell, C. Redl, G. Kapetanios, and S. Kapadia. 2020. “Making Text Count: Economic Forecasting Using Newspaper Text.” Bank of England Staff Working Paper No. 865.

LeNail, A. 2019. “NN-SVG: Publication-Ready Neural Network Architecture Schematics.” Journal of Open Source Software 4 (33).

Loughran, T., and B. McDonald. 2011. “When is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” Journal of Finance 66 (1): 35–65.

Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” (arXiv).

Nimark, K.P., and S. Pitschner. 2019. “News Media and Delegated Information Choice.” Journal of Economic Theory 181: 160–196.

Pennington, J., R. Socher, and C. Manning. 2014. “GloVe: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha: Association for Computational Linguistics. 1532–1543.

Pitschner, S. 2020. “How Do Firms Set Prices? Narrative Evidence from Corporate Filings.” European Economic Review 124.

Porter, M.F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–137.

Romer, C.D., and D.H. Romer. 2004. “A New Measure of Monetary Shocks: Derivation and Implications.” American Economic Review 94: 1055–1084.

Salton, G., and M.J. McGill. 1983. Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill.

Shapiro, A.H., and D. Wilson. 2019. “Taking the Fed at its Word: A New Approach to Estimating Central Bank Objectives using Text Analysis.” Federal Reserve Bank of San Francisco Working Paper 2019-02.

Shiller, R.J. 2017. “Narrative Economics.” American Economic Review 107 (4): 967–1004.

Tetlock, P. 2007. “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” Journal of Finance 62 (3): 1139–1168.
