Each filing is stored as a separate text file, and a master index contains the filing metadata. We extract the most informative sections, namely:
- Items 1 and 1A: Business and Risk Factors
- Items 7 and 7A: Management's Discussion and Analysis, and Quantitative and Qualitative Disclosures about Market Risk
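
The section extraction listed above could be sketched roughly as follows. This is an illustration rather than the notebook's actual code: the function name `extract_items`, the heading pattern, and the sample text are all hypothetical, and the sketch assumes each item heading starts a line in the form "Item 1A." (real filings vary considerably in formatting).

```python
import re

# Hypothetical heading pattern: "Item 1.", "ITEM 7A:" and similar at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[ab]?)\s*[.:\-]", re.IGNORECASE | re.MULTILINE
)


def extract_items(text, wanted=("1", "1A", "7", "7A")):
    """Return {item label: section text} for the requested items."""
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        item = match.group(1).upper()
        if item not in wanted:
            continue
        # A section runs until the next item heading (or the end of the file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        # A table of contents repeats the headings; keep the longest candidate.
        if len(body) > len(sections.get(item, "")):
            sections[item] = body
    return sections


# Made-up sample text, not taken from a real filing.
sample = """Item 1. Business
We design and sell widgets.
Item 1A. Risk Factors
Demand for widgets may decline.
Item 1B. Unresolved Staff Comments
None.
Item 7. Management's Discussion and Analysis
Revenue grew during the year.
Item 7A. Quantitative and Qualitative Disclosures About Market Risk
We are exposed to interest rate risk.
Item 8. Financial Statements
"""

sections = extract_items(sample)
```

Headings for items that are not requested (such as 1B and 8 here) still act as boundaries, so each extracted section stops where the next item begins.
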
The preprocessing notebook shows how to parse and tokenize the text using spaCy, similar to the approach taken in Chapter 14, Topic Modeling. To preserve the nuances of word usage, we do not lemmatize the tokens.
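
As a minimal sketch of tokenizing without lemmatization, the snippet below keeps each token's surface form instead of its lemma. It is not the notebook's code: the helper `tokenize` is hypothetical, and both the blank English pipeline (tokenizer only, no model download) and the choice to lowercase and drop punctuation are assumptions made for the example.

```python
import spacy

# Assumption: a blank English pipeline, which provides only the rule-based tokenizer.
nlp = spacy.blank("en")


def tokenize(text):
    """Return lowercased surface forms, skipping punctuation and whitespace."""
    doc = nlp(text)
    # token.lower_ is the lowercased surface form; token.lemma_ is deliberately not used.
    return [token.lower_ for token in doc if not (token.is_punct or token.is_space)]


tokens = tokenize("Risks increased. Pricing risks are increasing.")
```

Because no lemma is taken, "increased" and "increasing" remain distinct tokens rather than collapsing to "increase".
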