Chapter 10. Working with Unstructured and Textual Data

In this chapter, we will cover the following recipes:

  • Tokenizing text
  • Finding sentences
  • Focusing on content words with stoplists
  • Getting document frequencies
  • Scaling document frequencies by document size
  • Scaling document frequencies with TF-IDF
  • Finding people, places, and things with Named Entity Recognition
  • Mapping documents to a sparse vector space representation
  • Performing topic modeling with MALLET
  • Performing naïve Bayesian classification with MALLET

Introduction

We've been talking about all of the data that's out there in the world. However, structured or semistructured data (the kind you'd find in spreadsheets or in tables on web pages) is vastly overshadowed by the unstructured data that's being produced. This includes news articles, blog posts, tweets, Hacker News discussions, Stack Overflow questions and responses, and any other natural-language text, which seems to be generated by the petabyte every day.

This unstructured content contains rich, subtle, and nuanced information, but extracting it is difficult. In this chapter, we'll explore some ways to get some of that information out of unstructured data. The results won't be fully nuanced and will be very rough, but it's a start. We've already looked at how to acquire textual data, in the Scraping textual data from web pages recipe in Chapter 1, Importing Data for Analysis. The Web is still going to be your best source for data.
