Chapter 10. Working with Unstructured and Textual Data

In this chapter, we will cover the following recipes:

Tokenizing text
Finding sentences
Focusing on content words with stoplists
Getting document frequencies
Scaling document frequencies by document size
Scaling document frequencies with TF-IDF
Finding people, places, and things with Named Entity Recognition
Mapping documents to a sparse vector space representation
Performing topic modeling with MALLET
Performing naïve Bayesian classification with MALLET

Introduction

We've been talking about all of the data that's out there in the world. However, structured or semistructured data (the kind you'd find in spreadsheets or in tables on web pages) is vastly overshadowed by the unstructured data that's being produced. This includes news articles, blog posts, tweets, Hacker News discussions, Stack Overflow questions and responses, and any other natural text, which seems to be generated by the petabyte every day.

This unstructured content contains rich, subtle, and nuanced information, but extracting it is difficult. In this chapter, we'll explore some ways to get some of that information out of unstructured data. The results won't be fully nuanced and will be very rough, but it's a start. We've already looked at how to acquire textual data: in Chapter 1, Importing Data for Analysis, we covered this in the Scraping textual data from web pages recipe. The Web is still going to be your best source for data.
