Chapter 9. Parsing Specific Data

In this chapter, we will cover:

Parsing dates and times with Dateutil
Time zone lookup and conversion
Tagging temporal expressions with Timex
Extracting URLs from HTML with lxml
Cleaning and stripping HTML
Converting HTML entities with BeautifulSoup
Detecting and converting character encodings

Introduction

This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries for accomplishing this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to NLTK:

dateutil: Provides date/time parsing and time zone conversion
timex: Can identify time words in text
lxml and BeautifulSoup: Can parse, clean, and convert HTML
chardet: Detects the character encoding of text

The libraries can be useful for pre-processing text before passing it to an NLTK object, or post-processing text that has been processed and extracted using NLTK. Here's an example that ties many of these tools together.

Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once you have the article text, you can use chardet to ensure it's UTF-8 before cleaning out the HTML and running it through NLTK-based part-of-speech tagging, chunk extraction, and/or text classification, to create additional metadata about the article. If there's an event happening at the restaurant, you may be able to discover that by looking at the time words identified by timex. The point of this example is that real-world text processing often requires more than just NLTK-based natural language processing, and the functionality covered in this chapter can help with those additional requirements.
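The non-NLTK part of that workflow can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the sample page, its `date` class, and the `parse_article` helper are all invented for this example, and the NLTK tagging, chunking, and classification steps are left out. It assumes `chardet`, `lxml`, and `python-dateutil` are installed.

```python
import chardet
import lxml.html
from dateutil import parser as date_parser

# Hypothetical blog page, invented for illustration only.
RAW_PAGE = b"""<html><body>
<div class="post">
  <span class="date">August 16, 2012 7:30 PM</span>
  <p>Dinner at <a href="http://example.com/bistro">the bistro</a> was great.</p>
</div>
</body></html>"""


def parse_article(raw):
    """Hypothetical helper: decode raw bytes, then pull out text, links, and date."""
    # chardet guesses the encoding of the raw bytes; fall back to UTF-8
    # if it cannot make a guess (for example, on empty input).
    guess = chardet.detect(raw)
    encoding = guess["encoding"] or "utf-8"
    html = raw.decode(encoding, errors="replace")

    # lxml builds an element tree from the decoded HTML string.
    doc = lxml.html.fromstring(html)

    # iterlinks() yields (element, attribute, link, position) tuples;
    # keep only the href targets, i.e. the outbound links.
    links = [link for (_, attr, link, _) in doc.iterlinks() if attr == "href"]

    # The XPath expressions assume the made-up markup above.
    date_text = doc.xpath('//span[@class="date"]/text()')[0]
    body_text = doc.xpath("//p")[0].text_content()

    # dateutil turns the free-form date string into a datetime object.
    written = date_parser.parse(date_text)

    return {"text": body_text, "links": links, "written": written}


article = parse_article(RAW_PAGE)
print(article["text"])
print(article["links"])
print(article["written"].isoformat())
```

The `text` value returned here is what you would then hand to NLTK for tagging, chunking, or classification; a real page would need XPath expressions matched to its own markup.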
