The Right Tool for the Job: Nokogiri

Novices, and experts too, often approach the problem of extracting data from HTML with regular expressions. (If you’re not familiar with regular expressions, we’ll be covering them in detail in Part II.) This is a truly terrible idea: real-world HTML is nested, inconsistent, and often malformed, so a pattern that works on one page tends to break on the next. The resulting pain can put people off scraping HTML altogether, which is a great shame, since knowing how to extract information from web pages is of huge practical benefit.

The right tool to use is an HTML parser, which will parse the document into a tree and allow you to search and manipulate the nodes within that tree. A few libraries are available, but one called Nokogiri[8] is the most popular among Rubyists, and with good reason: it’s fast, stable, and powerful.

An HTML parser works by taking the HTML that we give it and constructing a document tree from it. This is the same thing your web browser does when it reads a web page, and if you’ve ever used JavaScript you’ve likely worked with your browser’s representation of a document structure. It’s called the Document Object Model (DOM), and it makes it easy to target specific elements and modify them, or add new elements.

Nokogiri does exactly the same thing under the hood: given some HTML, it provides us with a DOM-like interface to that HTML, allowing us to read it and manipulate it. Just like in JavaScript, we can modify and remove elements, but since we’re looking at scraping we’re going to focus on searching through the document to match particular elements we’re interested in, and then extracting their content.

Nokogiri is packaged as a Ruby Gem, so installing it is as easy as:

 
$ ​gem​​ ​​install​​ ​​nokogiri

Now we just require "nokogiri", and we’re all set to begin parsing web pages.

If we also require the open-uri library, part of the Ruby standard library, we can fetch a web page with URI.open and pass the result into Nokogiri’s HTML method to create a document object. (Older examples call a bare open here; since Ruby 3.0 that no longer opens URLs, so URI.open is the form to use.)

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(URI.open("http://example.com/some/page.html"))

open-uri will take care of making the request for us. What it returns is an IO-like object (a StringIO or a Tempfile, depending on the size of the response) containing the HTML from the web page. Nokogiri accepts either a string containing HTML or an IO object, such as a file, so it will happily parse the result for us.

The next step is to actually do something with this document: we can search through it for elements that match particular criteria, for example, or modify its structure. Let’s take a look at how.
