Working with Elements

Once we’ve selected the elements that are appropriate for the task at hand, using either XPath or CSS selectors, our job is most likely only half done. We typically want to actually do something with the elements we’ve matched.

There are two types of things we might want to do with the elements. The first set of tasks involves extracting information—either the text of elements within the set or the contents of attributes on those elements. Exactly where the information is stored depends both on the elements we’re dealing with and what we’re trying to do. If we’ve matched an image, the interesting part might be the src attribute, specifying the URL of the image; if we’ve matched a paragraph, then we’re likely to be interested in the text within the element.

The second set of tasks involves navigating the document from the position we’ve reached. Since the document is a tree, we can navigate up, down, and across it from a given position. This is useful when we want to do something with all of a particular element’s children, for example, or to go up a level to a parent and then proceed again.

Let’s look at the various methods on Nodes and NodeSets, and how we can use them to achieve these tasks.

Extracting Information from Elements

There are generally three pieces of information to extract from an element: its own text contents, the contents of an attribute, and the name of the element. Nokogiri nodes provide one method for each: text, attr, and name.

Reading an Element’s Text

To extract the text of an element, we can use the text method. So to extract all of the level-two headings in a document and print their text, we could use the following Nokogiri code:

nokogiri/h2-text.rb
require "nokogiri"

doc = Nokogiri::HTML(<<-DOC)
  <html>
    <body>
      <h2>This is a heading</h2>
      <p>This is a paragraph</p>

      <h2>This is also a heading</h2>
      <p>This is also a paragraph</p>
    </body>
  </html>
DOC

doc.css("h2").each do |heading|
  puts heading.text
  # prints "This is a heading"
  # then   "This is also a heading"
end

However, this might not behave exactly as we expect: text actually returns the text of this node and all of its descendants. For something like an h2, which typically has no descendants apart from the text content of the heading, this works fine. But if we ran it on an element with many descendants, such as a div, it might produce unexpected results. For example, suppose we used a similar document but asked for the text of the body:

nokogiri/body-text.rb
require "nokogiri"

doc = Nokogiri::HTML(<<-DOC)
  <html>
    <body>
      Some body text

      <p>Some paragraph text</p>
    </body>
  </html>
DOC

doc.at_css("body").text
# => " Some body text Some paragraph text "
# (runs of whitespace, including newlines, are shown here as single spaces)

We can see that we get the text not just of the body tag itself, but of the paragraph, too. It’s as though we stripped all of the HTML tags and are left with what remains. If we’re interested only in the text of the current node, though, we need to target not the node but a child of it: the text() node. To revise the previous example, then:

nokogiri/body-text-node.rb
doc.at_css("body text()").text
# => " Some body text "

Rather than request the text of the body itself, we select the text node within the body. That gets us the results we’re looking for.

We can target this text node using either XPath or CSS selectors; the previous example in XPath would simply be //body/text().

Reading an Element’s Attributes

Reading attributes is also achieved by calling a method on the node. In this case, that method is attr. So to output the source URL of every image in a document, we could use the following:

 
doc.css("img").each do |image|
  puts image.attr("src")
end

Nokogiri provides a shortcut to attr, since it’s something that’s used so frequently: we can use the subscript operator ([]) on a node, passing it the name of the attribute. So we could rewrite the last example as:

 
doc.css("img").each do |image|
  puts image["src"]
end

Reading an Element’s Name

Rounding out the three most useful methods is name, which allows us to check what type of element we’re dealing with. In the previous example, then:

 
doc.css("img").each do |image|
  image.name
  # => "img"
end

We can see we get back img, since it’s img elements that we’ve asked for in our selector. This might be useful when our selector matches multiple types of elements and we want to do something slightly different for each: extracting the src attribute of a script tag but the href attribute of a stylesheet link, for example.

Between these three methods, we can generally extract any information in a web page. After all, if it’s not the content of an element, stored in an attribute of an element, or reflected in the name of the element itself, there are few other places it could be on a page.

Navigating from Nodes

We don’t just use nodes for extracting information; we can also use them to navigate through our document. Imagine we’ve targeted items in a list, where we might want to get a reference to the list itself. Or we could imagine the reverse situation, where we have a reference to the list itself and we want to do something with its children.

In both cases, we want to navigate the hierarchy of the document. In the former case, we want to navigate upward, to an element’s parent; and in the latter case we want to navigate downward, to an element’s children. Here’s an example of doing both:

nokogiri/hierarchy.rb
require "nokogiri"

doc = Nokogiri::HTML(<<-DOC)
  <html>
    <body>
      <ul><li>List item one</li>
      <li>List item two</li></ul>
    </body>
  </html>
DOC

list = doc.at_css("ul")

list_item = list.children.first
list_item.name # => "li"

list_item.parent.name # => "ul"

First we select the ul. Then we access the first list item by requesting the children of the list and then taking the first child. Then we return to the list itself by requesting the parent of the list item.

We can move arbitrarily up and down the hierarchy as we please using just these two methods. That follows from the tree structure underlying every HTML document. But we can move across the hierarchy, too. Just as nodes have parents and possibly also children, we can speak of them having siblings too.

Revisiting the list example:

nokogiri/siblings.rb
require "nokogiri"

doc = Nokogiri::HTML(<<-DOC)
  <html>
    <body>
      <ul><li>List item one</li><li>List item two</li></ul>
    </body>
  </html>
DOC

first_li = doc.at_css("li")

second_li = first_li.next_sibling
second_li.text
# => "List item two"

first_li = second_li.previous_sibling
first_li.text
# => "List item one"

Again, we’re able to arbitrarily and repeatedly move between elements, this time by using the next_sibling and previous_sibling methods. This can come in handy if we’re interested in one element in particular, but we want to alter our behavior based on what’s around it. For example, we might want to extract captions from images, but only where they exist:

nokogiri/figcaption.rb
require "nokogiri"

doc = Nokogiri::HTML(<<-DOC)
  <html>
    <body>
      <figure>
        <img src="example.jpg"><figcaption>This image has a caption</figcaption>
      </figure>

      <figure>
        <img src="example-2.jpg">
      </figure>
    </body>
  </html>
DOC

Image = Struct.new(:file, :caption)

doc.css("img").each do |img|
  file = img["src"]

  caption = if img.next_sibling.name == "figcaption"
    img.next_sibling.text
  else
    "No caption"
  end

  Image.new(file, caption)
  # => #<struct Image file="example.jpg", caption="This image has a caption">
  # ,  #<struct Image file="example-2.jpg", caption="No caption">
end

To recap, we now know how to search through an HTML document for the tags we’re looking for, then extract information from those tags. That’s all the elements of Nokogiri that we need to know in order to scrape web pages; we have enough tools to extract all the information we could want. But we’ve learned how to find something that we’re looking for specifically, not necessarily how to discover what that specific thing might be.

There’s a first step missing here: we know how to write selectors to target elements, but we don’t know how to figure out what selectors we need to write or what logic needs to surround them. Luckily, we can use Nokogiri for this exploratory step, too.
