Chapter 4. Navigation Using Beautiful Soup

In Chapter 3, Search Using Beautiful Soup, we saw how to apply searching methods to find tags, text, and more in an HTML document. Beautiful Soup does much more than searching: it can also be used to navigate through the HTML/XML document, and it provides a set of attributes for that purpose. The searching methods can retrieve much of the same information, but in some cases, due to the structure of the page, we have to combine searching and navigation to get the desired result. In this chapter, we will look at navigation using Beautiful Soup in detail.

Navigation using Beautiful Soup

Navigation in Beautiful Soup serves much the same purpose as searching, but instead of methods, it relies on attributes. As we already saw in Chapter 2, Creating a BeautifulSoup Object, Beautiful Soup uses different TreeBuilders to build the HTML/XML tree. Each Tag or NavigableString object is a member of the resulting tree, with the BeautifulSoup object placed at the top and the other objects as the nodes of the tree.

The following code snippet is an example for an HTML tree:

html_markup = """<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist">
      <div class="name">plants</div>
      <div class="number">100000</div>
    </li>
    <li class="producerlist">
      <div class="name">algae</div>
      <div class="number">100000</div>
    </li>
  </ul>
</div>"""

For the previous code snippet, the following HTML tree is formed:

[Figure: the HTML tree built from html_markup]

In the previous figure, we can see that the BeautifulSoup object is the root of the tree, the Tag objects make up the different nodes of the tree, and the NavigableString objects make up the leaves of the tree.

Navigation in Beautiful Soup is intended to help us visit the nodes of this HTML/XML tree. From a particular node, it is possible to:

  • Navigate down to the children
  • Navigate up to the parent
  • Navigate sideways to the siblings
  • Navigate to the next and previous objects parsed

We will be using the previous html_markup as an example to discuss the different navigations using Beautiful Soup.
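The four directions listed above can be sketched in a few lines. This is a minimal illustration, not the chapter's own listing: it assumes a compact copy of html_markup with no whitespace between tags, and it uses Python's built-in "html.parser" only so that it runs without extra dependencies (unlike lxml, that parser does not wrap the fragment in <html> and <body>).

```python
from bs4 import BeautifulSoup

# Assumption: a compact copy of the chapter's markup, with no whitespace between tags.
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)

soup = BeautifulSoup(html_markup, "html.parser")
first_li = soup.li

down = first_li.div                  # down: the first <div> inside the <li>
up = first_li.parent                 # up: the enclosing <ul>
sideways = first_li.next_sibling     # sideways: the second <li>
next_parsed = first_li.next_element  # the object parsed right after the <li> tag

print(down.string)          # plants
print(up.name)              # ul
print(sideways.div.string)  # algae
print(next_parsed is down)  # True: the first child is parsed right after its parent
```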

Navigating down

Any object, such as Tag or BeautifulSoup, that has children can use this navigation. Navigating down can be achieved in two ways.

Using the name of the child tag

A BeautifulSoup or a Tag object can use the name of a child tag to navigate to it. Even if there are multiple child nodes with the same name, this method will navigate to the first instance only. For example, we can consider the BeautifulSoup object on the ecological pyramid example discussed in the previous example.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_markup, "lxml")
producer_entries = soup.ul
print(producer_entries)

In the previous code, by using soup.ul, we navigate to the first <ul> tag found anywhere below the soup object, not only among its direct children.

This can also be done for Tag objects by using the following code:

first_producer = producer_entries.li
print(first_producer)

#output
<li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
</li>

The previous code used navigation on the tag object, producer_entries, to find the first entry of the <li> tag. We can verify this from the output. But this cannot be used on a NavigableString object, as it doesn't have any children.

producer_name = first_producer.div.string

Here we stored the NavigableString plants in producer_name. Trying to navigate down from producer_name will result in an error.

producer_name.li

This will throw the following AttributeError since NavigableString can't have any child objects:

AttributeError: 'NavigableString' object has no attribute 'li'
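The behaviour described above can be guarded against in code. The following is a small sketch under stated assumptions: the one-line markup is a shortened stand-in for html_markup, and "html.parser" is used only to avoid an extra dependency. It shows that dotted access returns None when no tag with that name exists, and that a type check avoids the AttributeError raised by a NavigableString.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a shortened stand-in for the chapter's html_markup.
markup = '<ul id="producers"><li class="producerlist"><div class="name">plants</div></li></ul>'
soup = BeautifulSoup(markup, "html.parser")

# Dotted names can be chained to walk down several levels.
print(soup.ul.li.div.string)  # plants

# A tag name that does not occur below the object gives None, not an error.
missing = soup.ul.table
print(missing)  # None

# Only Tag (and BeautifulSoup) objects can be navigated by child name.
node = soup.div.string
if isinstance(node, Tag):
    print(node.li)
else:
    print("cannot navigate down from a string:", node)
```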

Using predefined attributes

Beautiful Soup stores children in predefined attributes. There are two types of children.

  • Direct children: These come immediately below a node in an HTML tree. For example, in the following figure, html is the direct child of BeautifulSoup.
    [Figure: direct child of the BeautifulSoup object]
  • Descendants: These are all the objects nested under a particular node, including its direct children. For example, in the following figure, we can see the direct child and the descendants of BeautifulSoup.
    [Figure: direct child and descendants of the BeautifulSoup object]

Descendants include all tags coming under Beautiful Soup.

Based on the previous categorization, there are the following different attributes for navigating to the children:

  • .contents
  • .children
  • .descendants

These attributes are present on every Tag object and on the BeautifulSoup object, and they facilitate navigation to the children.

The .contents attribute

The children of a Tag object or a BeautifulSoup object are stored as a list in the attribute .contents.

print(type(soup.contents))
#output
<class 'list'> 

From the output, we know that type is a list that holds the children. In this case, the number of children of the BeautifulSoup object can be understood from the following code snippet:

print(len(soup.contents))

#output
1 

We can use any type of list navigation in Python on .contents too. For example, we can print the name of all children using the following code:

for tag in soup.contents:
    print(tag.name)

#output
html

Now let us see that in the case of the Tag object producer_entries using the following code snippet:

for child in producer_entries.contents:
  print(child)
#output
<li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
</li>
<li class="producerlist">
  <div class="name">algae</div>
  <div class="number">100000</div>
</li>
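When the markup is indented, the whitespace between tags is stored in .contents as NavigableString objects alongside the tags. The following sketch, which assumes a shortened, hypothetical producer list and the built-in "html.parser", shows one way to keep only the Tag children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a shortened, indented producer list used only for illustration.
markup = """<ul id="producers">
  <li>plants</li>
  <li>algae</li>
</ul>"""
soup = BeautifulSoup(markup, "html.parser")
ul = soup.ul

# Three whitespace strings and two <li> tags.
print(len(ul.contents))  # 5

# Keep only the children that are tags.
tag_children = [child for child in ul.contents if isinstance(child, Tag)]
print([child.string for child in tag_children])  # ['plants', 'algae']
```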

The .children attribute

The .children attribute is almost the same as the .contents attribute. However, it is not a list like .contents; instead, it is an iterator that we can loop over to get each child.

print(type(soup.children))

#output
<class 'list_iterator'>

The exact class name printed here can vary between Beautiful Soup versions.

We can iterate over .children of the BeautifulSoup object, and get the children as in the example code given as follows:

for tag in soup.children:
  print(tag.name)

#output
html 
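The practical difference between the two attributes can be sketched as follows. The two-item list is an assumed, shortened example, and "html.parser" is used only so the snippet runs without lxml: .children can be consumed once, while .contents is a list that supports indexing and repeated iteration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened list used only for illustration.
soup = BeautifulSoup("<ul><li>plants</li><li>algae</li></ul>", "html.parser")
ul = soup.ul

children = ul.children
first_pass = [child.string for child in children]
second_pass = [child.string for child in children]  # the iterator is already exhausted

print(first_pass)   # ['plants', 'algae']
print(second_pass)  # []

# .contents is a real list, so indexing works.
print(ul.contents[1].string)  # algae
```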

The .descendants attribute

The .contents and .children attributes consider the immediate children only, that is, soup.contents or soup.children returned only the root HTML tag.

Navigation to all children of a particular object is possible using .descendants.

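print(len(list(soup.descendants)))
#output
14

From the output, we can see that .descendants gives 14, whereas .contents or .children gave only 1. This count assumes that html_markup contains no whitespace between tags; in the indented form shown earlier, every whitespace-only run between tags is an additional NavigableString and raises the count.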

Now let us print all descendants in this case.

from bs4.element import NavigableString
for tag in soup.descendants: 
  if isinstance(tag, NavigableString):
    print(tag)
  else:
    print(tag.name)

#output
html
body
div
ul
li
div
plants
div
100000
li
div
algae
div
100000

Here we are iterating through all the descendants of the soup object. Since a NavigableString is text rather than a tag, we check for it and print the string itself in the previous code. For a Tag object, we just print the .name attribute.

The output for the code is entirely different from the ones in which we used .contents or .children.
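The same traversal can be used to separate tags from strings. This is a minimal sketch on an assumed one-producer fragment, parsed with the built-in "html.parser" (so no <html> or <body> wrappers are added), and it also contrasts the number of direct children with the number of descendants.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumption: a one-producer fragment with no whitespace between tags.
markup = '<ul><li><div class="name">plants</div><div class="number">100000</div></li></ul>'
soup = BeautifulSoup(markup, "html.parser")

# Split the descendants into tag names and text.
tag_names = [node.name for node in soup.descendants if isinstance(node, Tag)]
texts = [str(node) for node in soup.descendants if isinstance(node, NavigableString)]

print(tag_names)  # ['ul', 'li', 'div', 'div']
print(texts)      # ['plants', '100000']

# One direct child (<li>), but five descendants (li, div, plants, div, 100000).
print(len(soup.ul.contents), len(list(soup.ul.descendants)))  # 1 5
```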

Special attributes for navigating down

Getting text data within a particular tag is one of the common use cases in scraping. Beautiful Soup provides special attributes to navigate to the string contained within each Tag object using the attributes .string and .strings.

The .string attribute

If a tag has a NavigableString as its only child, or if its only child is another tag that itself has a single NavigableString child, we can navigate to that NavigableString using the .string attribute. If the tag has more than one child, .string is None. As we know, NavigableString represents the text stored inside the tag; using .string, we navigate to the text stored inside the tag.

first_producer = soup.li
print(first_producer.div.string)

#output
plants
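The three cases for .string can be checked side by side. The fragment below is an assumption made for illustration (the <p><b> pair is hypothetical and not part of the chapter's markup), and "html.parser" is used only to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup

# Assumption: an illustrative fragment; the <p><b> pair is hypothetical.
markup = (
    '<li><div class="name">plants</div><div class="number">100000</div></li>'
    '<p><b>algae</b></p>'
)
soup = BeautifulSoup(markup, "html.parser")

print(soup.li.string)      # None: the <li> has two children
print(soup.li.div.string)  # plants: the <div> has a single string child
print(soup.p.string)       # algae: the only child is a tag with a single string
```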

The .strings attribute

Even if a tag has multiple children consisting of strings and other tags, we can still get the string of each child using the .strings generator. In the previous example, the <li> tag has two <div> tags as children, and each <div> tag contains a string. We can get these strings from the parent <li> tag using the .strings generator, which is shown as follows:

for string in soup.li.strings:
  print(string)
#output
plants
100000
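With indented markup, .strings also yields the whitespace between tags. Beautiful Soup offers the related .stripped_strings generator, which skips whitespace-only strings and strips the rest. A short sketch, assuming an indented copy of the first producer entry and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Assumption: an indented copy of the first producer entry.
markup = """<li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
</li>"""
soup = BeautifulSoup(markup, "html.parser")

raw = list(soup.li.strings)             # includes the whitespace between tags
clean = list(soup.li.stripped_strings)  # whitespace-only strings are dropped

print(len(raw))  # 5
print(clean)     # ['plants', '100000']
```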

Navigating up

Like navigating down to find children, Beautiful Soup allows users to find the parents of a particular Tag/NavigableString object. Navigating up is done using .parent and .parents.

The .parent attribute

From the first figure, we understand that all Tag and NavigableString objects have a parent. The parent of a particular Tag object can be found using the attribute .parent.

producer_entries = soup.ul
print(producer_entries.parent.name)

#output
div

The .parent attribute of the topmost tag, such as <html>, is the BeautifulSoup object itself.

html_tag = soup.html
print(html_tag.parent.name)

#output
[document]

Since the soup object is at the root of the tree, it doesn't have a parent. So .parent on the soup object will return None.

print(soup.parent)

#output
None

The .parents attribute

The .parents attribute is a generator that holds parents of a particular Tag/NavigableString.

third_div = soup.find_all("div")[2]

In the previous code, we store the third <div> entry in third_div. Counting the outer <div class="ecopyramid">, this is <div class="number">100000</div> inside the first <li> tag.

Using this we iterate through the parents of this tag.

for parent in third_div.parents:
  print(parent.name)

#output
li
ul
div
body
html
[document]

In the previous code, we navigate to the <li> tag, which is the immediate parent object of third_div, then to the <ul> tag, which is the parent of the <li> tag, and then to the outer <div> tag. Likewise, navigation continues through the <body> and <html> tags and finally reaches [document], which represents the soup object.
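The ancestor chain can also be collected into a list, and a specific ancestor can be located with find_parent(). The sketch below assumes a compact copy of html_markup and the built-in "html.parser"; because that parser adds no <html> or <body> wrappers, the chain here is shorter than the one produced by lxml.

```python
from bs4 import BeautifulSoup

# Assumption: a compact copy of the chapter's markup, with no whitespace between tags.
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html_markup, "html.parser")

number_div = soup.find("div", class_="number")

# Walk up from the tag to the root of the tree.
ancestors = [parent.name for parent in number_div.parents]
print(ancestors)  # ['li', 'ul', 'div', '[document]']

# Jump straight to the nearest ancestor with a given name.
nearest_list = number_div.find_parent("ul")
print(nearest_list["id"])  # producers
```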

Navigating sideways to the siblings

Apart from navigating through the content up and down the HTML tree, Beautiful Soup also provides attributes to find the siblings of an object. Navigating to the siblings is possible using .next_sibling and .previous_sibling.

The .next_sibling attribute

In the producer list, we can get the sibling of the first producer plants using the following code snippet:

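soup = BeautifulSoup(html_markup, "lxml")
first_producer = soup.find("li")
second_producer = first_producer.next_sibling
second_producer_name = second_producer.div.string
print(second_producer_name)

#output
algae

Here second_producer is reached by navigating to the next sibling from first_producer, which represents the first <li> tag within the page. This assumes that html_markup has no whitespace between the two <li> tags; with the indented markup, .next_sibling returns the whitespace string that sits between them.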
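When the markup is indented, the whitespace between two tags is itself a sibling. A minimal sketch of how that shows up, and of find_next_sibling() as one way around it, follows. The shortened two-producer list and the use of the built-in "html.parser" are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: a shortened, indented two-producer list.
markup = """<ul id="producers">
  <li class="producerlist"><div class="name">plants</div></li>
  <li class="producerlist"><div class="name">algae</div></li>
</ul>"""
soup = BeautifulSoup(markup, "html.parser")
first_producer = soup.find("li")

# The immediate sibling is the newline and indentation before the second <li>.
print(repr(first_producer.next_sibling))

# find_next_sibling() skips ahead to the next sibling that is a matching tag.
second_producer = first_producer.find_next_sibling("li")
print(second_producer.div.string)  # algae
```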

The .previous_sibling attribute

The .previous_sibling attribute is used to navigate to the previous sibling. For finding the previous sibling in the previous example, we can use the following code snippet:

print(second_producer.previous_sibling)

#output
<li class="producerlist"><div class="name">plants</div><div class="number">100000</div></li>

If a tag doesn't have a previous sibling, .previous_sibling returns None; that is, print(first_producer.previous_sibling) will give us None, since there is no previous sibling for this tag.

We have next_siblings and previous_siblings generators to iterate over the next and previous siblings from a particular object.

for previous_sibling in second_producer.previous_siblings:
  print(previous_sibling.name)

The previous code snippet will give only the <li> tag, which is the only previous sibling. The same iteration can be used for next_siblings to find the siblings coming after an object.
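Both generators can be sketched on a slightly longer list. The third entry below is hypothetical and is added only so that each direction yields more than one sibling; "html.parser" is used to avoid an extra dependency. Note that .previous_siblings starts with the nearest sibling and moves backwards.

```python
from bs4 import BeautifulSoup

# Assumption: the third <li> is a hypothetical entry added for illustration.
soup = BeautifulSoup("<ul><li>plants</li><li>algae</li><li>moss</li></ul>", "html.parser")

first_li = soup.li
last_li = soup.find_all("li")[-1]

following = [sibling.string for sibling in first_li.next_siblings]
preceding = [sibling.string for sibling in last_li.previous_siblings]

print(following)  # ['algae', 'moss']
print(preceding)  # ['algae', 'plants'] (nearest sibling first)
```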

Navigating to the previous and next objects parsed

We saw different ways of navigating to the children, siblings, and parents. Sometimes we may need to navigate to objects that are not directly related to the tag as a child, sibling, or parent. The element parsed immediately after an object can be found using .next_element.

For example, the immediate element parsed after the first <li> tag is the <div> tag.

first_producer = soup.li
print(first_producer.next_element)

#output
<div class="name">plants</div>

The previous code prints the next element, which is <div class="name">plants</div>.

Tip

.next_element and .next_sibling are entirely different. .next_element points to the object that is parsed immediately after the current object, whereas .next_sibling points to the object that is at the same level in the tree.
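The difference can be seen by calling both attributes on the same tag. This sketch assumes a stripped-down two-item list with no whitespace between tags, parsed with the built-in "html.parser". It also uses the .next_elements generator to list everything parsed after the first <li> tag, in parse order.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a stripped-down two-item list with no whitespace between tags.
markup = "<ul><li><div>plants</div></li><li><div>algae</div></li></ul>"
soup = BeautifulSoup(markup, "html.parser")
first_li = soup.li

print(first_li.next_element)  # <div>plants</div> (parsed right after <li>)
print(first_li.next_sibling)  # <li><div>algae</div></li> (same level in the tree)

# Everything parsed after the first <li>, in document order.
parse_order = [
    element.name if isinstance(element, Tag) else str(element)
    for element in first_li.next_elements
]
print(parse_order)  # ['div', 'plants', 'li', 'div', 'algae']
```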

In the same way, .previous_element can be used to find the immediate element parsed before a particular tag or string.

second_div = soup.li.find_all("div")[1]
print(second_div.previous_element)

#output
plants

From the output, it is clear that the object parsed immediately before the second <div> tag of the first <li> tag is the string plants.
