In Chapter 3, Search Using Beautiful Soup, we saw how to apply searching methods to search tags, texts, and more in an HTML document. Beautiful Soup does much more than just searching: it can also be used to navigate through the HTML/XML document, and it provides attributes for that purpose. The searching methods can find much of the same information, but in some cases the structure of the page forces us to combine searching and navigation to get the desired result. In this chapter, we will look at navigation using Beautiful Soup in detail.
Navigation in Beautiful Soup is similar to searching, except that it relies on attributes rather than methods. As we already saw in Chapter 2, Creating a BeautifulSoup Object, Beautiful Soup uses a TreeBuilder
to build the HTML/XML tree. So each Tag
or NavigableString
object will be a member of the resulting tree with the Beautiful Soup object placed at the top and other objects as the nodes of the tree.
The following code snippet is an example for an HTML tree:
html_markup = """<div class="ecopyramid"> <ul id="producers"> <li class="producerlist"> <div class="name">plants</div> <div class="number">100000</div> </li> <li class="producerlist"> <div class="name">algae</div> <div class="number">100000</div> </li> </ul> </div>"""
For the previous code snippet, the following HTML tree is formed:
In the previous figure, we can see that Beautiful Soup is the root of the tree, the Tag
objects make up the different nodes of the tree, while NavigableString
objects make up the leaves of the tree.
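The tree described above can be made visible with a short recursive walk. The following is a minimal sketch, not part of the original example: it assumes a compact copy of html_markup without whitespace between tags, uses Python's built-in html.parser (which, unlike lxml, does not add <html> and <body> wrappers), and introduces a hypothetical helper named outline.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Compact copy of the chapter's markup (assumption: no whitespace between tags).
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)


def outline(node, depth=0, lines=None):
    """Hypothetical helper: collect one indented line per node of the tree."""
    if lines is None:
        lines = []
    if isinstance(node, NavigableString):
        # Leaves of the tree: show the text itself.
        lines.append("  " * depth + repr(str(node)))
    else:
        # BeautifulSoup and Tag objects: show the name, then recurse into children.
        lines.append("  " * depth + node.name)
        for child in node.contents:
            outline(child, depth + 1, lines)
    return lines


soup = BeautifulSoup(html_markup, "html.parser")
tree_lines = outline(soup)
print("\n".join(tree_lines))
```

The first line printed is [document], the name of the BeautifulSoup object at the root; the strings appear at the deepest indentation because they are the leaves.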
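Navigation in Beautiful Soup is intended to help us visit the nodes of this HTML/XML tree. From a particular node, it is possible to navigate down to the children, up to the parents, sideways to the siblings, and to the elements parsed immediately before or after it.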
We will be using the previous html_markup
as an example to discuss the different navigations using Beautiful Soup.
Any object, such as Tag
or BeautifulSoup
, that has children can use this navigation. Navigating down can be achieved in two ways.
A BeautifulSoup
or a Tag
object can use the name of a child tag to navigate to it. Even if there are multiple child nodes with the same name, this method will navigate to the first instance only. For example, we can consider the BeautifulSoup
object on the ecological pyramid example discussed in the previous example.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_markup, "lxml")
producer_entries = soup.ul
print(producer_entries)
In the previous code, by using soup.ul
, we navigate to the first entry of the ul
tag within the soup object's children.
This can also be done for Tag
objects by using the following code:
first_producer = producer_entries.li
print(first_producer)
#output
<li class="producerlist"> <div class="name">plants</div> <div class="number">100000</div> </li>
The previous code used navigation on the tag object, producer_entries
, to find the first entry of the <li>
tag. We can verify this from the output. But this cannot be used on a NavigableString
object, as it doesn't have any children.
producer_name = first_producer.div.string
Here we stored the NavigableString
plants in producer_name
. Trying to navigate down from producer_name
will result in an error.
producer_name.li
This will throw the following AttributeError
since NavigableString
can't have any child objects:
AttributeError: 'NavigableString' object has no attribute 'li'
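The three outcomes of navigating by tag name can be compared side by side. This is a small sketch that assumes a compact copy of html_markup and the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Compact copy of the chapter's markup (assumption: no whitespace between tags).
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html_markup, "html.parser")

# Tag names can be chained to walk down the tree one step at a time.
first_name = soup.ul.li.div.string
print(first_name)

# A tag name that does not occur below the node gives None, not an error.
missing = soup.table
print(missing)

# A NavigableString has no children, so a tag-name lookup raises AttributeError.
try:
    first_name.li
    string_lookup_failed = False
except AttributeError:
    string_lookup_failed = True
print(string_lookup_failed)
```

Because a missing tag gives None, a long chain such as soup.table.tr would fail on the second step, so it is worth checking intermediate results when the page structure is not guaranteed.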
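Beautiful Soup stores children in predefined attributes. There are two types of children: direct children, which sit immediately below a node in the tree, and descendants.
Descendants include every tag and string nested at any depth under a node, including its direct children.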
Based on the previous categorization, there are the following different attributes for navigating to the children:
.contents
.children
.descendants
These attributes are present on every Tag
object and on the BeautifulSoup
object, and they facilitate navigation to the children.
The children of a Tag
object or a BeautifulSoup
object are stored as a list in the attribute .contents
.
print(type(soup.contents)) #output <class 'list'>
From the output, we can see that .contents
is a list that holds the children. In this case, the number of children of the BeautifulSoup
object can be understood from the following code snippet:
print(len(soup.contents))
#output
1
We can use any type of list navigation in Python on .contents
too. For example, we can print the name of all children using the following code:
for tag in soup.contents: print(tag.name) #output html
Now let us see that in the case of the Tag
object producer_entries
using the following code snippet:
for child in producer_entries.contents:
    print(child)
#output
<li class="producerlist"> <div class="name">plants</div> <div class="number">100000</div> </li>
<li class="producerlist"> <div class="name">algae</div> <div class="number">100000</div> </li>
The .children
attribute is almost the same as the .contents
attribute. But it is not a list like .contents
; instead, it is an iterator that we can loop over to get each child in turn.
print(type(soup.children))
#output (the exact iterator type depends on the Beautiful Soup version)
<class 'list_iterator'>
We can iterate over .children
of the BeautifulSoup
object, and get the children as in the example code given as follows:
for tag in soup.children: print(tag.name) #output html
The .contents
and .children
attributes consider the immediate children only, that is, soup.contents
or soup.children
returned only the root HTML tag.
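The practical difference between the two attributes shows up when we need list operations. The sketch below assumes a compact copy of html_markup and the built-in html.parser, and works on the <ul> tag rather than the soup object so that there is more than one child to look at.

```python
from bs4 import BeautifulSoup

# Compact copy of the chapter's markup (assumption: no whitespace between tags).
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html_markup, "html.parser")
producer_entries = soup.ul

# .contents is a real list, so len(), indexing, and slicing all work.
child_count = len(producer_entries.contents)
last_child = producer_entries.contents[-1]

# .children yields the same nodes one at a time and has no len() or indexing.
child_names = [child.name for child in producer_entries.children]

print(child_count)
print(last_child.div.string)
print(child_names)
```

Both attributes stop at the immediate children: the <div> tags inside each <li> do not appear in either result.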
Navigation to all children of a particular object is possible using .descendants
.
print(len(list(soup.descendants))) #output 13
From the output, we can see that .descendants
gives 13
, whereas .contents
or .children
gave only 1
.
Now let us print all descendants in this case.
from bs4.element import NavigableString

for tag in soup.descendants:
    if isinstance(tag, NavigableString):
        print(tag)
    else:
        print(tag.name)
#output
html body div ul li div plants div 100000 li div algae div 100000
Here we are iterating through all the descendants of the soup
object. Since NavigableString
has no tag name to print through its .name
attribute, we check for it and print the string itself in the previous code. But for a Tag
object, we just print the .name
attribute.
The output for the code is entirely different from the ones in which we used .contents
or .children
.
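To see exactly what .descendants adds over .children, the nodes can be separated into tags and strings. This sketch assumes a compact copy of html_markup and the built-in html.parser, so the counts differ from the lxml-based figures above: html.parser adds no <html> or <body> wrappers.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact copy of the chapter's markup (assumption: no whitespace between tags).
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html_markup, "html.parser")

direct = list(soup.children)         # immediate children only
everything = list(soup.descendants)  # every nested tag and string

# Split the descendants into tags and strings, preserving document order.
tag_names = [node.name for node in everything if isinstance(node, Tag)]
texts = [str(node) for node in everything if not isinstance(node, Tag)]

print(len(direct), len(everything))
print(tag_names)
print(texts)
```

The descendants come back in document order, which is the order in which the parser encountered them.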
Getting text data within a particular tag is one of the common use cases in scraping. Beautiful Soup provides special attributes to navigate to the string contained within each Tag
object using the attributes .string
and .strings
.
If a tag has a NavigableString
as its only child, or if its only child is another tag that in turn has a NavigableString
object as its only child, we can navigate to that NavigableString
using the .string
attribute; in any other case, .string is None. As we know, NavigableString
represents the text stored inside the tag; using .string
, we navigate to the text stored inside the tag.
first_name_div = soup.li.div
print(first_name_div.string)
#output
plants
Even if a tag has multiple children, comprising strings and other tags, we can still get every string nested inside it using the .strings
generator. In the previous example, we have the <li>
tag with two <div>
tags as children. These <div>
tags contain strings. We can get these strings from the parent <li>
tag using the .strings
generator, which is shown as follows:
li = soup.li
for string in li.strings:
    print(string)
#output
plants
100000
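When the markup is indented, .strings also yields the whitespace between tags. The following sketch assumes a trimmed, indented copy of the producer list and the built-in html.parser; it contrasts .string, .strings, and the related .stripped_strings generator, which skips whitespace-only strings and strips the rest.

```python
from bs4 import BeautifulSoup

# Indented, trimmed copy of the producer list (assumption for illustration).
html_markup = """<ul id="producers">
  <li class="producerlist">
    <div class="name">plants</div>
    <div class="number">100000</div>
  </li>
</ul>"""
soup = BeautifulSoup(html_markup, "html.parser")
first_li = soup.li

# .string is None here because the <li> tag has more than one child.
li_string = first_li.string

# The inner <div> has a single string child, so .string returns it.
name_string = first_li.div.string

# .strings includes the whitespace between tags; .stripped_strings does not.
raw_strings = list(first_li.strings)
clean_strings = list(first_li.stripped_strings)

print(li_string)
print(name_string)
print(clean_strings)
```

For scraping, .stripped_strings is usually the more convenient of the two generators, since real pages are rarely free of whitespace.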
Like navigating down to find children, Beautiful Soup allows users to find the parents of a particular Tag
/NavigableString
object. Navigating up is done using .parent
and .parents
.
From the first figure, we understand that all Tag
and NavigableString
objects have a parent. The parent of a particular Tag
object can be found using the attribute .parent
.
producer_entries = soup.ul
print(producer_entries.parent.name)
#output
div
The .parent
attribute of the top most <html>/<xml>
tag is the BeautifulSoup
object itself.
html_tag = soup.html
print(html_tag.parent.name)
#output
[document]
Since the soup
object is at the root of the tree, it doesn't have a parent. So .parent
on the soup
object will return None
.
print(soup.parent) #output None
The
.parents
attribute is a generator that holds parents of a particular Tag
/NavigableString
.
third_div = soup.ul.find_all("div")[2]
In the previous code, we store the third <div>
entry within the <ul> list, which is <div class="name">algae</div>
in third_div
.
Using this we iterate through the parents of this tag.
for parent in third_div.parents: print(parent.name) #output li ul div body html [document]
In the previous code, we navigate to the <li>
tag, which is the immediate parent object of third_div
, then to the <ul>
tag, which is the parent of the <li>
tag, and on to the outer <div> tag that wraps the list. Likewise, navigation to the <body> and <html>
tags and finally [document]
, which represents the soup
object, is done.
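A common use of .parents is to work out where a tag sits in the document. The sketch below assumes a compact copy of html_markup and the built-in html.parser, so the ancestor chain has no <body> or <html> entries; algae_div and path are illustrative names.

```python
from bs4 import BeautifulSoup

# Compact copy of the chapter's markup (assumption: no whitespace between tags).
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html_markup, "html.parser")

# Pick the <div> that holds the second producer's name.
algae_div = soup.find_all("div", class_="name")[1]

# .parents walks from the immediate parent up to the BeautifulSoup object.
ancestor_names = [parent.name for parent in algae_div.parents]

# Drop the trailing [document] entry and reverse to get a root-to-tag path.
path = " > ".join(list(reversed(ancestor_names[:-1])) + [algae_div.name])

print(ancestor_names)
print(path)
```

The last item yielded by .parents is always the BeautifulSoup object, whose .parent is None, which is why the iteration stops there.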
Apart from navigating through the content up and down the HTML tree, Beautiful Soup also provides navigation attributes to find the siblings of an object. Navigating to the siblings is possible using .next_sibling
and .previous_sibling
.
In the producer list, we can get the sibling of the first producer plants
using the following code snippet:
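soup = BeautifulSoup(html_markup, "lxml")
first_producer = soup.find("li")
# .next_sibling returns the very next node, so this assumes there is no whitespace between the <li> tags
second_producer = first_producer.next_sibling
second_producer_name = second_producer.div.string
print(second_producer_name)
#output
algae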
Here second_producer
is reached by navigating to the next sibling from first_producer
, which represents the first <li>
tag within the page.
The .previous_sibling
attribute is used to navigate to the previous sibling. For finding the previous sibling in the previous example, we can use the following code snippet:
print(second_producer.previous_sibling)
#output
<li class="producerlist"><div class="name">plants</div><div class="number">100000</div></li>
If a tag doesn't have a previous sibling, it will return None
, that is print(first_producer.previous_sibling)
will give us None
since there is no previous sibling for this tag.
We have next_siblings
and previous_siblings
generators to iterate over the next and previous siblings from a particular object.
for previous_sibling in second_producer.previous_siblings: print(previous_sibling.name)
The previous code snippet will give only the <li>
tag, which is the only previous sibling. The same iteration can be used for next_siblings
to find the siblings coming after an object.
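Sibling attributes return whatever node comes next, and in indented markup that is often a whitespace string rather than a tag. The following sketch assumes a trimmed, indented copy of the producer list and the built-in html.parser; it uses the find_next_sibling() and find_previous_sibling() search methods to skip over such strings.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Indented, trimmed copy of the producer list (assumption for illustration).
html_markup = """<ul id="producers">
  <li class="producerlist"><div class="name">plants</div></li>
  <li class="producerlist"><div class="name">algae</div></li>
</ul>"""
soup = BeautifulSoup(html_markup, "html.parser")
first_producer = soup.find("li")

# The immediate sibling is the whitespace between the two <li> tags.
raw_sibling = first_producer.next_sibling

# find_next_sibling() searches the following siblings for a matching tag.
second_producer = first_producer.find_next_sibling("li")

# Alternatively, filter the .next_siblings generator down to Tag objects.
tag_siblings = [s for s in first_producer.next_siblings if isinstance(s, Tag)]

print(repr(raw_sibling))
print(second_producer.div.string)
print(second_producer.find_previous_sibling("li").div.string)
```

This is one of the cases mentioned at the start of the chapter where searching and navigation have to be combined to get the desired result.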
We saw different ways of navigating to the children, siblings, and parents. Sometimes we may need to navigate to objects that are not directly related to the tag as children, siblings, or parents. The element that is parsed immediately after our object can be found using .next_element
.
For example, the immediate element parsed after the first <li>
tag is the <div>
tag.
first_producer = soup.li
print(first_producer.next_element)
#output
<div class="name">plants</div>
The previous code prints the next element, which is <div class="name">plants</div>
.
In the same way, .previous_element
can be used to find the immediate element parsed before a particular tag or string.
second_div = soup.li.find_all("div")[1]
print(second_div.previous_element)
#output
plants
From the output, it is clear that the one parsed immediately before the second <div>
tag is the string plants
.
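The difference between .next_element and .next_sibling is easiest to see by following both from the same tag. This sketch assumes a compact copy of html_markup and the built-in html.parser; it follows .next_element five times from the first <li> tag and records what it meets.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact copy of the chapter's markup (assumption: no whitespace between tags).
html_markup = (
    '<div class="ecopyramid"><ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html_markup, "html.parser")
first_li = soup.li

# Follow .next_element for five steps, recording tag names and strings.
parse_order = []
node = first_li
for _ in range(5):
    parse_order.append(node.name if isinstance(node, Tag) else str(node))
    node = node.next_element

# .next_sibling skips over everything nested inside the first <li> tag.
sibling_name = first_li.next_sibling.div.string

print(parse_order)
print(sibling_name)
```

Here .next_element descends into the first <li> tag and visits its contents in parse order, whereas .next_sibling jumps straight to the second <li> tag.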