We saw the creation of the BeautifulSoup object and other objects, such as Tag and NavigableString in Chapter 2, Creating a BeautifulSoup Object. The HTML/XML document is converted to these objects for the ease of searching and navigating within the document.
In this chapter, we will learn the different searching methods provided by Beautiful Soup to search based on tag name, attribute values of tag, text within the document, regular expression, and so on. At the end, we will make use of these searching methods to scrape data from an online web page.
Beautiful Soup helps in scraping information from web pages. Useful information is scattered across web pages as text or attribute values of different tags. In order to scrape such pages, it is necessary to search through the entire page for different tags based on the attribute values or tag name or texts within the document. To facilitate this, Beautiful Soup comes with inbuilt search methods listed as follows:
find()
find_all()
find_parent()
find_parents()
find_next_sibling()
find_next_siblings()
find_previous_sibling()
find_previous_siblings()
find_previous()
find_all_previous()
find_next()
find_all_next()
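Before going into each method, a minimal sketch of the calling convention they all share may help (the tiny HTML snippet here is ours, not the chapter's example file, and we use Python's built-in html.parser instead of lxml so the sketch has no extra dependency):

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, just to show the calling convention.
html = '<ul id="producers"><li>plants</li><li>algae</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")    # find() returns the first match only
all_li = soup.find_all("li")  # find_all() returns every match
print(first_li.string)        # plants
print(len(all_li))            # 2
```

All of the other methods in the preceding list accept the same kinds of filters; they differ only in which part of the tree they search.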
In this chapter, we will use the following HTML code to explain searching with Beautiful Soup. We can save it as an HTML file named ecologicalpyramid.html inside the Soup directory we created in the previous chapter.
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
  <li class="producerlist">
    <div class="name">plants</div>
    <div class="number">100000</div>
  </li>
  <li class="producerlist">
    <div class="name">algae</div>
    <div class="number">100000</div>
  </li>
</ul>
<ul id="primaryconsumers">
  <li class="primaryconsumerlist">
    <div class="name">deer</div>
    <div class="number">1000</div>
  </li>
  <li class="primaryconsumerlist">
    <div class="name">rabbit</div>
    <div class="number">2000</div>
  </li>
</ul>
<ul id="secondaryconsumers">
  <li class="secondaryconsumerlist">
    <div class="name">fox</div>
    <div class="number">100</div>
  </li>
  <li class="secondaryconsumerlist">
    <div class="name">bear</div>
    <div class="number">100</div>
  </li>
</ul>
<ul id="tertiaryconsumers">
  <li class="tertiaryconsumerlist">
    <div class="name">lion</div>
    <div class="number">80</div>
  </li>
  <li class="tertiaryconsumerlist">
    <div class="name">tiger</div>
    <div class="number">50</div>
  </li>
</ul>
</div>
</body>
</html>
The preceding HTML is a simple representation of the ecological pyramid. To find the first producer, primary consumer, or secondary consumer, we can use the Beautiful Soup search methods. In general, to find the first entry of any tag within a BeautifulSoup object, we can use the find() method.
In the ecological pyramid HTML, we can easily see that the producers are within the first <ul> tag. Since the producers come as the first entry for the <ul> tag within the whole HTML document, it is easy to find the first producer using the find() method. The HTML tree that represents the first producer is shown in the following diagram:
Now, we can change to the Soup directory using the command cd Soup. We can save the following code as ecologicalpyramid.py and run it with python ecologicalpyramid.py, or we can run it from the Python interpreter. The following code creates a BeautifulSoup object from the ecologicalpyramid.html file:
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html","r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid,"lxml")

producer_entries = soup.find("ul")
print(producer_entries.li.div.string)
#output
plants
Since producers come as the first entry for the <ul> tag, we can use the find() method, which normally searches for only the first occurrence of a particular tag in a BeautifulSoup object. We store this in producer_entries. The next line prints the name of the first producer. From the previous HTML diagram, we can see that the first producer is stored inside the first <div> tag of the first <li> tag that immediately follows the first <ul> tag, as shown in the following code:
<ul id="producers">
  <li class="producerlist">
    <div class="name">plants</div>
    <div class="number">100000</div>
  </li>
</ul>
So, after running the preceding code, we will get plants, the first producer, as the output.
At this point, we know that find() is used to search for the first occurrence of any item within a BeautifulSoup object. The signature of the find() method is as follows:
find(name,attrs,recursive,text,**kwargs)
As the signature implies, the find() method accepts the parameters name, attrs, recursive, text, and **kwargs. The name, attrs, and text parameters are the filters that can be applied in a find() method.
Different filters can be applied on find() for the following cases:

Searching with the name parameter
Searching with the text parameter
Searching with the attrs parameter

Finding the first producer was an example of a simple filter applied using the find() method. We passed the string ul, which is the name of the <ul> tag, to the find() method. Likewise, we can pass any tag name to the find() method to get its first occurrence. In this case, find() returns a Beautiful Soup Tag object. For example, refer to the following code:
tag_li = soup.find("li")
print(type(tag_li))
#output
<class 'bs4.element.Tag'>
The preceding code finds the first occurrence of the li tag within the HTML document and then prints the type of tag_li.
This can also be achieved by passing the name argument as follows:
tag_li = soup.find(name="li")
print(type(tag_li))
#output
<class 'bs4.element.Tag'>
By default, find() returns the first Tag object whose name equals the string we passed.
If we pass a string to the find() method, it will search for tags with that name by default. But if we want to search only for text within the BeautifulSoup object, we can use it as follows:
search_for_stringonly = soup.find(text="fox")
print(search_for_stringonly)
#output
fox
The preceding code will search for the occurrence of the fox text within the ecological pyramid. Searching for text using Beautiful Soup is case sensitive; for example, case_sensitive_string = soup.find(text="Fox") will return None.
The find() method can also search based on a regular expression. This comes in handy when we have an HTML page with no regular pattern, unlike the preceding producer example.
Let us take an example where we are given a page with e-mail IDs, as mentioned in the following code, and we are asked to find the first e-mail ID:
<br/>
<div>The below HTML has the information that has email ids.</div>
[email protected]
<div>[email protected]</div>
<span>[email protected]</span>
Here, the e-mail IDs are scattered across the page: one inside the <div> tag, another inside the <span> tag, and the first one not enclosed by any tag. It is difficult to find the first e-mail ID here. But if we can represent the e-mail ID using a regular expression, find() can search based on that expression to get the first e-mail ID.
So in this case, we just need to form the regular expression for the e-mail ID and pass it to the find() method, which will use the regular expression to find the first matching text.
Let us find the first e-mail ID using the following code:
import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div>
[email protected]
<div>[email protected]</div>
<span>[email protected]</span>
"""
soup = BeautifulSoup(email_id_example,"lxml")
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)
#output
[email protected]
In the preceding code, we stored the regular expression for the e-mail ID in the emailid_regexp variable. The pattern we used is \w+@\w+\.\w+: one or more alphanumeric characters (\w+), followed by @, then one or more alphanumeric characters, a . character, and again one or more alphanumeric characters. This matches the e-mail IDs in the preceding example. We then passed the emailid_regexp variable to the find() method to find the first text that matches the pattern.
We can use find() to search based on the attribute values of a tag. In the ecological pyramid example, the primary consumers are within the <ul> tag with the primaryconsumers ID. In this case, it is easy to call find() with the attribute value that we are looking for as an argument.
Finding the producer was an easy task, since it was the first entry for the <ul> tag. But what about the first primary consumer? It is not inside the first <ul> tag. By careful analysis, we can see that the primary consumers are inside the second <ul> tag, which has the id="primaryconsumers" attribute. In this case, we can make Beautiful Soup search based on the attribute value, assuming we have already created the soup object. In the following code, we store the first occurrence of the tag with the id="primaryconsumers" attribute in primary_consumers:
primary_consumers = soup.find(id="primaryconsumers")
print(primary_consumers.li.div.string)
#output
deer
If we analyze the HTML, we can see that the first primary consumer is stored as follows:
<ul id="primaryconsumers">
  <li class="primaryconsumerlist">
    <div class="name">deer</div>
    <div class="number">1000</div>
  </li>
</ul>
We can see that the first primary consumer is stored inside the first <div> tag of the first <li> tag. The second line prints the string stored inside this <div> tag, which is the name of the first primary consumer, deer.
Searching based on attribute values will work for most attributes, such as id, style, and title. But there are exceptions in the case of a couple of attributes, namely custom attributes and class, as explained next.
In these cases, although we can't go directly with the attribute-value-based search, we can use the attrs argument that can be passed to the find() method.
In HTML5, it is possible to add custom attributes, such as data-custom, to a tag. If we want Beautiful Soup to search based on these attributes, we cannot use them the same way we used the id attribute.
For searching based on the id attribute, we used the following code line:
soup.find(id="primaryconsumer")
But if we use the attribute value the same way for the following HTML, the code will throw an error (SyntaxError: keyword can't be an expression):
customattr = """<p data-custom="custom">custom attribute example</p>"""
customsoup = BeautifulSoup(customattr,'lxml')
customsoup.find(data-custom="custom")
The error is thrown because Python identifiers cannot contain a - character, so data-custom is not a valid keyword argument name.
In such cases, we need to pass the attribute and value as a dictionary in the attrs argument as follows:
using_attrs = customsoup.find(attrs={'data-custom':'custom'})
print(using_attrs)
#output
<p data-custom="custom">custom attribute example</p>
class is a reserved keyword in Python, so we cannot use it as a keyword argument either. For CSS classes, we can therefore use the same approach as we did for custom attributes, as shown in the following code:
css_class = soup.find(attrs={'class':'primaryconsumerlist'})
print(css_class)
#output
<li class="primaryconsumerlist">
  <div class="name">deer</div>
  <div class="number">1000</div>
</li>
Since searching based on class is a common task, Beautiful Soup has a special keyword argument for matching the CSS class: class_. Since class_ is not a reserved keyword in Python, it won't throw an error.
The following two code lines are the same:

css_class = soup.find(class_="primaryconsumerlist")
css_class = soup.find(attrs={'class':'primaryconsumerlist'})
We can pass a function to the find() method to search based on conditions defined within the function. The function should return True or False; find() will return the first tag for which the function returns True.
Let's take an example of finding the secondary consumers using a function within the find() method:
def is_secondary_consumers(tag):
    return tag.has_attr('id') and tag.get('id') == 'secondaryconsumers'
The function checks whether the tag has the id attribute and whether its value is secondaryconsumers. If both conditions are met, the function returns True, and so we get the particular tag we are looking for in the following code:
secondary_consumer = soup.find(is_secondary_consumers)
print(secondary_consumer.li.div.string)
#output
fox
We call the find() method by passing the function that returns either True or False; the first tag for which the function returns True is returned, which in our case corresponds to the tag holding the secondary consumers.
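Since any callable that takes a tag and returns True or False works here, the same search can also be written inline with a lambda (our variant of the preceding function, shown on a small stand-in snippet with the built-in html.parser):

```python
from bs4 import BeautifulSoup

html = """<ul id="producers"><li>plants</li></ul>
<ul id="secondaryconsumers"><li><div>fox</div></li></ul>"""
soup = BeautifulSoup(html, "html.parser")

# The lambda plays the same role as is_secondary_consumers()
secondary_consumer = soup.find(
    lambda tag: tag.has_attr('id') and tag.get('id') == 'secondaryconsumers')
print(secondary_consumer.li.div.string)  # fox
```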
We saw how to search based on text, tag, attribute value, regular expression, and so on. Beautiful Soup also helps in searching based on the combination of any of these methods.
In the preceding example, we discussed searching based on the attribute value. It was easy since the attribute value was present on only one type of tag (for example, the id="secondaryconsumers" value was present only on a <ul> tag).
But, what if there were multiple tags with the same attribute value? For example, refer to the following code:
<p class="identical">
  Example of p tag with class identical
</p>
<div class="identical">
  Example of div tag with class identical
</div>
Here, we have a div tag and a p tag with the same class attribute value, identical. If we want to search only for the div tag with the class attribute value identical, we can combine a tag-based and an attribute-value-based search within the find() method.
Let us see how we can search based on the preceding combination:
identical_div = soup.find("div", class_='identical')
print(identical_div)
#output
<div class="identical">
  Example of div tag with class identical
</div>
Similarly, we can have any combination of the searching methods.
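For instance, a tag name, an attribute value, and a text pattern can all be combined in a single call. The following sketch (ours, reusing the two-tag snippet with the built-in html.parser) narrows the search by all three at once:

```python
import re
from bs4 import BeautifulSoup

html = """<p class="identical">Example of p tag with class identical</p>
<div class="identical">Example of div tag with class identical</div>"""
soup = BeautifulSoup(html, "html.parser")

# name + attribute value + text pattern in one find() call
combined = soup.find("div", class_="identical", text=re.compile("div tag"))
print(combined.name)  # div
```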
The find() method was used to find the first result matching the search criteria we applied on a BeautifulSoup object. As the name implies, find_all() will give us all the items matching the search criteria we defined. The different filters that we saw for find() can also be used in the find_all() method. In fact, these filters can be used in any of the searching methods, such as find_parents() and find_next_siblings().
Let us consider an example of using find_all(). We saw how to find the first and second primary consumers. If we need to find all the tertiary consumers, we can't use find(); in this case, find_all() comes in handy.
all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")
The preceding code line finds all the tags with the tertiaryconsumerlist class. If we check the type of this variable, we can see that it is a ResultSet, which is a subclass of list holding Tag objects:

print(type(all_tertiaryconsumers))
#output
<class 'bs4.element.ResultSet'>
We can iterate through this list to display all tertiary consumer names by using the following code:
for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
#output
lion
tiger
Like find(), the find_all() method has a similar set of parameters, with an extra parameter, limit, as shown in the following signature:
find_all(name,attrs,recursive,text,limit,**kwargs)
The limit parameter is used to cap the number of results that we get. For example, from the e-mail ID sample we saw, we can use find_all() to get all the e-mail IDs. Refer to the following code:
email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)
#output
['[email protected]', '[email protected]', '[email protected]']
Here, if we pass limit, it will limit the result set to the limit we impose, as shown in the following example:
email_ids_limited = soup.find_all(text=emailid_regexp,limit=2)
print(email_ids_limited)
#output
['[email protected]', '[email protected]']
From the output, we can see that the result is limited to two.
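In fact, find() behaves much like find_all() with limit=1, except that find() returns the object itself rather than a one-element list. A quick sketch (our snippet, using the built-in html.parser):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>plants</li><li>algae</li></ul>", "html.parser")

first_via_find = soup.find("li")
first_via_limit = soup.find_all("li", limit=1)
# Both locate the same Tag; find_all wraps it in a list
print(first_via_limit[0] is first_via_find)  # True
print(len(first_via_limit))                  # 1
```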
We can also pass True or False values to the find methods. If we pass True to find_all(), it will return all tags in the soup object. In the case of find(), it will return the first tag within the object. For example, the print(soup.find_all(True)) line of code will print out all the tags in the soup object.
In the case of searching for text, passing True will return all text within the document as follows:
all_texts = soup.find_all(text=True)
print(all_texts)
#output
[' ', ' ', ' ', ' ', ' ', 'plants', ' ', '100000', ' ', ' ', ' ', 'algae', ' ', '100000', ' ', ' ', ' ', ' ', ' ', 'deer', ' ', '1000', ' ', ' ', ' ', 'rabbit', ' ', '2000', ' ', ' ', ' ', ' ', ' ', 'fox', ' ', '100', ' ', ' ', ' ', 'bear', ' ', '100', ' ', ' ', ' ', ' ', ' ', 'lion', ' ', '80', ' ', ' ', ' ', 'tiger', ' ', '50', ' ', ' ', ' ', ' ', ' ']
The preceding output prints every text content within the soup object, including the whitespace and new-line characters.
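If the whitespace-only strings are noise, one common workaround (ours, not from the chapter) is to filter them out after the search, or to use a tag's stripped_strings generator, which Beautiful Soup provides for exactly this purpose:

```python
from bs4 import BeautifulSoup

html = """<ul id="producers">
  <li><div class="name">plants</div></li>
  <li><div class="name">algae</div></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# Drop strings that are only whitespace
texts = [t for t in soup.find_all(text=True) if t.strip()]
print(texts)                        # ['plants', 'algae']
print(list(soup.stripped_strings))  # same result here
```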
Also, in the case of text, we can pass a list of strings, and find_all() will find every string in the list:
all_texts_in_list = soup.find_all(text=["plants","algae"])
print(all_texts_in_list)
#output
['plants', 'algae']
The same applies to searching for tags, attribute values of tags, custom attributes, and CSS classes.
For finding all the div and li tags, we can use the following code line:
div_li_tags = soup.find_all(["div","li"])
Similarly, for finding tags with the producerlist and primaryconsumerlist classes, we can use the following code line:
all_css_class = soup.find_all(class_=["producerlist","primaryconsumerlist"])
Both find() and find_all() search through all of an object's descendants: its direct children, their children, and so on. We can control this behavior using the recursive parameter. If recursive=False, the search happens only among the object's direct children.
For example, in the following code, the search for div and li tags happens only among the direct children. Since the direct child of the soup object is html, the following code will give an empty list:

div_li_tags = soup.find_all(["div","li"],recursive=False)
print(div_li_tags)
#output
[]
If find_all() can't find any results, it returns an empty list, whereas find() returns None.
Searching for contents within an HTML page is easy using the find() and find_all() methods. In complex web scraping projects, we often also need to visit the subsequent tags or the parent tags of a tag for extra information.
These tags that we intend to visit are in a direct relationship with the tag we have already searched for. For example, we may need to find the immediate parent tag of a particular tag, or the previous tag, the next tag, or the tags at the same level (siblings). For these cases, searching methods are provided within the BeautifulSoup object, for example, find_parents(), find_next_siblings(), and so on. Normally, we use these methods after a find() or find_all() call, since they find tags in relation to the tag we have already located.
We can find the parent tags associated with a tag by using the find_parents() or find_parent() methods. Like find_all() and find(), they differ only in the number of results they return: the find_parents() method returns all matching parent tags, whereas find_parent() returns the first matching parent. The filters that we discussed for find() can also be used with find_parent() and find_parents().
In the primary consumer example, we can find the parent tags associated with primaryconsumer as follows:
primaryconsumers = soup.find_all(class_="primaryconsumerlist")
primaryconsumer = primaryconsumers[0]
parent_ul = primaryconsumer.find_parents('ul')
print(parent_ul)
#output
[<ul id="primaryconsumers">
  <li class="primaryconsumerlist">
    <div class="name">deer</div>
    <div class="number">1000</div>
  </li>
  <li class="primaryconsumerlist">
    <div class="name">rabbit</div>
    <div class="number">2000</div>
  </li>
</ul>]
The first line stores all the primary consumers in the primaryconsumers variable. We take the first entry and store it in primaryconsumer. We then find all the <ul> tags that are parents of the primaryconsumer entry. From the preceding output, we can see that the result contains the whole structure of the parent, including the tag for which we searched. We can use the find_parent() method to find the immediate parent of the tag:
parent_p = primaryconsumer.find_parent("p")
The preceding code searches for the nearest <p> tag among the ancestors of the <li> tag with the primaryconsumerlist class. Since there is no <p> tag in the ecological pyramid example, the result will be None.
An easy way to get the immediate parent tag for a particular tag is to use the find_parent() method without any parameter, as follows:
immediateprimary_consumer_parent = primaryconsumer.find_parent()

The result will be the same as that of primaryconsumer.find_parent('ul'), since ul is the immediate parent tag.
In an HTML document, we can say that particular tags are siblings if they are at the same level. For example, in the ecological pyramid, all of the <ul> tags are at the same level, and so they are siblings if we define such a relationship. We can understand this from the following diagram, which represents the relationship between the first div tag and the ul tags:
This means that in our example, the ul tags with the IDs producers, primaryconsumers, secondaryconsumers, and tertiaryconsumers are siblings.
Also, in the following diagram for producers, we can see that plants, the first producer, and algae, the second producer, cannot be treated as siblings, since they are not at the same level:
But the div holding the name plants and the div holding the number 100000 can be considered siblings, as they are at the same level.
Beautiful Soup comes with methods to help us find the siblings too.
The find_next_siblings() method allows us to find all the next siblings, whereas find_next_sibling() finds only the next sibling. In the following example, we find the siblings that come after the producers:
producers = soup.find(id='producers')
next_siblings = producers.find_next_siblings()
print(next_siblings)
#output
[<ul id="primaryconsumers">
  <li class="primaryconsumerlist">
    <div class="name">deer</div>
    <div class="number">1000</div>
  </li>
  <li class="primaryconsumerlist">
    <div class="name">rabbit</div>
    <div class="number">2000</div>
  </li>
</ul>, <ul id="secondaryconsumers">
  <li class="secondaryconsumerlist">
    <div class="name">fox</div>
    <div class="number">100</div>
  </li>
  <li class="secondaryconsumerlist">
    <div class="name">bear</div>
    <div class="number">100</div>
  </li>
</ul>, <ul id="tertiaryconsumers">
  <li class="tertiaryconsumerlist">
    <div class="name">lion</div>
    <div class="number">80</div>
  </li>
  <li class="tertiaryconsumerlist">
    <div class="name">tiger</div>
    <div class="number">50</div>
  </li>
</ul>]
So, we find the next siblings for the producer, which are the primary consumers, secondary consumers, and tertiary consumers.
We can use find_previous_siblings() and find_previous_sibling() to find the previous siblings and the previous sibling, respectively.
As with the other find methods, we can pass the different filters, such as text, regular expressions, attribute values, and tag names, to these methods to find the matching siblings.
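For example, to jump straight from the producers to a specific later sibling, we can pass an attribute filter to find_next_sibling(). A sketch (ours, reusing a trimmed-down pyramid structure with the built-in html.parser):

```python
from bs4 import BeautifulSoup

html = """<div>
<ul id="producers"><li>plants</li></ul>
<ul id="primaryconsumers"><li>deer</li></ul>
<ul id="secondaryconsumers"><li>fox</li></ul>
</div>"""
soup = BeautifulSoup(html, "html.parser")

producers = soup.find(id="producers")
# Skip the primary consumers and land on the secondary consumers directly
secondary = producers.find_next_sibling(id="secondaryconsumers")
print(secondary.li.string)  # fox
```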
For every tag, there is a next element, which can be a NavigableString, a Tag object, or any other BeautifulSoup object. By next element, we mean the element that is parsed immediately after the current element; this is different from the next sibling. We have methods to find the objects that come after a particular Tag object: the find_all_next() method finds all the objects coming after the tag, and find_next() finds the first object that comes after it. These methods accept the same filters as find().
For example, we can find all the li tags that come after the first div tag using the following code:
first_div = soup.div
all_li_tags = first_div.find_all_next("li")
print(all_li_tags)
#output
[<li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
</li>, <li class="producerlist">
  <div class="name">algae</div>
  <div class="number">100000</div>
</li>, <li class="primaryconsumerlist">
  <div class="name">deer</div>
  <div class="number">1000</div>
</li>, <li class="primaryconsumerlist">
  <div class="name">rabbit</div>
  <div class="number">2000</div>
</li>, <li class="secondaryconsumerlist">
  <div class="name">fox</div>
  <div class="number">100</div>
</li>, <li class="secondaryconsumerlist">
  <div class="name">bear</div>
  <div class="number">100</div>
</li>, <li class="tertiaryconsumerlist">
  <div class="name">lion</div>
  <div class="number">80</div>
</li>, <li class="tertiaryconsumerlist">
  <div class="name">tiger</div>
  <div class="number">50</div>
</li>]
Searching for previous objects is the opposite of searching for next objects: we find the objects that were parsed before a particular object. We can use the find_all_previous() method to find all the previous objects associated with the current object, and find_previous() to find the single previous object.