Chapter 3. Search Using Beautiful Soup

We saw the creation of the BeautifulSoup object and other objects, such as Tag and NavigableString, in Chapter 2, Creating a BeautifulSoup Object. The HTML/XML document is converted into these objects for ease of searching and navigating within the document.

In this chapter, we will learn the different searching methods provided by Beautiful Soup to search based on tag name, attribute values of a tag, text within the document, regular expressions, and so on. At the end, we will make use of these searching methods to scrape data from an online web page.

Searching in Beautiful Soup

Beautiful Soup helps in scraping information from web pages. Useful information is scattered across web pages as text or as attribute values of different tags. In order to scrape such pages, it is necessary to search through the entire page for different tags based on attribute values, tag names, or text within the document. To facilitate this, Beautiful Soup comes with the inbuilt search methods listed as follows:

  • find()
  • find_all()
  • find_parent()
  • find_parents()
  • find_next_sibling()
  • find_next_siblings()
  • find_previous_sibling()
  • find_previous_siblings()
  • find_previous()
  • find_all_previous()
  • find_next()
  • find_all_next()

Searching with find()

In this chapter, we will use the following HTML code for explaining the search using Beautiful Soup. We can save this as an HTML file named ecologicalpyramid.html inside the Soup directory we created in the previous chapter.

<html>
  <body>
  <div class="ecopyramid">
    <ul id="producers">
      <li class="producerlist">
        <div class="name">plants</div>
        <div class="number">100000</div>
      </li>
      <li class="producerlist">
        <div class="name">algae</div>
        <div class="number">100000</div>
      </li>
    </ul>
    <ul id="primaryconsumers">
      <li class="primaryconsumerlist">
        <div class="name">deer</div>
        <div class="number">1000</div>
      </li>
      <li class="primaryconsumerlist">
        <div class="name">rabbit</div>
        <div class="number">2000</div>
      </li>
    </ul>
    <ul id="secondaryconsumers">
      <li class="secondaryconsumerlist">
        <div class="name">fox</div>
        <div class="number">100</div>
      </li>
      <li class="secondaryconsumerlist">
        <div class="name">bear</div>
        <div class="number">100</div>
      </li>
    </ul>
    <ul id="tertiaryconsumers">
      <li class="tertiaryconsumerlist">
        <div class="name">lion</div>
        <div class="number">80</div>
      </li>
      <li class="tertiaryconsumerlist">
        <div class="name">tiger</div>
        <div class="number">50</div>
      </li>
    </ul>
  </div>
  </body>
</html>

The preceding HTML is a simple representation of the ecological pyramid. To find the first producer, primary consumer, or secondary consumer, we can use Beautiful Soup search methods. In general, to find the first entry of any tag within a BeautifulSoup object, we can use the find() method.

Finding the first producer

In the ecological pyramid example, we can easily recognize that the producers are within the first <ul> tag. Since the producers come as the first entry for the <ul> tag within the whole HTML document, it is easy to find the first producer using the find() method. The HTML tree that represents the first producer is shown in the following diagram:

[Figure: HTML tree representing the first producer]

Now, we can change to the Soup directory using the following command:

cd Soup

We can save the following code as ecologicalpyramid.py and use python ecologicalpyramid.py to run it, or we can run the code from Python interpreter. Using the following code, we will create a BeautifulSoup object using the ecologicalpyramid.html file:

 
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "lxml")

producer_entries = soup.find("ul")
print(producer_entries.li.div.string)

#output
plants

Since producers come as the first entry for the <ul> tag, we can use the find() method, which normally searches for only the first occurrence of a particular tag in a BeautifulSoup object. We store this in producer_entries. The next line prints the name of the first producer. From the HTML shown earlier, we can see that the first producer is stored inside the first <div> tag of the first <li> tag that immediately follows the first <ul> tag, as shown in the following code:

<ul id="producers">
  <li class="producerlist">
    <div class="name">plants</div>
    <div class="number">100000</div>
  </li>
</ul>

So, after running the preceding code, we will get plants, which is the first producer, as the output.

Explaining find()

At this point, we know that find() is used to search for the first occurrence of any items within a BeautifulSoup object. The signature of the find() method is as follows:

find(name, attrs, recursive, text, **kwargs)

As the signature implies, the find() method accepts the parameters name, attrs, recursive, text, and **kwargs. The name, attrs, and text parameters act as filters that can be applied to a find() method.

Different filters can be applied on find() for the following cases:

  • Searching a tag, which corresponds to filtering based on the name parameter
  • Searching text, which corresponds to the filtering based on the text parameter
  • Searching based on a regular expression
  • Searching based on attribute values of a tag, which corresponds to the filtering based on the attrs parameter
  • Searching based on functions

Searching for tags

Finding the first producer was an example of a simple filter that can be applied using the find() method. We basically passed the string ul, which is the name of the <ul> tag, to the find() method. Likewise, we can pass any tag name to the find() method to get its first occurrence. In this case, find() returns a Beautiful Soup Tag object. For example, refer to the following code:

tag_li = soup.find("li")
print(type(tag_li))

#output
<class 'bs4.element.Tag'>

The preceding code finds the first occurrence of the li tag within the HTML document and then prints the type of tag_li.

This can also be achieved by passing the name argument as follows:

tag_li = soup.find(name="li")
print(type(tag_li))

#output
<class 'bs4.element.Tag'>

By default, find() returns the first Tag object with a name equal to the string we passed.
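For example, we can verify this by printing the name attribute of the Tag object returned in the previous code:

print(tag_li.name)

#output
li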

Searching for text

If we pass a string to the find() method, it will search for tags with that name by default. But if we want to search only for text within the BeautifulSoup object, we can use the text parameter as follows:

search_for_stringonly = soup.find(text="fox")
print(search_for_stringonly)

#output
fox

The preceding code searches for the occurrence of the text fox within the ecological pyramid. Searching for text using Beautiful Soup is case sensitive; for example, case_sensitive_string = soup.find(text="Fox") will return None.
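If a case-insensitive match is needed, one option is to pass a compiled regular expression with the re.IGNORECASE flag instead of a plain string; regular expressions are covered in the next section. The following is a minimal sketch using the same soup object:

import re

case_insensitive_string = soup.find(text=re.compile("^fox$", re.IGNORECASE))
print(case_insensitive_string)

#output
fox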

Searching based on regular expressions

The find() method can also search based on a regular expression. This comes in handy when the HTML page does not follow a neat pattern like the preceding producer example.

Let us take an example where we are given a page with e-mail IDs, as mentioned in the following code, and we are asked to find the first e-mail ID:

<br/>
<div>The below HTML has the information that has email ids.</div>
  abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>

Here, the e-mail IDs are scattered across the page, with one inside the <div> tag, another inside the <span> tag, and the first one not enclosed by any tag. It is difficult to pick out the first e-mail ID directly. But if we can represent the e-mail ID using a regular expression, find() can search based on that expression to get the first e-mail ID.

So in this case, we just need to form the regular expression for the e-mail ID and pass it to the find() method. The find() method will use the regular expression to find the first matching text.

Let us find the first e-mail ID using the following code:

import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div>
  abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
"""
soup = BeautifulSoup(email_id_example, "lxml")
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)

#output
abc@example.com

In the preceding code, we created the regular expression for the e-mail ID in the emailid_regexp variable. The pattern we used is \w+@\w+\.\w+. Here, \w+ represents one or more alphanumeric characters, followed by @, then one or more alphanumeric characters, a . character (escaped as \.), and again one or more alphanumeric characters. This matches the e-mail IDs in the preceding example. We then passed the emailid_regexp variable to the find() method to find the first text that matches this pattern.

Searching based on attribute values of a tag

We can use find() to search based on the attribute values of a tag. In the previous ecological pyramid example, we can see that the primary consumers are within the <ul> tag with the primaryconsumers ID. In this case, it is easy to use find() by passing the attribute value that we are looking for as an argument.

Finding the first primary consumer

Finding the producer was an easy task, since it was the first entry for the <ul> tag. But what about the first primary consumer? It is not inside the first <ul> tag. By careful analysis, we can see that the primary consumers are inside the second <ul> tag, which has the id="primaryconsumers" attribute. In this case, we can use Beautiful Soup to search based on the attribute value, assuming we have already created the soup object. In the following code, we store the first occurrence of the tag with the id="primaryconsumers" attribute in primary_consumers:

primary_consumers = soup.find(id="primaryconsumers")
print(primary_consumers.li.div.string)

#output
deer

If we analyze the HTML, we can see that the first primary consumer is stored as follows:

<ul id="primaryconsumers">
  <li class="primaryconsumerlist">
    <div class="name">deer</div>
    <div class="number">1000</div>
  </li>
</ul>

We can see that the first primary consumer name is stored inside the first <div> tag of the first <li> tag. The second line of the code prints the string stored inside this <div> tag, which is deer.

Searching based on attribute values will work for most of the attributes, such as id, style, and title. But there are exceptions for a couple of attributes, as follows:

  • Custom attributes
  • The class attribute

In these cases, although we can't go directly with the attribute-value-based search, we can use the attrs argument that can be passed into the find() method.

Searching based on custom attributes

In HTML5, it is possible to add custom attributes, such as data-custom, to a tag. If we want Beautiful Soup to search based on these attributes, it is not possible to use them the way we did with the id attribute.

For searching based on the id attribute, we used the following code line:

soup.find(id="primaryconsumers")

But, if we use the attribute value the same way for the following HTML, the code will throw a SyntaxError: keyword can't be an expression:

customattr = """<p data-custom="custom">custom attribute example</p>"""
customsoup = BeautifulSoup(customattr, 'lxml')
customsoup.find(data-custom="custom")

The error is thrown because Python identifiers cannot contain a - character, and the data-custom keyword that we passed contains one.

In such cases, we need to pass in the keyword arguments as a dictionary in the attrs argument as follows:

using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
print(using_attrs)

#output
<p data-custom="custom">custom attribute example</p>

Searching based on the CSS class

In Python, class is a reserved keyword, so we cannot use it as a keyword argument. For CSS classes, therefore, we can use the same approach as we did for custom attributes, as shown in the following code:

css_class = soup.find(attrs={'class':'primaryconsumerlist'})
print(css_class)

#output
<li class="primaryconsumerlist">
  <div class="name">deer</div>
  <div class="number">1000</div>
</li>

Since searching based on class is a common thing, Beautiful Soup has a special keyword argument that can be passed for matching the CSS class. The keyword argument that we can use is class_ and since this is not a reserved keyword in Python, it won't throw an error.

Line 1:

css_class = soup.find(class_="primaryconsumerlist")

Line 2:

css_class = soup.find(attrs={'class': 'primaryconsumerlist'})

The preceding two code lines are the same.

Searching using functions defined

We can pass a function to the find() method to search based on the conditions defined within the function. The function should return True or False, and the tag for which the function returns True is what find() will return.

Let's take an example of finding the secondary consumers using functions within the find() method:

def is_secondary_consumers(tag):
  return tag.has_attr('id') and tag.get('id') == 'secondaryconsumers'

The function checks whether the tag has the id attribute and whether its value is secondaryconsumers. If both conditions are met, the function returns True, and so we get the particular tag we were looking for in the following code:

secondary_consumer = soup.find(is_secondary_consumers)
print(secondary_consumer.li.div.string)

#output
fox

We use the find() method by passing the function that returns either True or False, and the tag for which the function returns True is returned, which in our case corresponds to the first secondary consumer.

Applying searching methods in combination

We saw how to search based on text, tag, attribute value, regular expression, and so on. Beautiful Soup also helps in searching based on the combination of any of these methods.

In the preceding example, we discussed searching based on the attribute value. It was easy, since the attribute value was present on only one type of tag (for example, the id="secondaryconsumers" value was present only on the <ul> tag).

But, what if there were multiple tags with the same attribute value? For example, refer to the following code:

<p class="identical">
  Example of p tag with class identical
</p>
<div class="identical">
  Example of div tag with class identical
</div>

Here, we have a <div> tag and a <p> tag with the same class attribute value, identical. If we want to search only for the <div> tag with the class value identical, we can use a combination of a tag-based and an attribute-value-based search within the find() method.

Let us see how we can search based on the preceding combination:

identical_div = soup.find("div", class_="identical")
print(identical_div)

#output
<div class="identical">
  Example of div tag with class identical
</div>

Similarly, we can have any combination of the searching methods.
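As another sketch with the same markup, combining the tag name with the attrs dictionary narrows the search to the <p> tag instead (assuming the preceding snippet has been parsed into a soup object):

identical_p = soup.find("p", attrs={'class': 'identical'})
print(identical_p)

#output
<p class="identical">
  Example of p tag with class identical
</p>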

Searching with find_all()

The find() method was used to find the first result matching a search criterion applied on a BeautifulSoup object. As the name implies, find_all() will give us all the items matching that criterion. The different filters that we saw for find() can also be used with find_all(). In fact, these filters can be used with any of the searching methods, such as find_parents() and find_next_siblings().

Let us consider an example of using find_all().

Finding all tertiary consumers

We saw how to find the first producer and the first primary consumer. If we need to find all the tertiary consumers, we can't use find(). In this case, find_all() comes in handy.

all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")

The preceding code line finds all the tags with the tertiaryconsumerlist class. If we do a type check on this variable, we can see that it is a ResultSet, which is nothing but a list of Tag objects, as follows:

print(type(all_tertiaryconsumers))
  
#output
<class 'bs4.element.ResultSet'>

We can iterate through this list to display all tertiary consumer names by using the following code:

for tertiaryconsumer in all_tertiaryconsumers:
  print(tertiaryconsumer.div.string)
#output
lion
tiger

Understanding parameters used with find_all()

Like find(), the find_all() method also has a similar set of parameters with an extra parameter, limit, as shown in the following code line:

find_all(name, attrs, recursive, text, limit, **kwargs)

The limit parameter is used to specify a limit to the number of results that we get. For example, from the e-mail ID sample we saw, we can use find_all() to get all the e-mail IDs. Refer to the following code:

email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)

#output
[u'abc@example.com', u'xyz@example.com', u'foo@example.com']

Here, if we pass limit, it will limit the result set to the limit we impose, as shown in the following example:

email_ids_limited = soup.find_all(text=emailid_regexp, limit=2)
print(email_ids_limited)

#output
[u'abc@example.com', u'xyz@example.com']

From the output, we can see that the result is limited to two.

Tip

The find() method is find_all() with limit=1.
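The following short sketch shows this equivalence on the ecological pyramid soup object:

first_li = soup.find("li")
first_li_limited = soup.find_all("li", limit=1)[0]
print(first_li == first_li_limited)

#output
True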

We can also pass True or False values to the find methods. If we pass True to find_all(), it will return all the tags in the soup object. In the case of find(), it will be the first tag within the object. The print(soup.find_all(True)) line of code will print out all the tags associated with the soup object.
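Since printing every tag in full is verbose, a compact sketch is to list only the tag names (the output shown here is abbreviated):

all_tags = soup.find_all(True)
print([tag.name for tag in all_tags])

#output
['html', 'body', 'div', 'ul', 'li', 'div', 'div', ...]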

In the case of searching for text, passing True will return all text within the document as follows:

all_texts = soup.find_all(text=True)
print(all_texts)

#output
[u'\n', u'\n', u'\n', u'\n', u'\n', u'plants', u'\n', u'100000',
 u'\n', u'\n', u'\n', u'algae', u'\n', u'100000', u'\n', u'\n',
 u'\n', u'\n', u'\n', u'deer', u'\n', u'1000', u'\n', u'\n',
 u'\n', u'rabbit', u'\n', u'2000', u'\n', u'\n', u'\n',
 u'\n', u'\n', u'fox', u'\n', u'100', u'\n', u'\n', u'\n',
 u'bear', u'\n', u'100', u'\n', u'\n', u'\n', u'\n',
 u'\n', u'lion', u'\n', u'80', u'\n', u'\n', u'\n',
 u'tiger', u'\n', u'50', u'\n', u'\n', u'\n', u'\n',
 u'\n']

The preceding output prints every text content within the soup object, including the newline characters.
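If we only want the meaningful strings, one simple sketch is to filter out the whitespace-only entries:

meaningful_texts = [text for text in soup.find_all(text=True) if text.strip()]
print(meaningful_texts)

#output
[u'plants', u'100000', u'algae', u'100000', u'deer', u'1000', u'rabbit', u'2000', u'fox', u'100', u'bear', u'100', u'lion', u'80', u'tiger', u'50']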

Also, in the case of text, we can pass a list of strings and find_all() will find every string defined in the list:

all_texts_in_list = soup.find_all(text=["plants","algae"])
print(all_texts_in_list)

#output
[u'plants', u'algae']

The same applies to searching for tags, attribute values of tags, custom attributes, and CSS classes.

For finding all the div and li tags, we can use the following code line:

div_li_tags = soup.find_all(["div","li"])

Similarly, for finding tags with the producerlist and primaryconsumerlist classes, we can use the following code lines:

all_css_class = soup.find_all(class_=["producerlist","primaryconsumerlist"])
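As a quick sketch of what the preceding line matches, we can print the name entry inside each of the tags found:

for tag in all_css_class:
  print(tag.div.string)

#output
plants
algae
deer
rabbit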

Both find() and find_all() search an object's descendants, that is, its children, their children, and so on. We can control this behavior by using the recursive parameter. If recursive=False, the search happens only among the object's direct children.

For example, in the following code, the search happens only among the direct children for the div and li tags. Since the only direct child of the soup object is the html tag, the following code will give an empty list:

div_li_tags = soup.find_all(["div","li"],recursive=False)
print(div_li_tags)

#output
[]

If find_all() can't find results, it will return an empty list, whereas find() returns None.
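A minimal sketch of handling both cases safely (the nonexistent tag name here is purely illustrative):

missing_tag = soup.find("nonexistent")
if missing_tag is None:
  print("find() returned None")

missing_tags = soup.find_all("nonexistent")
print(missing_tags)

#output
find() returned None
[]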

Searching for Tags in relation

Searching for content within an HTML page is easy using the find() and find_all() methods. During complex web scraping projects, we often need to visit the tags that come before or after a tag we have found, or its parent tags, for extra information.

These tags that we intend to visit are in a direct relationship with the tag we already searched for. For example, we may need to find the immediate parent tag of a particular tag. Also, there will be situations where we need to find the previous tag, the next tag, tags at the same level (siblings), and so on. For these cases, Beautiful Soup provides searching methods such as find_parents(), find_next_siblings(), and so on. Normally, we use these methods after a find() or find_all() call, since those methods find one particular tag and we are interested in finding the other tags that are in relation to it.

Searching for the parent tags

We can find the parent tags associated with a tag by using the find_parents() or find_parent() methods. Like find_all() and find(), they differ only in the number of results they return: find_parents() returns all the matching parent tags, whereas find_parent() returns the closest matching parent. The filters that we discussed for find() can also be used with find_parent() and find_parents().

In the primary consumer example, we can find the parent <ul> tag associated with the first primary consumer entry as follows:

primaryconsumers = soup.find_all(class_="primaryconsumerlist")
primaryconsumer = primaryconsumers[0]
parent_ul = primaryconsumer.find_parents('ul')
print(parent_ul)
#output
[<ul id="primaryconsumers">
  <li class="primaryconsumerlist">
    <div class="name">deer</div>
    <div class="number">1000</div>
  </li>
  <li class="primaryconsumerlist">
    <div class="name">rabbit</div>
    <div class="number">2000</div>
  </li>
</ul>]

The first line stores all the primary consumer entries in the primaryconsumers variable. We take the first entry and store it in primaryconsumer. We then find all the <ul> tags that are parents of this primaryconsumer entry. From the preceding output, we can see that the result contains the whole structure of each parent, which also includes the tag for which we found the parent. We can use the find_parent() method to find only the immediate parent of the tag.

parent_p = primaryconsumer.find_parent("p")

The preceding code searches for the closest parent <p> tag of the <li> tag with the primaryconsumerlist class. Since there is no <p> tag in the ecological pyramid example, it will return None.

An easy way to get the immediate parent tag for a particular tag is to use the find_parent() method without any parameter as follows:

immediate_primary_consumer_parent = primaryconsumer.find_parent()

The result will be the same as that of primaryconsumer.find_parent('ul'), since <ul> is the immediate parent tag.

Searching for siblings

In an HTML document, tags are siblings if they are at the same level and share the same parent. For example, in the ecological pyramid, all of the <ul> tags are siblings, since they share the same parent <div> tag. We can understand this from the following diagram, which represents the relationship between the first <div> tag and the <ul> tags:

[Figure: the <ul> tags as siblings under the first <div> tag]

This means that, in our example, the <ul> tags with the IDs producers, primaryconsumers, secondaryconsumers, and tertiaryconsumers are siblings.

Also, in the following diagram for producers, we can see that the plants entry, which is the first producer, and the algae entry, which is the second producer, cannot be treated as siblings, since their <div> tags do not share the same parent:

[Figure: the producers subtree, showing that the plants and algae entries are not siblings]

But both the <div> tag holding the value plants and the <div> tag holding the number 100000 can be considered siblings, as they are inside the same <li> tag.

Beautiful Soup comes with methods to help us find the siblings too.

The find_next_siblings() method allows us to find all the next siblings, whereas find_next_sibling() finds only the next sibling. In the following example, we find the siblings that come after the producers:

producers = soup.find(id="producers")
next_siblings = producers.find_next_siblings()
print(next_siblings)

#output
[<ul id="primaryconsumers">
  <li class="primaryconsumerlist">
    <div class="name">deer</div>
    <div class="number">1000</div>
  </li>
  <li class="primaryconsumerlist">
    <div class="name">rabbit</div>
    <div class="number">2000</div>
  </li>
</ul>, <ul id="secondaryconsumers">
  <li class="secondaryconsumerlist">
    <div class="name">fox</div>
    <div class="number">100</div>
  </li>
  <li class="secondaryconsumerlist">
    <div class="name">bear</div>
    <div class="number">100</div>
  </li>
</ul>, <ul id="tertiaryconsumers">
  <li class="tertiaryconsumerlist">
    <div class="name">lion</div>
    <div class="number">80</div>
  </li>
  <li class="tertiaryconsumerlist">
    <div class="name">tiger</div>
    <div class="number">50</div>
  </li>
</ul>]

So, we find the next siblings for the producer, which are the primary consumers, secondary consumers, and tertiary consumers.

We can use find_previous_siblings() and find_previous_sibling() to find the previous siblings and previous sibling respectively.

As with the other find methods, we can pass the different filters, such as text, regular expressions, attribute values, and tag names, to these methods to find the siblings accordingly.
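For example, a short sketch that finds the <ul> siblings appearing before the tertiary consumers (find_previous_siblings() returns the nearest sibling first):

tertiary_consumers = soup.find(id="tertiaryconsumers")
for sibling in tertiary_consumers.find_previous_siblings("ul"):
  print(sibling.get("id"))

#output
secondaryconsumers
primaryconsumers
producers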

Searching for next

For every tag, there is a next element, which can be a navigable string, a Tag object, or any other Beautiful Soup object. By next element, we mean the element that is parsed immediately after the current element; this is different from a sibling. We have methods to find the next objects for a particular Tag object: the find_all_next() method helps to find all the objects coming after the tag, and find_next() finds the first object that comes after it. These methods accept the same filters as find().

For example, we can find all the li tags that come after the first div tag using the following code:

first_div = soup.div
all_li_tags = first_div.find_all_next("li")
print(all_li_tags)

#output
[<li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
</li>, <li class="producerlist">
  <div class="name">algae</div>
  <div class="number">100000</div>
</li>, <li class="primaryconsumerlist">
  <div class="name">deer</div>
  <div class="number">1000</div>
</li>, <li class="primaryconsumerlist">
  <div class="name">rabbit</div>
  <div class="number">2000</div>
</li>, <li class="secondaryconsumerlist">
  <div class="name">fox</div>
  <div class="number">100</div>
</li>, <li class="secondaryconsumerlist">
  <div class="name">bear</div>
  <div class="number">100</div>
</li>, <li class="tertiaryconsumerlist">
  <div class="name">lion</div>
  <div class="number">80</div>
</li>, <li class="tertiaryconsumerlist">
  <div class="name">tiger</div>
  <div class="number">50</div>
</li>]
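Similarly, find_next() returns only the first matching object after the tag, as in the following sketch:

first_div = soup.div
print(first_div.find_next("li"))

#output
<li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
</li>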

Searching for previous

Searching for previous is the opposite of searching for next: we can find the objects that come before a particular object. We can use the find_all_previous() method to find all the previous objects associated with the current object and find_previous() to find only the first previous object.
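A minimal sketch, again using the ecological pyramid soup object, would be:

first_secondary_consumer = soup.find(class_="secondaryconsumerlist")

# The nearest <ul> parsed before this <li> is its own parent tag
print(first_secondary_consumer.find_previous("ul").get("id"))

# All <ul> tags parsed before this <li>, nearest first
for ul in first_secondary_consumer.find_all_previous("ul"):
  print(ul.get("id"))

#output
secondaryconsumers
secondaryconsumers
primaryconsumers
producers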
