Searching, traversing, and iterating

Beautiful Soup provides many methods and properties for traversing and searching elements in the parsed tree. Their names generally describe the task they perform. Many of these properties and methods can also be chained together to obtain the same result.

The find() function returns the first element that matches the search criteria. It is useful in a scraping context for locating an element and extracting its details, but it only ever returns a single result. Additional parameters can be passed to find() to identify the exact element, as listed:

  • attrs: A dictionary of attribute key-value pairs
  • text: The text content of the element
  • name: The HTML tag name

Let's call the find() function with the different parameters it allows:

print(soupA.find("a")) #print(soupA.find(name="a"))
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soupA.find("a",attrs={'class':'sister'}))
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soupA.find("a",attrs={'class':'sister'},text="Lacie"))
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

print(soupA.find("a",attrs={'id':'link3'}))
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print(soupA.find('a',id="link2"))
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

Here is a short description of each call in the preceding example:

  • find("a") or find(name="a"): Searches for the HTML <a> element by tag name and returns the first <a> found in soupA
  • find("a",attrs={'class':'sister'}): Searches for an <a> element whose class attribute has the value sister
  • find("a",attrs={'class':'sister'}, text="Lacie"): Searches for an <a> element whose class attribute is sister and whose text is Lacie
  • find("a",attrs={'id':'link3'}): Searches for an <a> element whose id attribute has the value link3
  • find("a",id="link2"): Searches for an <a> element whose id attribute, passed as a keyword argument, has the value link2
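One detail the calls above do not show is what happens when nothing matches: find() returns None rather than raising an error. The following is a minimal sketch under stated assumptions: html_a is a hypothetical stand-in for the markup behind soupA, and Python's built-in "html.parser" is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for soupA: three <a> tags like the ones shown above.
html_a = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soupA = BeautifulSoup(html_a, "html.parser")  # assumption: built-in parser

first_link = soupA.find("a", attrs={"class": "sister"})
missing = soupA.find("a", id="link9")  # no element has this id

print(first_link["id"])  # link1
print(missing)           # None

# Guard before reading attributes; indexing None would raise a TypeError.
href = missing["href"] if missing is not None else None
print(href)              # None
```

Checking for None before reading attributes keeps a scraper from crashing when a page is missing an expected element.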

The find_all() function works in a similar way to find() and accepts the same name, attrs, and text parameters, but it returns a list of all the elements that match the provided criteria, as follows:

#find all <a> can also be written as #print(soupA.find_all(name="a"))
print(soupA.find_all("a"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#find all <a>, but return only 2 of them
print(soupA.find_all("a",limit=2)) #attrs, text

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The additional limit parameter, which accepts a numeric value, caps the number of elements that find_all() returns.
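To make the relationship between limit, find_all(), and find() concrete, here is a small sketch. It assumes a hypothetical html_a string standing in for the markup behind soupA and uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the markup behind soupA.
html_a = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soupA = BeautifulSoup(html_a, "html.parser")  # assumption: built-in parser

two_links = soupA.find_all("a", limit=2)
one_link = soupA.find_all("a", limit=1)
no_tables = soupA.find_all("table")  # nothing matches

print(len(two_links))                  # 2
print(one_link[0] == soupA.find("a"))  # True: same first element
print(len(no_tables))                  # 0: an empty list, not None
```

Unlike find(), find_all() gives back an empty list when nothing matches, so its result can be looped over without a None check.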

A string, a list of strings, or a regular expression object can be provided as the value for the name and text parameters and for the attribute values in attrs, as seen in the following snippet:

print(soupA.find("a",text=re.compile(r'cie'))) #import re
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

print(soupA.find_all("a",attrs={'id':re.compile(r'3')}))
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soupA.find_all(re.compile(r'a')))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
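Regular expressions passed to find_all() are applied with the pattern's search() method, so an unanchored pattern such as r'a' matches any tag name that contains the letter a. The sketch below uses a hypothetical fragment (the <div>, <span>, and <b> tags are made up for illustration) and the built-in "html.parser" to show the difference that anchoring makes.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with several tag names, for illustration only.
html = (
    '<div><a id="link3" href="http://example.com/tillie">Tillie</a>'
    "<span>note</span><b>bold</b></div>"
)
sample = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

# Unanchored: matches every tag whose name contains "a".
loose = [tag.name for tag in sample.find_all(re.compile(r"a"))]
# Anchored: matches only tags named exactly "a".
exact = [tag.name for tag in sample.find_all(re.compile(r"^a$"))]
# The same kind of pattern can filter attribute values.
by_id = sample.find_all("a", attrs={"id": re.compile(r"3")})

print(loose)       # ['a', 'span']
print(exact)       # ['a']
print(len(by_id))  # 1
```

On a full page, anchoring the pattern avoids picking up unrelated tags such as <span>, <table>, or <head>.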

The find_all() function has built-in support for matching on the class attribute: a string passed as the second argument, after the tag name, is treated as a class name, as seen in the following:

soup = BeautifulSoup(html_doc,'lxml')

print(soup.find_all("p","story")) #class=story
[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

print(soup.find_all("p","title")) #soup.find_all("p",attrs={'class':"title"})
[<p class="title"><b>The Dormouse's story</b></p>]
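Because class is a reserved word in Python, Beautiful Soup also accepts a class_ keyword argument for the same filter. The following sketch compares the three spellings on a shortened, hypothetical version of html_doc, parsed with the built-in "html.parser" rather than lxml.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of html_doc.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")  # assumption: built-in parser

by_position = soup.find_all("p", "story")                # second argument
by_keyword = soup.find_all("p", class_="story")          # class_ keyword
by_attrs = soup.find_all("p", attrs={"class": "story"})  # attrs dictionary

print(len(by_position), len(by_keyword), len(by_attrs))  # 2 2 2
print(by_position == by_keyword == by_attrs)             # True
```

All three forms select the same elements, so the choice between them is mostly a matter of readability.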

Multiple name and attrs values can also be passed as a list, as shown in the following syntax:

  • soup.find_all("p",attrs={'class':["title","story"]}): Finds all the <p> elements whose class attribute is title or story
  • soup.find_all(["p","li"]): Finds all the <p> and <li> elements in the soup object

The preceding syntax can be observed in the following code:

print(soup.find_all("p",attrs={'class':["title","story"]}))
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a...
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,....
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

print(soup.find_all(["p","li"]))
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once...<a class="sister" href="http://example.com/elsie"....,
<p class="story">...</p>,
<li data-id="10784">Jason Walters, 003:....</li>,<li.....,
<li data-id="45732">James Bond, 007: The main man; shaken but not stirred.</li>]

We can also search by element text and list the matching content. The string parameter, which behaves like the text parameter, is used for such cases; it can be used with or without a tag name, as in the following code:

print(soup.find_all(string="Elsie")) #text="Elsie"
['Elsie']

print(soup.find_all(text=re.compile(r'Elsie'))) #import re
['Elsie']

print(soup.find_all("a",string="Lacie")) #text="Lacie"
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
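The outputs above differ in kind: searching by string alone returns the matching text nodes, while adding a tag name returns the tags that contain the text. Here is a brief sketch of that difference, using a hypothetical two-link fragment and the built-in "html.parser".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two of the sample links.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
sample = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

strings_only = sample.find_all(string=re.compile(r"cie"))
tags = sample.find_all("a", string=re.compile(r"cie"))

print(type(strings_only[0]).__name__)  # NavigableString
print(strings_only[0].parent["id"])    # link2, reached through .parent
print(tags[0]["id"])                   # link2, a Tag with attributes
```

When the attributes of the enclosing element are needed, passing the tag name avoids the extra step through .parent.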

Iteration through elements can also be achieved using the find_all() function. In the following code, we retrieve all of the <li> elements found inside the <ul> element and print their tag name, data-id attribute value, and text:

for li in soup.ul.find_all('li'):
    print(li.name, ' > ',li.get('data-id'),' > ', li.text)

li > 10784 > Jason Walters, 003: Found dead in "A View to a Kill".
li > 97865 > Alex Trevelyan, 006: Agent turned terrorist leader; James' nemesis in "Goldeneye".
li > 45732 > James Bond, 007: The main man; shaken but not stirred.

An element's attribute value can be retrieved using the get() function, as seen in the preceding code. The presence of an attribute can be checked using the has_attr() function.
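As a rough illustration of get() and has_attr() working together, the sketch below uses a hypothetical list in which the second <li> (an invented entry) has no data-id attribute. The built-in "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical list: the second item is invented and lacks data-id.
html = """
<ul>
<li data-id="10784">Jason Walters, 003</li>
<li>Unnumbered agent</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

items = soup.ul.find_all("li")

# has_attr() reports whether the attribute is present at all.
flags = [li.has_attr("data-id") for li in items]
# get() returns None, or a supplied default, when the attribute is missing.
rows = [(li.get("data-id", "n/a"), li.text) for li in items]

print(flags)  # [True, False]
print(rows)   # [('10784', 'Jason Walters, 003'), ('n/a', 'Unnumbered agent')]
```

Dictionary-style access such as li['data-id'] raises a KeyError when the attribute is absent, which is why get() is the safer choice inside a loop.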

Elements can also be traversed using just the tag name as an attribute, with or without the find() or find_all() functions, as seen in the following code:

print(soupA.a) #tag a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.li) #tag li
<li data-id="10784">Jason Walters, 003: Found dead in "A View to a Kill".</li>

print(soup.p)
<p class="title"><b>The Dormouse's story</b></p>

print(soup.p.b) #tag p and b
<b>The Dormouse's story</b>

print(soup.ul.find('li',attrs={'data-id':'45732'}))
<li data-id="45732">James Bond, 007: The main man; shaken but not stirred.</li>
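Tag-name access is a shortcut for find(), which also means it yields None when the tag does not exist. The following sketch, based on a shortened and hypothetical html_doc parsed with the built-in "html.parser", shows both points.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of html_doc.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<ul>
<li data-id="10784">Jason Walters, 003: Found dead in "A View to a Kill".</li>
<li data-id="45732">James Bond, 007: The main man; shaken but not stirred.</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")  # assumption: built-in parser

# Dotted access and chained find() calls reach the same element.
same_b = soup.p.b == soup.find("p").find("b")
# A tag that is not in the document comes back as None.
missing = soup.table
# Dotted access can be mixed with find() to narrow the search.
bond = soup.ul.find("li", attrs={"data-id": "45732"})

print(same_b)           # True
print(missing)          # None
print(bond["data-id"])  # 45732
```

Since a missing tag yields None, a longer chain such as soup.table.tr would raise an AttributeError, so it is worth checking intermediate results on pages whose structure may vary.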

The text and string attributes or the get_text() method can be used on an element to extract its text while traversing, as in the following code. These are distinct from the text and string parameters of the find() and find_all() functions, which are used to search by content:

print(soup.ul.find('li',attrs={'data-id':'45732'}).text)
James Bond, 007: The main man; shaken but not stirred.

print(soup.p.text) #get_text()
The Dormouse's story

print(soup.li.text)
Jason Walters, 003: Found dead in "A View to a Kill".

print(soup.p.string)
The Dormouse's story
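The text and string attributes give the same result above only because that <p> wraps a single piece of text. As a hedged sketch of where they diverge, the snippet below adds a hypothetical paragraph with mixed children and parses it with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <p> mixes text nodes and <a> tags.
html = (
    "<p class=\"title\"><b>The Dormouse's story</b></p>"
    '<p class="story">Once <a>Elsie</a> and <a>Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

title, story = soup.find_all("p")

print(title.string)  # The Dormouse's story
print(story.string)  # None: more than one child, so no single string
print(story.text)    # Once Elsie and Lacie

# get_text() can also take a separator and strip surrounding whitespace.
print(story.get_text("|", strip=True))  # Once|Elsie|and|Lacie
```

In short, string is only populated when an element has exactly one text-bearing child, whereas text and get_text() concatenate everything beneath the element.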

In this section, we explored searching and traversing elements using important functions such as find() and find_all(), together with their parameters and criteria.

In the next sections, we will explore elements based on their positions in the parsed tree. 
