Exploring Beautiful Soup

The Python bs4 library contains a BeautifulSoup class, which is used for parsing. For more details on Beautiful Soup and on installing the library, please refer to the official documentation at https://www.crummy.com/software/BeautifulSoup/. On successful installation of the library, we can obtain its details from a Python IDE, as shown in the following screenshot:

Successful installation of bs4 with details
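
The installation can also be verified directly from a Python shell by importing the package and printing its version and location. A minimal check along these lines (the output will vary with your installed version and environment):

import bs4

# Print the installed Beautiful Soup version and where the package lives.
print(bs4.__version__)
print(bs4.__file__)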

As seen in the preceding screenshot, the library offers a collection of simply named, self-explanatory methods, along with encoding support, which makes it popular among developers.

Let's import BeautifulSoup and SoupStrainer from bs4, as seen in the following code:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

We will be using the HTML shown in the following snippet, referred to as html_doc, as a sample to explore some of the fundamental features of Beautiful Soup. In real scraping cases, the response obtained for any chosen URL using requests or urllib can be used as the content instead; a requests-based sketch follows the snippet:

html_doc="""<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<h1>Secret agents</h1>
<ul>
<li data-id="10784">Jason Walters, 003: Found dead in "A View to a Kill".</li>
<li data-id="97865">Alex Trevelyan, 006: Agent turned terrorist leader; James' nemesis in "Goldeneye".</li>
<li data-id="45732">James Bond, 007: The main man; shaken but not stirred.</li>
</ul>
</body>
</html>"""
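
As mentioned before the snippet, in real scraping cases the markup usually comes from an HTTP response rather than a literal string. The following is a minimal sketch using requests; it assumes the requests library is installed, and http://example.com is used purely as a placeholder URL:

import requests
from bs4 import BeautifulSoup

# Fetch a page and hand its markup to Beautiful Soup.
response = requests.get('http://example.com')
soup = BeautifulSoup(response.text, 'lxml')  # response.content (bytes) also works
print(soup.title)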

To parse markup and access Beautiful Soup methods and properties, a Beautiful Soup object, generally known as a soup object, must be created. Depending on the type of string or markup content provided to the constructor, a few examples of creating Beautiful Soup objects, using the parameters explained after this list, are given next:

  • soup = BeautifulSoup(html_markup)
  • soup = BeautifulSoup(html_markup, 'lxml')
  • soup = BeautifulSoup(html_markup, 'lxml', parse_only=SoupStrainer("a"))
  • soup = BeautifulSoup(html_markup, 'html.parser')
  • soup = BeautifulSoup(html_markup, 'html5lib')
  • soup = BeautifulSoup(xml_markup, 'xml')
  • soup = BeautifulSoup(some_markup, from_encoding='ISO-8859-8')
  • soup = BeautifulSoup(some_markup, exclude_encodings=['ISO-8859-7'])

The Beautiful Soup constructor plays an important part, and we will explore some of its important parameters here:

  • markup: The first parameter passed to the constructor; it accepts a string or an open filehandle containing the markup to be parsed.
  • features: The name of the parser or the type of markup to be used. The parser can be lxml, lxml-xml, html.parser, or html5lib. Similarly, the markup types that can be used are html, html5, and xml. Different types of supported parsers can be used with Beautiful Soup. If we just want to parse some HTML, we can simply pass the markup to Beautiful Soup, and it will use whichever appropriate parser is installed. For more information on parsers and their installation, please visit installing a parser at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser.
  • parse_only: Accepts a bs4.SoupStrainer object; that is, only the parts of the document matching the SoupStrainer object will be parsed. This is quite useful in scraping when only a part of the document needs to be parsed, considering code efficiency and memory-related issues. For more information on SoupStrainer, please visit parsing only part of a document at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document.
  • from_encoding: A string indicating the encoding to be used when parsing the markup. This is usually provided if Beautiful Soup has detected the wrong encoding.
  • exclude_encodings: A list of strings indicating encodings known to be wrong, so that Beautiful Soup does not use them (see the sketch after this list).
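
The two encoding-related parameters are easiest to see in action on a byte string whose encoding Beautiful Soup might misdetect. The following is a minimal sketch; the Latin-1 byte string is illustrative and not part of the sample document, and soup.original_encoding reports the encoding that was ultimately used:

from bs4 import BeautifulSoup

# An illustrative byte string encoded as ISO-8859-1 (Latin-1).
markup_bytes = '<html><body><p>café</p></body></html>'.encode('ISO-8859-1')

# Let Beautiful Soup detect the encoding on its own.
soup_guess = BeautifulSoup(markup_bytes, 'html.parser')
print(soup_guess.original_encoding)

# Force the correct encoding when detection goes wrong.
soup_forced = BeautifulSoup(markup_bytes, 'html.parser', from_encoding='ISO-8859-1')
print(soup_forced.p.text, soup_forced.original_encoding)

# Or rule out encodings known to be wrong and let detection run again.
soup_excluded = BeautifulSoup(markup_bytes, 'html.parser', exclude_encodings=['utf-8'])
print(soup_excluded.original_encoding)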

Response time is a considerable factor when using Beautiful Soup. Because Beautiful Soup relies on a parser (lxml, html.parser, or html5lib), there is always a concern about the extra time consumed during parsing.

Explicitly naming a parser is always recommended to obtain similar results across platforms and systems. Also, to speed up parsing, it is recommended to use lxml as the parser with Beautiful Soup.
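
To see why pinning a parser matters, consider how different parsers repair the same piece of invalid markup. The following short sketch assumes that both lxml and html5lib are installed; the outputs in the comments are indicative, based on the kind of comparison shown in the official documentation:

from bs4 import BeautifulSoup

broken = '<a></p>'  # an invalid HTML fragment

# Each parser repairs the fragment in its own way, which is why naming
# one parser explicitly gives consistent results across systems.
print(BeautifulSoup(broken, 'lxml'))         # e.g. <html><body><a></a></body></html>
print(BeautifulSoup(broken, 'html.parser'))  # e.g. <a></a>
print(BeautifulSoup(broken, 'html5lib'))     # e.g. <html><head></head><body><a><p></p></a></body></html>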

For this particular case, we will create the soupA object using lxml as the parser, along with the SoupStrainer object tagsA (parsing only <a>, that is, the anchor elements of the HTML). We can obtain partial content to parse by using SoupStrainer, which is very useful when dealing with heavy content.

soupA, the Beautiful Soup object, contains all of the <a> elements found for the SoupStrainer object tagsA, as used in the following code. As seen in the output, only <a> tags have been collected; in other words, the parsed document is restricted to the SoupStrainer object and parsed using lxml:

tagsA = SoupStrainer("a")
soupA = BeautifulSoup(html_doc,'lxml',parse_only=tagsA)

print(type(soupA))
<class 'bs4.BeautifulSoup'>

print(soupA)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a><a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

HTML content retrieved from a website might not always be formatted as a clean string. It would be difficult and time-consuming to read page content that is presented as one long paragraph rather than as line-by-line code.

The Beautiful Soup prettify() function returns a Unicode string and presents it in a clean, formatted structure that is easy to read, identifying the elements in a tree structure, as seen in the following code; the prettify() function also accepts an encoding parameter:

print(soupA.prettify())

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
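
As mentioned earlier, prettify() also accepts an encoding parameter; when one is supplied, the result is returned as an encoded bytestring rather than a Unicode string:

# With an encoding argument, prettify() returns bytes instead of str.
pretty_bytes = soupA.prettify(encoding='utf-8')
print(type(pretty_bytes))  # <class 'bytes'>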

Document-based elements (such as HTML tags) in a parsed tree can have various attributes with predefined values. Element attributes are important resources as they provide identification and content together within the element. Verifying whether the element contains certain attributes can be handy when traversing through the tree.

For example, as seen in the following snippet, the HTML <a> element contains the class, href, and id attributes, each carrying predefined values:

<a class="sister" href="http://example.com/lacie" id="link2">

The has_attr() function from Beautiful Soup returns a Boolean response for the searched attribute name on the chosen element. For the <a> element in the following code, it:

  • Returns False for the name attribute
  • Returns True for the class attribute

We can use the has_attr() function to confirm, by name, whether an attribute key exists inside the parsed document, as follows:

print(soupA.a.has_attr('class'))
True

print(soupA.a.has_attr('name'))
False
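
has_attr() only reports whether an attribute exists; the attribute values themselves can be read using dictionary-style access on the tag, or through its attrs dictionary, as sketched here with the same soupA object:

print(soupA.a.attrs)        # all attributes of the first <a> element, as a dictionary
print(soupA.a['href'])      # http://example.com/elsie
print(soupA.a.get('name'))  # None; get() avoids a KeyError for a missing attribute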

With this basic introduction to Beautiful Soup and the few methods explored in this section, we will now move on to searching, traversing, and iterating through the parsed tree, looking for elements and their content, in the upcoming section.
