Example 1 – extracting HTML-based content

In this example, we will use the HTML content from the regexHTML.html file and apply Regex patterns to extract information such as the following:

  • HTML elements
  • The element's attributes (keys and values)
  • The element's content

This example will provide you with a general overview of how we can deal with various elements, values, and so on that exist inside web content and how we can apply Regex to extract that content. The steps we will be applying in the following code will be helpful for processing HTML and similar content:

<html>
<head>
<title>Welcome to Web Scraping: Example</title>
<style type="text/css">
....
</style>
</head>
<body>
<h1 style="color:orange;">Welcome to Web Scraping</h1>
Links:
<a href="https://www.google.com" style="color:red;">Google</a>
<a class="classOne" href="https://www.yahoo.com">Yahoo</a>
<a id="idOne" href="https://www.wikipedia.org" style="color:blue;">Wikipedia</a>
<div>
<p id="mainContent" class="content">
<i>Paragraph contents</i>
<img src="mylogo.png" id="pageLogo" class="logo"/>
</p>
<p class="content" id="subContent">
<i style="color:red">Sub paragraph content</i>
<h1 itemprop="subheading">Sub heading Content!</h1>
</p>
</div>
</body>
</html>

The preceding code is the HTML page source we will be using. The content here is structured, and there are numerous ways that we can deal with it.

In the following code, we will be using functions such as the following:

  • read_file(): This will read the HTML file and return the page source for further processing. 
  • applyPattern(): This accepts a pattern argument (the Regex pattern for finding content), applies it to the HTML source using re.findall(), and prints the list of elements found along with their count.

To begin with, let's import re and bs4:

import re
from bs4 import BeautifulSoup

def read_file():
    '''Read and return content from file (.html).'''
    with open("regexHTML.html", "r") as content:  # file is closed automatically
        pageSource = content.read()
    return pageSource

def applyPattern(pattern):
    '''Apply the Regex pattern provided to the page source and print the count and contents found.'''
    elements = re.findall(pattern, page)  # apply pattern to source
    print("Pattern r'{}' ,Found total: {}".format(pattern, len(elements)))
    print(elements)  # print all found tags
    return

if __name__ == "__main__":
    page = read_file()  # read HTML file

Here, page is the HTML page source that's read from the HTML file using read_file(). We have also imported BeautifulSoup in the preceding code to extract individual HTML tag names and to compare the results obtained by using soup.find_all() with those from the Regex patterns we will be applying:

soup = BeautifulSoup(page, 'lxml')
print([element.name for element in soup.find_all()])
['html', 'head', 'title', 'style', 'body', 'h1', 'a', 'a', 'a', 'div', 'p', 'i', 'img', 'p', 'i', 'h1']

To find all of the HTML tags that exist inside page, we used the find_all() method on soup, a BeautifulSoup object created with the lxml parser.

For more information on Beautiful Soup, please visit Chapter 5, Web Scraping using Scrapy and Beautiful Soup, the Web scraping using Beautiful Soup section.

Here, we are finding all HTML tag names that don't have any attributes. \w+ matches one or more word characters:

applyPattern(r'<(\w+)>') #Finding Elements without attributes

Pattern r'<(\w+)>' ,Found total: 6
['html', 'head', 'title', 'body', 'div', 'i']

HTML tags or elements whose names aren't followed immediately by > contain attributes; these can be found with the help of the whitespace character, that is, \s:

applyPattern(r'<(\w+)\s') #Finding Elements with attributes

Pattern r'<(\w+)\s' ,Found total: 10
['style', 'h1', 'a', 'a', 'a', 'p', 'img', 'p', 'i', 'h1']

Now, by combining all of these patterns, we are listing all HTML tags that were found in the page source. The same result was also obtained in the previous code by using soup.find_all() and the name attribute:

applyPattern(r'<(\w+)\s?') #Finding all HTML elements

Pattern r'<(\w+)\s?' ,Found total: 16
['html', 'head', 'title', 'style', 'body', 'h1', 'a', 'a', 'a', 'div', 'p', 'i', 'img', 'p', 'i', 'h1']

Let's find the attribute names found in the HTML elements:

applyPattern(r'<\w+\s+(.*?)=') #Finding attributes name

Pattern r'<\w+\s+(.*?)=' ,Found total: 10
['type', 'style', 'href', 'class', 'id', 'id', 'src', 'class', 'style', 'itemprop']

As we can see, there were only 10 attributes listed. In the HTML source, a few tags contain more than one attribute, such as <a href="https://www.google.com" style="color:red;">Google</a>, and only the first attribute was found using the provided pattern. 
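The behavior is easy to verify on a single tag taken from the page source. In the following sketch, the pattern is anchored at the opening <, so re.findall() captures only the first attribute of each tag:

```python
import re

# a single tag copied from the page source above
tag = '<a href="https://www.google.com" style="color:red;">Google</a>'

# the pattern must start at '<', so after matching 'href' the scan resumes
# past the '=' and never anchors again inside the same tag
print(re.findall(r'<\w+\s+(.*?)=', tag))  # ['href'] -- style= is never reached
```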

Let's rectify this. We can select words with the = character after them by using the r'(\w+)=' pattern, which will result in all of the attributes found in the page source being returned:

applyPattern(r'(\w+)=') #Finding names of all attributes

Pattern r'(\w+)=' ,Found total: 18
['type', 'style', 'href', 'style', 'class', 'href', 'id', 'href', 'style', 'id', 'class', 'src', 'id', 'class', 'class', 'id', 'style', 'itemprop']

Similarly, let's find all of the values of the attributes we've found. The following code lists the attribute values. Compared to the 18 attributes listed previously, only 9 values are found here: the pattern r'="(\w+)"' matches word characters only, and some of the attribute values contain non-word characters, as in <a href="https://www.google.com" style="color:red;">:

applyPattern(r'="(w+)"')

Pattern r'="(w+)"' ,Found total: 9
['classOne', 'idOne', 'mainContent', 'content', 'pageLogo', 'logo', 'content', 'subContent', 'subheading']

The missing attribute values contain non-word characters such as the ;, /, :, and . characters. In Regex, we can include such characters in the pattern individually, but this approach may not be appropriate in all cases.
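As a sketch of the character-by-character approach, we can list the extra characters explicitly in a character class. The class used here, [\w:;/.], is only an illustration chosen for this particular tag; any value containing a character outside the class would still be missed:

```python
import re

tag = '<a href="https://www.google.com" style="color:red;">Google</a>'

# illustrative character class: word characters plus :, ;, /, and .
# brittle by design -- it only covers the characters we happened to spot
found = re.findall(r'="([\w:;/.]+)"', tag)
print(found)  # ['https://www.google.com', 'color:red;']
```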

In this case, a pattern that combines \w with the non-whitespace character, \S, fits perfectly, that is, r'="([\w\S]+)"':

applyPattern(r'="([wS]+)"')

Pattern r'="([wS]+)"' ,Found total: 18
['text/css', 'color:orange;', 'https://www.google.com', 'color:red;', 'classOne', 'https://www.yahoo.com', 'idOne', 'https://www.wikipedia.org', 'color:blue;', 'mainContent', 'content', 'mylogo.png', 'pageLogo', 'logo', 'content', 'subContent', 'color:red', 'subheading']

Finally, let's collect all of the text inside the HTML elements that are found in-between the opening and closing HTML tags:

applyPattern(r'>(.*)<')
Pattern r'>(.*)<' ,Found total: 8
['Welcome to Web Scraping: Example', 'Welcome to Web Scraping', 'Google', 'Yahoo', 'Wikipedia', 'Paragraph contents', 'Sub paragraph content', 'Sub heading Content!']
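Note that the r'>(.*)<' pattern relies on the page source having one tag pair per line: . doesn't match newlines, and the greedy .* spans from the first > to the last < on a line. The following sketch contrasts the greedy and non-greedy forms on a single line containing several tags:

```python
import re

line = '<p><i>one</i> and <i>two</i></p>'

# greedy .* runs from the first '>' to the last '<' on the line,
# so nested tags end up inside the captured text
greedy = re.findall(r'>(.*)<', line)
print(greedy)  # ['<i>one</i> and <i>two</i>']

# the non-greedy form stops at the nearest '<' instead,
# capturing each stretch of text (including empty ones) separately
lazy = re.findall(r'>(.*?)<', line)
print(lazy)    # ['', 'one', ' and ', 'two', '']
```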

While applying Regex to content, preliminary analysis of the type of content and the values to be extracted is essential. This helps us obtain the required results in a single attempt.
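As a rough sketch of such an analysis, we can survey a few candidate patterns against a small sample of the content before committing to one. The survey_patterns() helper below is hypothetical and not part of the preceding code:

```python
import re

def survey_patterns(source, patterns):
    """Apply each candidate pattern to the source and collect its matches."""
    return {pattern: re.findall(pattern, source) for pattern in patterns}

# a single tag from the page source serves as the sample
sample = '<a href="https://www.wikipedia.org" style="color:blue;">Wikipedia</a>'

for pattern, found in survey_patterns(
        sample, [r'<(\w+)\s', r'(\w+)=', r'="([\w\S]+)"']).items():
    print("r'{}' matched {}: {}".format(pattern, len(found), found))
```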
