Example 3 – extracting XML content

In this example, we will be extracting contents from the sitemap.xml file, which can be downloaded from https://webscraping.com/sitemap.xml:

The sitemap.xml file from https://webscraping.com

By analyzing the XML content, we can see that the URLs exist inside the child <loc> nodes. From these URLs, we will be extracting the following:

Blog titles and category titles. Note that the titles obtained by the code are taken from the URLs themselves, so they are representations of the real content that's available at each URL. The actual titles on the pages might be different.

To begin with, let's import the re Python library and read the file's contents, as well as create a few Python lists in order to collect relevant data:

import re

filename = 'sitemap.xml'
dataSetBlog = [] # collect Blog title information from URLs except 'category'
dataSetBlogURL = [] # collects Blog URLs
dataSetCategory = [] # collect Category title
dataSetCategoryURL = [] # collect Category URLs

page = open(filename, 'r').read()

From the XML content, that is, page, we need to find the URL pattern. The pattern used in the following code matches and returns all of the URLs inside the <loc> nodes. urlPatterns (<class 'list'>) is a Python list object that contains the matched URLs and is iterated to collect and process the desired information:

#Pattern to be searched, found inside <loc>(.*)</loc>
pattern = r"loc>(.*)</loc"

urlPatterns = re.findall(pattern, page) #finding pattern on page
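As a quick sanity check, the pattern can be tried against a short inline XML string (a minimal sketch with a made-up two-entry snippet, not the real sitemap):

```python
import re

# Hypothetical snippet mimicking the structure of sitemap.xml
sample = """<urlset>
<url><loc>https://webscraping.com/blog/Why-Python/</loc></url>
<url><loc>https://webscraping.com/blog/category/python</loc></url>
</urlset>"""

# Same pattern as above: captures everything between loc> and </loc
found = re.findall(r"loc>(.*)</loc", sample)
print(found)
# ['https://webscraping.com/blog/Why-Python/', 'https://webscraping.com/blog/category/python']
```

Because `.` does not match newlines by default, the greedy `(.*)` stays within each line and captures one URL per <loc> node.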

for url in urlPatterns: #iterating individual url inside urlPatterns

Now, let's match a url, such as https://webscraping.com/blog/Google-App-Engine-limitations/, which contains the blog string, and append it to dataSetBlogURL. There are also a few other URLs, such as https://webscraping.com/blog/8/, which will be ignored while we extract blogTitle.

Also, any blogTitle whose text is equal to category will be ignored. The r'blog/([A-Za-z0-9-]+)' pattern matches alphabetic and numeric characters, along with the - character:

    if re.match(r'.*blog', url): #Blog related
        dataSetBlogURL.append(url)
        if re.match(r'[\w-]', url):
            blogTitle = re.findall(r'blog/([A-Za-z0-9-]+)', url)

            if len(blogTitle) > 0 and not re.match('(category)', blogTitle[0]):
                #blogTitle is a List, so index is applied.
                dataSetBlog.append(blogTitle[0])
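The title-filtering logic can also be exercised on its own (a standalone sketch with two sample URLs from the sitemap; the real code runs inside the loop above):

```python
import re

# Illustrative URLs: one real blog post, one category URL to be skipped
samples = [
    'https://webscraping.com/blog/Google-App-Engine-limitations/',
    'https://webscraping.com/blog/category/python',
]

titles = []
for url in samples:
    blogTitle = re.findall(r'blog/([A-Za-z0-9-]+)', url)
    # For the category URL, the capture group matches 'category' itself,
    # which is then filtered out by the re.match() check
    if len(blogTitle) > 0 and not re.match('(category)', blogTitle[0]):
        titles.append(blogTitle[0])

print(titles)  # ['Google-App-Engine-limitations']
```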

Here's the output for dataSetBlogURL:

print("Blogs URL: ", len(dataSetBlogURL))
print(dataSetBlogURL)

Blogs URL: 80
['https://webscraping.com/blog', 'https://webscraping.com/blog/10/',
'https://webscraping.com/blog/11/', .......,
'https://webscraping.com/blog/category/screenshot', 'https://webscraping.com/blog/category/sitescraper', 'https://webscraping.com/blog/category/sqlite', 'https://webscraping.com/blog/category/user-agent', 'https://webscraping.com/blog/category/web2py', 'https://webscraping.com/blog/category/webkit', 'https://webscraping.com/blog/category/website/', 'https://webscraping.com/blog/category/xpath']

dataSetBlog will contain the following titles (the relevant URL portion). The built-in set() function, when applied to dataSetBlog, returns its unique elements. As shown in the following code, there are no duplicate titles inside dataSetBlog:

print("Blogs Title: ", len(dataSetBlog))
print("Unique Blog Count: ", len(set(dataSetBlog)))
print(dataSetBlog)
#print(set(dataSetBlog)) #returns unique element from List similar to dataSetBlog.


Blogs Title: 24
Unique Blog Count: 24


['Android-Apps-Update', 'Apple-Apps-Update', 'Automating-CAPTCHAs', 'Automating-webkit', 'Bitcoin', 'Client-Feedback', 'Fixed-fee-or-hourly', 'Google-Storage', 'Google-interview', 'How-to-use-proxies', 'I-love-AJAX', 'Image-efficiencies', 'Luminati', 'Reverse-Geocode', 'Services', 'Solving-CAPTCHA', 'Startup', 'UPC-Database-Update', 'User-agents', 'Web-Scrapping', 'What-is-CSV', 'What-is-web-scraping', 'Why-Python', 'Why-web']

Now, let's extract the information that's relevant to category URLs. Any url from the iteration that matches the r'.*category' Regex pattern is appended to dataSetCategoryURL. categoryTitle is extracted from any url that matches the r'category/([\w\s-]+)' pattern and is added to dataSetCategory:

    if re.match(r'.*category', url): #Category Related
        dataSetCategoryURL.append(url)
        categoryTitle = re.findall(r'category/([\w\s-]+)', url)
        dataSetCategory.append(categoryTitle[0])

print("Category URL Count: ", len(dataSetCategoryURL))
print(dataSetCategoryURL)

dataSetCategoryURL will result in the following values:

Category URL Count: 43
['https://webscraping.com/blog/category/ajax', 'https://webscraping.com/blog/category/android/', 'https://webscraping.com/blog/category/big picture', 'https://webscraping.com/blog/category/business/', 'https://webscraping.com/blog/category/cache', 'https://webscraping.com/blog/category/captcha', ..................................., 'https://webscraping.com/blog/category/sitescraper', 'https://webscraping.com/blog/category/sqlite', 'https://webscraping.com/blog/category/user-agent', 'https://webscraping.com/blog/category/web2py', 'https://webscraping.com/blog/category/webkit', 'https://webscraping.com/blog/category/website/', 'https://webscraping.com/blog/category/xpath']

Finally, the following output displays the titles that were retrieved in dataSetCategory, as well as their counts:

print("Category Title Count: ", len(dataSetCategory))
print("Unique Category Count: ", len(set(dataSetCategory)))
print(dataSetCategory)
#returns unique element from List similar to dataSetCategory.
#print(set(dataSetCategory))

Category Title Count: 43
Unique Category Count: 43

['ajax', 'android', 'big picture', 'business', 'cache', 'captcha', 'chickenfoot', 'concurrent', 'cookies', 'crawling', 'database', 'efficiency', 'elance', 'example', 'flash', 'freelancing', 'gae', 'google', 'html', 'image', 'ip', 'ir', 'javascript', 'learn', 'linux', 'lxml', 'mobile', 'mobile apps', 'ocr', 'opensource', 'proxies', 'python', 'qt', 'regex', 'scrapy', 'screenshot', 'sitescraper', 'sqlite', 'user-agent', 'web2py', 'webkit', 'website', 'xpath']
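Note that the `[\w\s-]` character class includes whitespace, which is what allows multi-word categories such as big picture to be captured whole. A standalone check with a few sample category URLs:

```python
import re

# Illustrative category URLs, including a multi-word one and a trailing slash
samples = [
    'https://webscraping.com/blog/category/big picture',
    'https://webscraping.com/blog/category/user-agent',
    'https://webscraping.com/blog/category/website/',
]

# \w matches word characters, \s matches the space, - matches hyphens;
# the match stops at the '/' that isn't in the class
categories = [re.findall(r'category/([\w\s-]+)', url)[0] for url in samples]
print(categories)  # ['big picture', 'user-agent', 'website']
```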

From these example cases, we can see that, by using Regex, we can write patterns that target specific data from sources such as web pages, HTML, or XML.

Regex features such as searching, splitting, and iterating can be implemented with the help of various functions from the re Python library. Although Regex can be applied to any type of content, it is preferred for unstructured content. For structured web content, with elements that carry attributes, XPath and CSS selectors are the better choice.
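For comparison, the same <loc> values can be read from well-formed XML without Regex, using Python's standard xml.etree.ElementTree module (a minimal sketch on a made-up snippet; the {*} namespace wildcard requires Python 3.8+ and sidesteps the sitemap's default namespace):

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry snippet standing in for sitemap.xml
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://webscraping.com/blog/Why-Python/</loc></url>
<url><loc>https://webscraping.com/blog/category/python</loc></url>
</urlset>"""

root = ET.fromstring(sample)
# {*}loc matches the loc tag regardless of its namespace
urls = [loc.text for loc in root.findall('.//{*}loc')]
print(urls)
# ['https://webscraping.com/blog/Why-Python/', 'https://webscraping.com/blog/category/python']
```

This approach parses the document's actual structure rather than its text, so it is more robust to formatting changes such as attributes or line breaks inside tags.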
