Case 1 – using the HTML parser

Here, we will use the lxml.html parser to parse xmlFile by passing the parser='html' argument to PyQuery:

# creating PyQuery object using parser 'html'
urlHTML = pq(xmlFile, parser='html')

print("Children Length: ", len(urlHTML.children()))
print("First Children: ", urlHTML.children().eq(0))
print("Inner Child/First Children: ", urlHTML.children().children().eq(0))

The urlHTML object lets us check the number of child elements and inspect the children that were obtained from the data, as shown in the following output:

Children Length: 137

First Children: 
<url>
<loc>https://webscraping.com</loc>
</url>

Inner Child/First Children: <loc>https://webscraping.com</loc>

As we can see, urlHTML.children() contains the elements we need to look up the URLs. We can process this data with the items() method, which iterates over each matched element. Let's create dataSet, a Python list to which the extracted URLs will be appended.

Element-based iteration can be performed with urlHTML.children().find('loc:contains("blog")').items(), using a selector that matches only the loc elements whose text contains the string blog:

dataSet = list()
for url in urlHTML.children().find('loc:contains("blog")').items():
    dataSet.append(url.text())

print("Length of dataSet: ", len(dataSet))
print(dataSet)

Finally, we will receive the following output:

Length of dataSet: 131

['https://webscraping.com/blog', 'https://webscraping.com/blog/10/', 'https://webscraping.com/blog/11/', 'https://webscraping.com/blog/12/', 'https://webscraping.com/blog/13/', 'https://webscraping.com/blog/2/'
,.................................................................................,
'https://webscraping.com/blog/Reverse-Geocode/', 'https://webscraping.com/blog/Scraping-Flash-based-websites/', 'https://webscraping.com/blog/Scraping-JavaScript-based-web-pages-with-Chickenfoot/', 'https://webscraping.com/blog/category/web2py', 'https://webscraping.com/blog/category/webkit', 'https://webscraping.com/blog/category/website/', 'https://webscraping.com/blog/category/xpath']
