Case 1 – using the HTML parser

Here, we use the lxml.html parser to parse xmlFile by passing the parser='html' argument to PyQuery:

# creating PyQuery object using parser 'html'
urlHTML = pq(xmlFile, parser='html')

print("Children Length: ", len(urlHTML.children()))
print("First Children: ",urlHTML.children().eq(0))
print("Inner Child/First Children: ",urlHTML.children().children().eq(0))

The urlHTML PyQuery object lets us check how many child elements were obtained from the data and inspect them, as shown in the following output:

Children Length: 137

First Children:
<url>

<loc>https://webscraping.com</loc>
</url>

Inner Child/First Children: <loc>https://webscraping.com</loc>

As we can see, urlHTML.children() contains the elements we need to find the URLs. We can process this data with the items() method, which iterates over the matched elements and yields each one as a PyQuery object. Let's create dataSet, a Python list, to which the extracted URLs will be appended.

We can iterate element by element with urlHTML.children().find('loc:contains("blog")').items(), where the :contains("blog") selector matches only the <loc> elements whose text contains the string blog:

dataSet = list()
for url in urlHTML.children().find('loc:contains("blog")').items():
    dataSet.append(url.text())

print("Length of dataSet: ", len(dataSet))
print(dataSet)

Finally, we will receive the following output:

Length of dataSet: 131

['https://webscraping.com/blog', 'https://webscraping.com/blog/10/', 'https://webscraping.com/blog/11/', 'https://webscraping.com/blog/12/', 'https://webscraping.com/blog/13/', 'https://webscraping.com/blog/2/'
,.................................................................................,
'https://webscraping.com/blog/Reverse-Geocode/', 'https://webscraping.com/blog/Scraping-Flash-based-websites/', 'https://webscraping.com/blog/Scraping-JavaScript-based-web-pages-with-Chickenfoot/', 'https://webscraping.com/blog/category/web2py', 'https://webscraping.com/blog/category/webkit', 'https://webscraping.com/blog/category/website/', 'https://webscraping.com/blog/category/xpath']