Example 2 – looping with XPath and scraping data from multiple pages

In example 1, we tried the simple XPath-based technique on a URL with a limited number of results on a single page. In this example, we target the food and drink category, http://books.toscrape.com/catalogue/category/books/food-and-drink_33/index.html, whose content is spread across multiple pages. An XPath-based looping operation is used here, which makes collecting the data from every listing more effective.

As we will be dealing with multiple pages, it's good practice to check a few individual page URLs in the browser while moving through the listed pages. Most of the time, these URLs follow a pattern that solves the puzzle easily, as used in the following code:

import lxml.html
from lxml.etree import XPath

baseUrl = "http://books.toscrape.com/"

#Main URL
bookUrl = "http://books.toscrape.com/catalogue/category/books/food-and-drink_33/index.html"

#Page URL Pattern obtained (eg: page-1.html, page-2.html...)
pageUrl = "http://books.toscrape.com/catalogue/category/books/food-and-drink_33/page-"
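
As a minimal sketch of how the pattern in pageUrl turns into concrete page addresses, the following builds the URLs for the first two pages. The helper name build_page_url and the page range are assumptions made for illustration only; they are not part of the original example:

```python
pageUrl = "http://books.toscrape.com/catalogue/category/books/food-and-drink_33/page-"

def build_page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    return pageUrl + str(page) + ".html"

# Build the URLs for pages 1 and 2 without requesting them
page_urls = [build_page_url(p) for p in range(1, 3)]
for url in page_urls:
    print(url)
```

No request is made here; the sketch only shows that the page number is the single part of the URL that changes.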

bookUrl is the main URL we are interested in; it also contains the link to the next page, which follows the pattern found in pageUrl, for example, page-2.html:

dataSet = []
page = 1
totalPages = 1
while page <= totalPages:
    print("Rows in Dataset: " + str(len(dataSet)))
    # every page, including the first, follows the page-N.html pattern
    doc = lxml.html.parse(pageUrl + str(page) + ".html").getroot()
    if page == 1:
        # the result counts are read once, from the form at the top of page 1
        perPageArticles = doc.xpath("//*[@id='default']//form/strong[3]/text()")
        totalArticles = doc.xpath("//*[@id='default']//form/strong[1]/text()")
        total, perPage = int(totalArticles[0]), int(perPageArticles[0])
        # round up so that a partly filled last page is still visited
        totalPages = (total + perPage - 1) // perPage
        print(str(total) + " Results, showing " + str(perPage) + " Articles per page")

    # used to find page URL pattern
    nextPage = doc.xpath("//*[@id='default']//ul[contains(@class,'pager')]/li[2]/a/@href")
    if len(nextPage) > 0:
        print("Scraping Page " + str(page) + " of " + str(totalPages) + ". NextPage > " + str(nextPage[0]))
    else:
        print("Scraping Page " + str(page) + " of " + str(totalPages))

An empty dataSet list is defined to hold data found from each article across pages.
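
The loop count above depends on dividing the total number of results by the number shown per page. As a small sketch of that arithmetic, the hypothetical helper count_pages below rounds the division up, so that a partly filled last page is still counted. The sample counts are illustrative and were not read from the site:

```python
def count_pages(total_articles, per_page):
    """Number of pages needed to list total_articles (hypothetical helper)."""
    # ceiling division with integers only, no float rounding involved
    return (total_articles + per_page - 1) // per_page

# Illustrative counts only
print(count_pages(30, 20))   # a full page plus a partly filled one
print(count_pages(40, 20))   # two full pages
print(count_pages(21, 20))   # one full page plus a single extra item

# Rounding to the nearest integer would miss the last page in this case
nearest = round(21 / 20)
```

Rounding to the nearest integer happens to work when the last page is at least half full, which is why rounding up is the safer choice for a loop bound.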

An individual page URL is obtained by concatenating pageUrl, the page number, and .html. totalPages is calculated from totalArticles and perPageArticles, both of which are read from the first page itself. Knowing totalPages gives an exact loop count, which makes the while loop in the code easier to manage:

articles = XPath("//*[@id='default']//ol/li[position()>0]")

titlePath = XPath(".//article[contains(@class,'product_pod')]/h3/a/text()")
pricePath = XPath(".//article/div[2]/p[contains(@class,'price_color')]/text()")
stockPath = XPath(".//article/div[2]/p[2][contains(@class,'availability')]/text()[normalize-space()]")
imagePath = XPath(".//article/div[1][contains(@class,'image_container')]/a/img/@src")
starRating = XPath(".//article/p[contains(@class,'star-rating')]/@class")

As we can see in the previous code, articles is the main XPath expression; it is used to loop over the individual <article> elements. The expression needs a condition that every item to be looped over fulfils; in this case, we identified that each <article> element sits inside an <ol><li> element.

So, we can perform a loop with li[position()>0], which matches every <li> found in the <ol> (position() starts at 1, so the condition holds for each of them) and therefore every <article> element inside it, that is, articles = XPath("//*[@id='default']//ol/li[position()>0]"):

    # looping through 'articles' found in 'doc', that is, each <li><article> found in the page source
    for row in articles(doc):
        title = titlePath(row)[0]
        price = pricePath(row)[0]
        availability = stockPath(row)[0].strip()
        image = imagePath(row)[0]
        rating = starRating(row)[0]

        # cleaning and formatting applied to image and rating:
        # the relative prefix, including its trailing slash, is swapped for baseUrl,
        # and the rating word is stripped of the leftover space
        dataSet.append([title, price, availability,
                        image.replace('../../../../', baseUrl),
                        rating.replace('star-rating', '').strip()])

    page += 1  # updating page count for the while loop

#Final Dataset with data from all pages.
print(dataSet)
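
The cleaning step in the loop above relies on string replacement. As an alternative sketch, not the approach used in this example, the standard library's urljoin can resolve a relative image path against the URL of the page it was found on, and the rating word can be mapped to a number. The sample path, the RATING_WORDS mapping, and the helper names are assumptions made for illustration:

```python
from urllib.parse import urljoin

# Assumed sample values, shaped like what the XPath expressions return
page_url = "http://books.toscrape.com/catalogue/category/books/food-and-drink_33/page-1.html"
image_src = "../../../../media/cache/ab/cd/example.jpg"  # hypothetical path
rating_class = "star-rating Three"

# Hypothetical mapping from the rating word to a number
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_image(src, base=page_url):
    """Resolve a relative src against the page it appeared on."""
    return urljoin(base, src)

def clean_rating(class_value):
    """Turn a class value such as 'star-rating Three' into 3 (None if unknown)."""
    word = class_value.replace("star-rating", "").strip()
    return RATING_WORDS.get(word)

image_url = clean_image(image_src)
rating = clean_rating(rating_class)
print(image_url, rating)
```

Resolving against the page URL avoids counting the ../ segments by hand, which matters if the same code is later pointed at pages nested at a different depth.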

The individual XPath expressions, titlePath, imagePath, and so on, each target a particular element whose data is to be obtained. Finally, the articles expression is applied to the HTMLElement obtained for each page, that is, doc, and for each match the loop collects the first occurrence of the title, image, and other elements. The collected data is appended to dataSet as a list, with the cleaning and formatting applied, which results in the output shown in the following screenshot:

Output with paging information and dataSet contents
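
The looping technique used in this example can also be tried without network access. The following sketch applies compiled XPath objects to a small inline HTML fragment; the markup is a hypothetical, trimmed-down imitation of a listing page, not a copy of the real page source, and lxml is assumed to be installed:

```python
import lxml.html
from lxml.etree import XPath

# Hypothetical markup imitating the <ol><li><article> structure of a listing
SAMPLE_HTML = """
<html><body id="default">
<ol class="row">
  <li><article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="a.html">Book A</a></h3>
  </article></li>
  <li><article class="product_pod">
    <p class="star-rating Five"></p>
    <h3><a href="b.html">Book B</a></h3>
  </article></li>
</ol>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Outer expression: one match per <li>; inner expressions are relative (".//")
articles = XPath("//*[@id='default']//ol/li[position()>0]")
titlePath = XPath(".//article/h3/a/text()")
ratingPath = XPath(".//article/p[contains(@class,'star-rating')]/@class")

rows = []
for row in articles(doc):
    # each compiled expression is evaluated against the current <li> only
    title = titlePath(row)[0]
    rating = ratingPath(row)[0].replace("star-rating", "").strip()
    rows.append([title, rating])

print(rows)
```

The leading dot in the inner expressions is what keeps each lookup scoped to the current row; without it, every iteration would search the whole document again.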