Example 2 – scraping book information

In this example, we will automate the browser to follow the category and pagination links from the main URL provided. We are interested in extracting details from the Food and Drink category, across multiple pages, from http://books.toscrape.com/index.html.

An individual page from the category contains listings of products (books), with certain information for each listing, as follows (a sketch of the resulting record appears after this list):

  • title: Title of the book listed
  • titleLarge: Complete title of the book listed (found as the value of the title attribute)
  • price: Listed book price
  • stock: Stock information for the listed book
  • image: URL of the book image
  • starRating: Rating (number of stars found)
  • url: URL of each listed book
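
Taken together, these fields form one record per book. The following is a minimal sketch of such a record, shown here as a hypothetical Python dictionary (the complete code later in this example collects the same fields into a list); the values are illustrative, taken from the output shown at the end of this example:

# Hypothetical shape of one scraped record; values are illustrative
book = {
    'title': 'Foolproof Preserving: A Guide ...',
    'titleLarge': 'Foolproof Preserving: A Guide to Small Batch Jams, Jellies, ...',
    'price': '£30.52',
    'stock': 'In stock',
    'image': 'http://books.toscrape.com/media/cache/9f/59/9f59f01fa916a7bb8f0b28a4012179a4.jpg',
    'starRating': 'Three',
    'url': 'http://books.toscrape.com/catalogue/foolproof-preserving-...-more_978/index.html',
}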


A similar example was also shown in Chapter 3, Using LXML, XPath and CSS Selectors, in the section named Web Scraping Using LXML, under the name Example 2 – Looping with XPath and scraping data from multiple pages. There, we used the Python library lxml.

With selenium.webdriver imported and the Chrome driver path set up, let's start by loading http://books.toscrape.com/index.html. As the main page loads, we will see the various categories listed one below the other.

The targeted category contains the text Food and Drink and can be found using find_element_by_link_text() (we can use any applicable find_element... method to locate the particular category). The element found is processed further with click(), which clicks the element returned. This action will load the particular category URL in the browser:

driver.get('http://books.toscrape.com/index.html')

# Click the 'Food and Drink' category link and confirm the new URL
driver.find_element_by_link_text("Food and Drink").click()
print("Current Page URL: ", driver.current_url)

# Read the total result count shown on the category page
totalBooks = driver.find_element_by_xpath("//*[@id='default']//form/strong[1]")
print("Found: ", totalBooks.text)

To deal with the multiple pages found during iteration, NoSuchElementException from selenium.common.exceptions will be imported:

from selenium.common.exceptions import NoSuchElementException

As we will be using the pagination button next, NoSuchElementException will be helpful for dealing with the condition where no further next link or page is found.

As seen in the following code, the pagination option next is located on the page and processed with the click() action. This action loads the URL it points to in the browser, and the iteration continues until next can no longer be located on the page, at which point the except block in the code catches the NoSuchElementException:

try:
    # Check for pagination with the text 'next'
    driver.find_element_by_link_text('next').click()
    continue  # proceed with the next iteration of the while loop
except NoSuchElementException:
    page = False  # no further pages; stop the iteration
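
The find_element_by_... helper methods used throughout this example belong to older Selenium releases and have been removed from recent Selenium 4 versions. If you are running a newer Selenium, the same pagination check can be written with the By locator class; a minimal equivalent sketch:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

try:
    # Selenium 4 equivalent of find_element_by_link_text('next')
    driver.find_element(By.LINK_TEXT, 'next').click()
except NoSuchElementException:
    page = False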

The complete code for this example is listed as follows:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

chrome_path = 'chromedriver'
driver = webdriver.Chrome(executable_path=chrome_path)
driver.get('http://books.toscrape.com/index.html')

dataSet = []

# Open the 'Food and Drink' category and read the total result count
driver.find_element_by_link_text("Food and Drink").click()
totalBooks = driver.find_element_by_xpath("//*[@id='default']//form/strong[1]")

page = True
while page:
    # Collect all product listings on the current page
    listings = driver.find_elements_by_xpath("//*[@id='default']//ol/li[position()>0]")
    for listing in listings:
        url = listing.find_element_by_xpath(".//article[contains(@class,'product_pod')]/h3/a").get_attribute('href')
        title = listing.find_element_by_xpath(".//article[contains(@class,'product_pod')]/h3/a").text
        titleLarge = listing.find_element_by_xpath(".//article[contains(@class,'product_pod')]/h3/a").get_attribute('title')
        price = listing.find_element_by_xpath(".//article/div[2]/p[contains(@class,'price_color')]").text
        stock = listing.find_element_by_xpath(".//article/div[2]/p[2][contains(@class,'availability')]").text
        image = listing.find_element_by_xpath(".//article/div[1][contains(@class,'image_container')]/a/img").get_attribute('src')
        starRating = listing.find_element_by_xpath(".//article/p[contains(@class,'star-rating')]").get_attribute('class')
        dataSet.append([titleLarge, title, price, stock, image, starRating.replace('star-rating ', ''), url])

    try:
        # Check for pagination with the text 'next'
        driver.find_element_by_link_text('next').click()
        continue
    except NoSuchElementException:
        page = False

driver.close()
driver.quit()
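
At this point, the collected rows can also be written to disk. The following is a minimal sketch using Python's built-in csv module (not part of the original example; the filename food_and_drink.csv and the header row are arbitrary choices, with columns ordered to match the dataSet.append() call above):

import csv

# Column order matches the dataSet.append() call in the code above
with open('food_and_drink.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['titleLarge', 'title', 'price', 'stock', 'image', 'starRating', 'url'])
    writer.writerows(dataSet)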

Finally, upon completion of the iteration, dataSet will contain the listing data for all pages, as follows:

[['Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More', 'Foolproof Preserving: A Guide ...','£30.52','In stock', 'http://books.toscrape.com/media/cache/9f/59/9f59f01fa916a7bb8f0b28a4012179a4.jpg','Three','http://books.toscrape.com/catalogue/foolproof-preserving-a-guide-to-small-batch-jams-jellies-pickles-condiments-and-more-a-foolproof-guide-to-making-small-batch-jams-jellies-pickles-condiments-and-more_978/index.html'], ['The Pioneer Woman Cooks: Dinnertime: Comfort Classics, Freezer Food, 16-Minute Meals, and Other Delicious Ways to Solve Supper!', 'The Pioneer Woman Cooks: ...', '£56.41', 'In stock', 'http://books.toscrape.com/media/cache/b7/f4/b7f4843dbe062d44be1ffcfa16b2faa4.jpg', 'One', 'http://books.toscrape.com/catalogue/the-pioneer-woman-cooks-dinnertime-comfort-classics-freezer-food-16-minute-meals-and-other-delicious-ways-to-solve-supper_943/index.html'],................, 
['Hungry Girl Clean & Hungry: Easy All-Natural Recipes for Healthy Eating in the Real World', 'Hungry Girl Clean & ...', '£33.14', 'In stock', 'http://books.toscrape.com/media/cache/6f/c4/6fc450625cd672e871a6176f74909be2.jpg', 'Three', 'http://books.toscrape.com/catalogue/hungry-girl-clean-hungry-easy-all-natural-recipes-for-healthy-eating-in-the-real-world_171/index.html']]

In this section, we explored methods and properties from selenium.webdriver and put them to use in a web scraping activity.
