Chapter 8. Creating a Web Scraper

In this chapter, we will create a website scraper using the different searching and navigation techniques we studied in the previous chapters. The scraper will visit three websites to find the selling price of books based on their ISBNs. We will first find the book list and selling prices from packtpub.com. We will also find the ISBN of each book from packtpub.com and use it to search other websites, such as www.amazon.com and www.barnesandnoble.com. By doing this, we will automate the process of finding the selling price of the books on three websites and also get hands-on experience in implementing scrapers for these websites. Since the website structure may change later, the code examples and images used in this chapter may become invalid. So, it is better to treat the examples as a reference and change the code accordingly. It is also a good idea to visit these websites for a better understanding.

Getting book details from PacktPub.com

Getting book details from www.packtpub.com is the first step in the creation of the scraper. We need to find the following details from PacktPub.com:

  • Book title
  • Selling price
  • ISBN

We have seen how to scrape the book title and the selling price from packtpub.com in Chapter 3, Search Using Beautiful Soup. The example we discussed in that chapter considered only the first page and didn't include the other pages that also list books. So, in the next topic, we will find the different pages containing a list of books.

Finding pages with a list of books

The page at www.packtpub.com/books has the next and previous navigation links to go back and forth between the pages containing a list of books, as shown in the following screenshot:

[Screenshot: the next and previous navigation links on the page at www.packtpub.com/books]

So, we need a method for getting the multiple pages that contain the list of books. Logically, we should follow the page pointed to by the next element of the current page. Taking a look at the next element for page 49 using the Google Chrome developer tools, we can see that it points to the next page's link, that is, /books?page=49. If we observe different pages using the developer tools, we can see that the link to the next page follows the pattern /books?page=n for the (n+1)th page, that is, n=49 for the 50th page, as shown in the following screenshot:

[Screenshot: the next element's link, /books?page=49, in the Google Chrome developer tools]

From the preceding screenshot, we can further understand that the next element is within the <li> tag with class="pager-next last". Inside the <li> tag, there is an <a> tag that holds the link to the next page. In this case, the corresponding value is /books?page=49, which points to the 50th page. We have to prefix this value with www.packtpub.com to make a valid URL, that is, www.packtpub.com/books?page=49.
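To make this concrete, the following is a minimal, self-contained sketch. The HTML fragment is a simplified reconstruction of the pager markup described above (the live site's markup may differ), and it shows how find() locates the <li> tag and how we read the href value from the <a> tag inside it:

from bs4 import BeautifulSoup

# A simplified reconstruction of the pager markup (assumed, not
# copied from the live site).
sample_html = """
<ul class="pager">
  <li class="pager-previous first"><a href="/books?page=47">previous</a></li>
  <li class="pager-next last"><a href="/books?page=49">next</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "lxml")
next_li = soup.find("li", class_="pager-next last")
print(next_li.a.get('href'))   # prints /books?page=49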

If we analyze the packtpub.com website, we can see that the list of published books ends at page 50. So, we need to ensure that our program stops at this page. The program can stop looking for more pages if it is unable to find the next element.

For example, as shown in the following screenshot, for page 50, we don't have the next element:

[Screenshot: page 50 has no next element]

So at this point, we can stop looking for further pages.

Our logic for getting pages should be as follows:

  1. Start with the first page.
  2. Check if it has a next element:
    • If yes, store the next page URL
    • If no, stop looking for further pages
  3. Load the page at the stored URL and repeat step 2.

We can use the following code to find pages containing a list of books from packtpub.com:

import urllib2
import re
from bs4 import BeautifulSoup

# Base URL used to turn relative links such as /books?page=49 into
# absolute URLs (no trailing slash, since each href begins with one).
packtpub_url = "http://www.packtpub.com"

We stored http://www.packtpub.com (without a trailing slash) in the packtpub_url variable. Each next element link should be prefixed with packtpub_url to form a valid URL, such as http://www.packtpub.com/books?page=n, as shown in the following code:

def get_bookurls(url):
  # Download the current page and parse it with the lxml parser.
  page = urllib2.urlopen(url)
  soup_packtpage = BeautifulSoup(page, "lxml")
  page.close()
  # The next link lives inside <li class="pager-next last">.
  next_page_li = soup_packtpage.find("li", class_="pager-next last")
  if next_page_li is None:
    # No next element: this is the last page.
    next_page_url = None
  else:
    # Prefix the relative link (for example, /books?page=49)
    # with the base URL to form an absolute URL.
    next_page_url = packtpub_url + next_page_li.a.get('href')
  return next_page_url

The preceding get_bookurls() function returns the next page URL if we provide the current page URL. For the last page, it returns None.

In get_bookurls(), we created a BeautifulSoup object, soup_packtpage, from the URL input and then searched for the li tag with the pager-next last class. If find() returns a tag, we get the link to the next page using next_page_li.a.get('href'), prefix it with packtpub_url, and return the result.
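As a side note, string concatenation breaks if the base URL ends with a slash or if the href is already absolute. The standard library's urljoin function handles both cases; the following is an optional sketch for reference, not the approach used in the rest of this chapter:

from urlparse import urljoin  # urllib.parse.urljoin in Python 3

# urljoin resolves a relative link against a base URL, taking care
# of extra slashes and already-absolute links.
print(urljoin("http://www.packtpub.com/", "/books?page=49"))
# prints http://www.packtpub.com/books?page=49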

We need a list of such page URLs (for example, www.packtpub.com/books, www.packtpub.com/books?page=2, and so on) to collect the details of all the books on those pages.

We can create this list using the following code:

start_url = "http://www.packtpub.com/books"
continue_scraping = True
books_url = [start_url]
while continue_scraping:
  next_page_url = get_bookurls(start_url)
  if next_page_url is None:
    # get_bookurls() found no next element: we reached the last page.
    continue_scraping = False
  else:
    books_url.append(next_page_url)
    start_url = next_page_url

In the preceding code, we started with the URL http://www.packtpub.com/books and stored it in the books_url list. We used a flag, continue_scraping, to control the loop, and the loop terminates when get_bookurls() returns None.

Printing books_url displays the different URLs, from www.packtpub.com/books to www.packtpub.com/books?page=49.

Finding book details

Now, it is time for us to find the details of each book, such as the book title, selling price, and ISBN. The book title and selling price can be found on the main pages with the list of books. But the ISBN can be found only on the details page of each book. So, from the main pages, for example, www.packtpub.com/books, we have to follow the corresponding link to fetch the details of each book.

We can use the following code to find the details of each book:

def get_bookdetails(url):
  # Download and parse a page containing the list of books.
  page = urllib2.urlopen(url)
  soup_packtpage = BeautifulSoup(page, "lxml")
  page.close()
  # The list of books is laid out as a grid inside a table.
  all_books_table = soup_packtpage.find("table", class_="views-view-grid")
  all_book_titles = all_books_table.find_all("div", class_="views-field-title")
  isbn_list = []
  for book_title in all_book_titles:
    book_title_span = book_title.span
    print("Title Name:" + book_title_span.a.string)
    print("Url:" + book_title_span.a.get('href'))
    # The selling price is stored in a div that follows the title.
    price = book_title.find_next("div", class_="views-field-sell-price")
    print("PacktPub Price:" + price.span.string)
    # Follow the link to the book's details page to fetch its ISBN.
    isbn_list.append(get_isbn(book_title_span.a.get('href')))
  return isbn_list

The preceding code is almost the same as the code we used in Chapter 3, Search Using Beautiful Soup, to get the book details. The additions are the isbn_list list, which holds the ISBNs, and the get_isbn() function, which returns the ISBN of a particular book.

The ISBN of a book is stored on the book's details page, as shown in the following screenshot:

[Screenshot: the ISBN on a book's details page]

In the preceding get_bookdetails() function, the expression book_title_span.a.get('href') holds the relative URL of each book's details page. We pass this value to the get_isbn() function to get the ISBN.

The details page of a book, when viewed through the developer tools, has the ISBN, as shown in the following screenshot:

[Screenshot: the ISBN inside the details page's markup, viewed in the developer tools]

From the preceding screenshot, we can see that the label ISBN : is stored inside a b tag, and the ISBN itself is the text that immediately follows the b tag.

Now, in the following code, let us see how we can find the ISBN using the get_isbn() function:

def get_isbn(url):
  # The href value is relative, so prefix it with the base URL.
  book_title_url = packtpub_url + url
  page = urllib2.urlopen(book_title_url)
  soup_bookpage = BeautifulSoup(page, "lxml")
  page.close()
  # Find the <b> tag whose text contains the label "ISBN :".
  isbn_tag = soup_bookpage.find('b', text=re.compile("ISBN :"))
  # The ISBN itself is the text node right after the <b> tag.
  return isbn_tag.next_sibling

In the preceding code, we searched for the b tag whose text matches the pattern ISBN :. The ISBN itself is the next_sibling of this b tag.
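To see why next_sibling gives us the ISBN, consider the following minimal, self-contained sketch. The HTML fragment is a simplified reconstruction of the details page markup, and the ISBN value in it is a made-up placeholder used only for illustration:

import re
from bs4 import BeautifulSoup

# A simplified reconstruction of the details page markup (assumed);
# the ISBN value is a placeholder, not a real one.
sample_html = "<div><b>ISBN : </b>1234567890123</div>"

soup = BeautifulSoup(sample_html, "lxml")
isbn_tag = soup.find('b', text=re.compile("ISBN :"))
# next_sibling is the text node that comes right after the <b> tag.
print(isbn_tag.next_sibling)   # prints 1234567890123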

On each main page, there will be a list of books, and each book will have an ISBN. So, we need to call the get_bookdetails() function for each URL in the books_url list, as follows:

isbns = []
for bookurl in books_url:
  # Collect the ISBNs returned for each page of books.
  isbns += get_bookdetails(bookurl)

Printing isbns will display the list of ISBNs for all the books currently published by packtpub.com.

We scraped the selling price, book title, and ISBN from the PacktPub website. We will use the ISBN to search for the selling price of the same books on both www.amazon.com and www.barnesandnoble.com. With that, our scraper will be complete.
