In this chapter, we will create a website scraper using the different searching and navigation techniques we studied in the previous chapters. The scraper will visit three websites to find the selling price of books based on their ISBN. We will first find the book list and selling prices from packtpub.com. We will also find the ISBN of each book from packtpub.com and search other websites, such as www.amazon.com and www.barnesandnoble.com, based on this ISBN. By doing this, we will automate the process of finding the selling price of the books on three websites and will also get hands-on experience in implementing scrapers for these websites. Since the website structure may change over time, the code examples and images used in this chapter may become invalid. So, it is better to take the examples as a reference and change the code accordingly. It also helps to visit these websites for a better understanding.
Getting book details from www.packtpub.com is the first step in the creation of the scraper. We need to find the following details from packtpub.com: the book title, the selling price, and the ISBN of each book.
We have seen how to scrape the book title and the selling price from packtpub.com in Chapter 3, Search Using Beautiful Soup. The example we discussed in that chapter considered only the first page and didn't include the other pages that also had the list of books. So in the next topic, we will find different pages containing a list of books.
The page at www.packtpub.com/books has the next and previous navigation links to go back and forth between the pages containing a list of books, as shown in the following screenshot:
So, we need a method for getting the multiple pages that contain the list of books. A logical approach is to follow the page pointed to by the next element on the current page. Taking a look at the next element for page 49 using the Google Chrome developer tools, we can see that it points to the next page link, that is, /books?page=49. If we observe different pages using the developer tools, we can see that the link to the next page has a pattern of /books?page=n for the (n+1)th page, that is, n=49 for the 50th page, as shown in the following screenshot:
From the preceding screenshot, we can further understand that the next element is within the <li> tag with class="pager-next last". Inside the <li> tag, there is an <a> tag that holds the link to the next page. In this case, the corresponding value is /books?page=49, which points to the 50th page. We have to add www.packtpub.com to this value to make a valid URL, as www.packtpub.com/books?page=49.
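The URL pattern described above can be sketched in a few lines. This is a minimal illustration, assuming the base URL and the /books?page=n scheme shown in the screenshots; if the site has changed, the pattern will need to be updated.

```python
# Sketch of the page URL pattern: page 1 is /books with no query string,
# and page n+1 is reached through /books?page=n.
packtpub_base = "http://www.packtpub.com"

def page_url(n):
    # n is the value in the query string, so n=0 means the first page.
    if n == 0:
        return packtpub_base + "/books"
    return packtpub_base + "/books?page=%d" % n

print(page_url(0))   # http://www.packtpub.com/books
print(page_url(49))  # http://www.packtpub.com/books?page=49, the 50th page
```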
If we analyze the packtpub.com website, we can see that the list of published books ends at page 50. So, we need to ensure that our program stops at this page. The program can stop looking for more pages if it is unable to find the next element.
For example, as shown in the following screenshot, for page 50, we don't have the next element:
So at this point, we can stop looking for further pages.
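The stop test can be illustrated without fetching any pages. The following stdlib-only sketch uses a regular expression on two hand-written HTML fragments (the fragments are made up for illustration; only the "pager-next last" class name comes from the screenshots). The actual scraper performs the same check with Beautiful Soup's find(), as we will see shortly.

```python
import re

# Look for the <a> link inside an <li> with class "pager-next last".
NEXT_LINK = re.compile(r'<li class="pager-next last">\s*<a href="([^"]+)"')

def next_page_href(html):
    # Return the next page's relative URL, or None when there is no
    # next element -- which is our signal to stop scraping.
    match = NEXT_LINK.search(html)
    return match.group(1) if match else None

page_49 = '<li class="pager-next last"><a href="/books?page=49">next</a></li>'
page_50 = '<li class="pager-first"><a href="/books">first</a></li>'  # last page

print(next_page_href(page_49))  # /books?page=49
print(next_page_href(page_50))  # None, so we stop
```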
Our logic for getting pages should be as follows:
We can use the following code to find pages containing a list of books from packtpub.com:
import urllib2
import re
from bs4 import BeautifulSoup

packtpub_url = "http://www.packtpub.com/"
We stored http://www.packtpub.com/ in the packtpub_url variable. Each next element link should be prefixed with packtpub_url to form a valid URL, http://www.packtpub.com/books?page=n, as shown in the following code:
def get_bookurls(url):
    page = urllib2.urlopen(url)
    soup_packtpage = BeautifulSoup(page, "lxml")
    page.close()
    next_page_li = soup_packtpage.find("li", class_="pager-next last")
    if next_page_li is None:
        next_page_url = None
    else:
        next_page_url = packtpub_url + next_page_li.a.get('href')
    return next_page_url
The preceding get_bookurls() function returns the next page URL when we provide the current page URL. For the last page, it returns None.

In get_bookurls(), we created a BeautifulSoup object, soup_packtpage, based on the URL input, and then searched for the li tag with the pager-next last class. If find() returns a tag, we get the link to the next page using next_page_li.a.get('href'), prefix this value with packtpub_url, and return it.
We need a list of such page URLs (for example www.packtpub.com/books, www.packtpub.com/books?page=2, and so on) to collect details of all the books on those pages.
In order to create this list, we can use the following code:
start_url = "http://www.packtpub.com/books"
continue_scraping = True
books_url = [start_url]
while continue_scraping:
    next_page_url = get_bookurls(start_url)
    if next_page_url is None:
        continue_scraping = False
    else:
        books_url.append(next_page_url)
        start_url = next_page_url
In the preceding code, we started with the URL http://www.packtpub.com/books and stored it in the books_url list. We used a flag, continue_scraping, to control the execution of the loop, and we can see that the loop terminates when get_bookurls() returns None.
The print(books_url) entry prints the different URLs, from www.packtpub.com/books to www.packtpub.com/books?page=49.
Now, it is time for us to find the details of each book, such as the book title, selling price, and ISBN. The book title and selling price can be found from the main page with the list of books. But the ISBN can be found only on the details page of each book. So from the main pages, for example, www.packtpub.com/books, we have to find the corresponding link to fetch the details of each book.
We can use the following code to find the details of each book:
def get_bookdetails(url):
    page = urllib2.urlopen(url)
    soup_packtpage = BeautifulSoup(page, "lxml")
    page.close()
    all_books_table = soup_packtpage.find("table", class_="views-view-grid")
    all_book_titles = all_books_table.find_all("div", class_="views-field-title")
    isbn_list = []
    for book_title in all_book_titles:
        book_title_span = book_title.span
        print("Title Name:" + book_title_span.a.string)
        print("Url:" + book_title_span.a.get('href'))
        price = book_title.find_next("div", class_="views-field-sell-price")
        print("PacktPub Price:" + price.span.string)
        isbn_list.append(get_isbn(book_title_span.a.get('href')))
    return isbn_list
The preceding code is the same as the code we used in Chapter 3, Search Using Beautiful Soup, to get the book details. The additions are the isbn_list list, which holds the ISBN numbers, and the get_isbn() function, which returns the ISBN for a particular book.
The ISBN of a book is stored in the book's details page, as shown in the following screenshot:
In the preceding get_bookdetails() function, book_title_span.a.get('href') holds the URL of the details page of each book. We pass this value to the get_isbn() function to get the ISBN.
The details page of a book, when viewed through the developer tools, has the ISBN, as shown in the following screenshot:
From the preceding screenshot, we can see that the ISBN is stored as the text that immediately follows the b tag containing the ISBN label.
Now, in the following code, let us see how we can find the ISBN using the get_isbn() function:
def get_isbn(url):
    book_title_url = packtpub_url + url
    page = urllib2.urlopen(book_title_url)
    soup_bookpage = BeautifulSoup(page, "lxml")
    page.close()
    isbn_tag = soup_bookpage.find('b', text=re.compile("ISBN :"))
    return isbn_tag.next_sibling
In the preceding code, we searched for the b tag whose text matches the pattern ISBN :. The ISBN itself is the next_sibling of this b tag.
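The "text after the b label" idea can be shown on a hand-written markup fragment using only the standard library. Both the fragment and the ISBN value below are made up for illustration (the real details page will differ); get_isbn() extracts the same text with Beautiful Soup's next_sibling.

```python
import re

# Hypothetical details-page fragment: the label sits inside <b>,
# and the ISBN itself is the text node that follows the closing tag.
detail_fragment = '<b>ISBN :</b>978-1-78216-000-0'

match = re.search(r'<b>ISBN :</b>\s*([\dX-]+)', detail_fragment)
print(match.group(1))  # 978-1-78216-000-0
```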
On each main page, there will be a list of books, and for each book, there will be an ISBN. So, we need to call the get_bookdetails() function for each URL in the books_url list, as follows:
isbns = []
for bookurl in books_url:
    isbns += get_bookdetails(bookurl)
The print(isbns) statement will print the list of ISBNs for all the books currently published by packtpub.com.
We scraped the selling price, book title, and ISBN from the PacktPub website. We will use the ISBN to search for the selling price of the same books in both www.amazon.com and www.barnesandnoble.com. With that, our scraper will be complete.
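As a forward-looking sketch of that next step, each ISBN can be turned into a search URL for the other two stores. The query formats below are assumptions (both sites change their URL schemes over time), so verify them in a browser before relying on them.

```python
# Build candidate search URLs from an ISBN. The path and query-string
# formats are assumptions, not documented APIs of either site.
def search_urls(isbn):
    isbn = isbn.strip()  # the scraped ISBN text may carry whitespace
    return {
        "amazon": "http://www.amazon.com/s?k=" + isbn,
        "barnesandnoble": "http://www.barnesandnoble.com/s/" + isbn,
    }

urls = search_urls(" 978-1-78216-000-0 ")  # hypothetical ISBN for illustration
print(urls["amazon"])
print(urls["barnesandnoble"])
```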