Chapter 12. Avoiding Scraping Traps

There are few things more frustrating than scraping a site, viewing the output, and not seeing the data that’s so clearly visible in your browser. Or submitting a form that should be perfectly fine but gets denied by the web server. Or getting your IP address blocked by a site for unknown reasons.

These are some of the most difficult bugs to solve, not only because they can be so unexpected (a script that works just fine on one site might not work at all on another, seemingly identical, site), but because they purposefully come with no telltale error messages or stack traces to work from. You’ve been identified as a bot, rejected, and you don’t know why.

In this book, I’ve written about a lot of ways to do tricky things on websites (submitting forms, extracting and cleaning difficult data, executing JavaScript, etc.). This chapter is a bit of a catchall in that the techniques stem from a wide variety of subjects (HTTP headers, CSS, and HTML forms, to name a few). However, they all have something in common: they are meant to overcome an obstacle put in place for the sole purpose of preventing automated web scraping of a site.

Regardless of how immediately useful this information is to you at the moment, I highly recommend you at least skim this chapter. You never know when it might help you solve a very difficult bug or prevent a problem altogether.

A Note on Ethics

In the first few chapters of this book, I discussed the legal gray area that web scraping inhabits, as well as some of the ethical guidelines to scrape by. To be honest, this chapter is, ethically, perhaps the most difficult one for me to write. My websites have been plagued by bots, spammers, web scrapers, and all manner of unwanted virtual guests, as perhaps yours have been. So why teach people how to build better bots? 

There are a few reasons why I believe this chapter is important to include:

  • There are perfectly ethical and legally sound reasons to scrape some websites that do not want to be scraped. In a previous job I had as a web scraper, I performed an automated collection of information from websites that were publishing clients’ names, addresses, telephone numbers, and other personal information to the Internet without their consent. I used the scraped information to make formal requests to the websites to remove this information. In order to avoid competition, these sites guarded this information from scrapers vigilantly. However, my work to ensure the anonymity of my company’s clients (some of whom had stalkers, were the victims of domestic violence, or had other very good reasons to want to keep a low profile) made a compelling case for web scraping, and I was grateful that I had the skills necessary to do the job. 
  • Although it is almost impossible to build a “scraper proof” site (or at least one that can still be easily accessed by legitimate users), I hope that the information in this chapter will help those wanting to defend their websites against malicious attacks. Throughout, I will point out some of the weaknesses in each web scraping technique, which you can use to defend your own site. Keep in mind that most bots on the Web today are merely doing a broad scan for information and vulnerabilities, and employing even a couple of simple techniques described in this chapter will likely thwart 99% of them. However, they are getting more sophisticated every month, and it’s best to be prepared.
  • Like most programmers, I simply don’t believe that withholding any sort of educational information is a net positive thing to do. 

While you’re reading this chapter, keep in mind that many of the scripts and techniques described here should not be run against every site you can find. Not only is it not a nice thing to do, but you could wind up receiving a cease-and-desist letter or worse. But I’m not going to beat you over the head with this every time we discuss a new technique. So, for the rest of this book—as the philosopher Gump once said—“That’s all I have to say about that.”

Looking Like a Human

The fundamental challenge for sites that do not want to be scraped is figuring out how to tell bots from humans. Although many of the techniques sites use (such as CAPTCHAs) can be difficult to fool, there are a few fairly easy things you can do to make your bot look more human.

Adjust Your Headers

In Chapter 9, we used the requests module to handle forms on a website. The requests module is also excellent for setting headers. HTTP headers are a list of attributes, or preferences, sent by you every time you make a request to a web server. HTTP defines dozens of obscure header types, most of which are not commonly used. The following seven fields, however, are consistently used by most major browsers when initiating any connection (shown with example data from my own browser):

Host: www.google.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Referer: https://www.google.com/
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

And here are the headers that a typical Python scraper using the default urllib library might send:

Accept-Encoding: identity
User-Agent: Python-urllib/3.4

If you’re a website administrator trying to block scrapers, which one are you more likely to let through? 

Fortunately, headers can be completely customized using the requests module. The website https://www.whatismybrowser.com is great for testing the browser properties viewable by servers. We’ll scrape this website to verify that the headers the server sees are the ones we set, using the following script:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
           "Accept": "text/html,application/xhtml+xml,application/xml;"
                     "q=0.9,image/webp,*/*;q=0.8"}
url = ("https://www.whatismybrowser.com/"
       "developers/what-http-headers-is-my-browser-sending")
req = session.get(url, headers=headers)

bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("table", {"class": "table-striped"}).get_text())

The output should show that the headers are now the same ones set in the headers dictionary object in the code.

Although it is possible for websites to check for “humanness” based on any of the properties in HTTP headers, I’ve found that typically the only setting that really matters is the User-Agent. It’s a good idea to keep this one set to something more inconspicuous than Python-urllib/3.4, regardless of what project you are working on. In addition, if you ever encounter an extremely suspicious website, populating one of the commonly used but rarely checked headers such as Accept-Language might be the key to convincing it you’re a human.
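
As a minimal sketch (the header values here are illustrative; copy the ones your own browser actually sends), attaching a User-Agent and an Accept-Language header to every request made through a requests session might look like this:

import requests

# Illustrative browser-like headers; copy real values from your own browser
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/39.0.2171.95 Safari/537.36'),
    'Accept-Language': 'en-US,en;q=0.8',
}

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update(headers)

req = session.get('http://pythonscraping.com')
print(req.status_code)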

Handling Cookies

Handling cookies correctly can alleviate many scraping problems, although cookies can also be a double-edged sword. Websites that track your progression through a site using cookies might attempt to cut off scrapers that display abnormal behavior, such as completing forms too quickly or visiting too many pages. Although these behaviors can be disguised by closing and reopening connections to the site, or even changing your IP address (see Chapter 14 for more information on how to do that), if your cookie gives your identity away, your efforts at disguise might be futile.

Cookies can also be very necessary to scrape a site. As shown in Chapter 9, staying logged in on a site requires that you be able to hold and present a cookie from page to page. Some websites don’t even require that you actually log in and get a new version of a cookie every time—merely holding an old copy of a “logged in” cookie and visiting the site is enough. 

If you are scraping a single targeted website or a small number of targeted sites, I recommend examining the cookies generated by those sites and considering which ones you might want your scraper to handle. There are a number of browser plug-ins that can show you how cookies are being set as you visit and move around a site. EditThisCookie, a Chrome extension, is one of my favorites. 
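
If you would rather inspect cookies from Python than through a browser plug-in, one option is to look at the cookie jar kept by a requests session after the first page load (a minimal sketch; the site used here is just an example):

import requests

session = requests.Session()
session.get('http://pythonscraping.com')

# The session's cookie jar collects every cookie set by the responses so far
for cookie in session.cookies:
    print(cookie.name, cookie.value, cookie.domain)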

Check out the code samples in “Handling Logins and Cookies” in Chapter 9 for more information about handling cookies using the requests module. Of course, because it is unable to execute JavaScript, the requests module will be unable to handle many of the cookies produced by modern tracking software, such as Google Analytics, which are set only after the execution of client-side scripts (or, sometimes, based on page events, such as button clicks, that happen while browsing the page). In order to handle these, you need to use the Selenium and PhantomJS packages (we covered their installation and basic usage in Chapter 10). 

You can view cookies by visiting any site (http://pythonscraping.com, in this example) and calling get_cookies() on the webdriver:

from selenium import webdriver
driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
print(driver.get_cookies())

This provides the fairly typical array of Google Analytics cookies:

[{'value': '1', 'httponly': False, 'name': '_gat', 'path': '/',
  'expiry': 1422806785, 'expires': 'Sun, 01 Feb 2015 16:06:25 GMT',
  'secure': False, 'domain': '.pythonscraping.com'},
 {'value': 'GA1.2.1619525062.1422806186', 'httponly': False, 'name': '_ga',
  'path': '/', 'expiry': 1485878185, 'expires': 'Tue, 31 Jan 2017 15:56:25 GMT',
  'secure': False, 'domain': '.pythonscraping.com'},
 {'value': '1', 'httponly': False, 'name': 'has_js', 'path': '/',
  'expiry': 1485878185, 'expires': 'Tue, 31 Jan 2017 15:56:25 GMT',
  'secure': False, 'domain': 'pythonscraping.com'}]

To manipulate cookies, you can call the delete_cookie(), add_cookie(), and delete_all_cookies() functions. In addition, you can save and store cookies for use in other web scrapers. Here’s an example to give you an idea of how these functions work together:

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
print(driver.get_cookies())

savedCookies = driver.get_cookies()

driver2 = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver2.get("http://pythonscraping.com")
driver2.delete_all_cookies()
for cookie in savedCookies:
    driver2.add_cookie(cookie)

driver2.get("http://pythonscraping.com")
driver2.implicitly_wait(1)
print(driver2.get_cookies())

In this example, the first webdriver retrieves a website, prints the cookies, and then stores them in the variable savedCookies. The second webdriver loads the same website (technical note: it must load the website first so that Selenium knows which website the cookies belong to, even if the act of loading the website does nothing useful for us), deletes all of its cookies and replaces them with the saved cookies from the first webdriver. When loading the page again, the timestamps, codes, and other information in both cookies should be identical. According to Google Analytics, this second webdriver is now identical to the first one. 
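
Because get_cookies() returns an ordinary list of dictionaries, you can also persist cookies to disk and reload them in a later run. This is only a sketch of one possible approach, using the json module and the same PhantomJS setup as above (some cookie attributes may need cleaning up before add_cookie() will accept them):

import json
from selenium import webdriver

# First run: save the cookies to a file
driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
with open('cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)

# Later run: load the site, clear its cookies, and restore the saved ones
driver2 = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver2.get("http://pythonscraping.com")
driver2.delete_all_cookies()
with open('cookies.json') as f:
    for cookie in json.load(f):
        driver2.add_cookie(cookie)
driver2.get("http://pythonscraping.com")
print(driver2.get_cookies())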

Timing Is Everything

Some well-protected websites might prevent you from submitting forms or interacting with the site if you do it too quickly. Even if these security features aren’t in place, downloading lots of information from a website significantly faster than a normal human might is a good way to get yourself noticed, and blocked. 

Therefore, although multithreaded programming might be a great way to load pages faster—allowing you to process data in one thread while repeatedly loading pages in another—it’s a terrible policy for writing good scrapers. You should always try to keep individual page loads and data requests to a minimum. If possible, try to space them out by a few seconds, even if you have to add in an extra:

time.sleep(3)
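
A slightly more human-looking variation is to randomize the pause between page loads rather than waiting exactly the same interval every time. A minimal sketch, assuming a list of URLs you already intend to visit (the URLs shown are placeholders):

import random
import time

import requests

urls = ['http://pythonscraping.com/pages/page1.html',
        'http://pythonscraping.com/pages/page2.html']  # placeholder URLs

session = requests.Session()
for url in urls:
    req = session.get(url)
    # ...process req.text here...
    # Pause a random three to eight seconds before the next request
    time.sleep(random.uniform(3, 8))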

Although web scraping often involves rule breaking and boundary pushing in order to get data, this is one rule you don’t want to break. Not only does consuming inordinate server resources put you in a legally vulnerable position, but you might cripple or take down smaller sites. Bringing down websites is not an ethically ambiguous situation: it’s just wrong. So watch your speed!

Common Form Security Features

There are many litmus tests that have been used over the years, and continue to be used, with varying degrees of success, to separate web scrapers from browser-using humans. Although it’s not a big deal if a bot downloads some articles and blog posts that were available to the public anyway, it is a big problem if a bot creates thousands of user accounts and starts spamming all of your site’s members. Web forms, especially forms that deal with account creation and logins, pose a significant security risk and computational burden if they’re vulnerable to indiscriminate use by bots, so it’s in the best interest of many site owners (or at least they think it is) to try to limit access to the site.

These anti-bot security measures centered on forms and logins can pose a significant challenge to web scrapers. 

Keep in mind that this is only a partial overview of some of the security measures you might encounter when creating automated bots for these forms. Review Chapter 11, on dealing with CAPTCHAs and image processing, as well as Chapter 14, on dealing with headers and IP addresses, for more information on dealing with well-protected forms.

Hidden Input Field Values

“Hidden” fields in HTML forms allow the value contained in the field to be read by the browser but remain invisible to the user (unless the user looks at the site’s source code). With the increased use of cookies to store variables and pass them around on websites, hidden fields fell out of favor for a while before another excellent purpose was discovered for them: preventing scrapers from submitting forms.

Figure 12-1 shows an example of these hidden fields at work on a Facebook login page. Although there are only three visible fields in the form (username, password, and a submit button), the form conveys a great deal of information to the server behind the scenes.

Figure 12-1. The Facebook login form has quite a few hidden fields

There are two main ways hidden fields are used to prevent web scraping. The first is that a field can be populated with a randomly generated variable on the form page, which the server expects to be posted back to the form-processing page. If this value is not present in the submission, the server can reasonably assume that it did not originate organically from the form page but was posted by a bot directly to the processing page. The best way to get around this measure is to scrape the form page first, collect the randomly generated variable, and then post it to the processing page along with the rest of the form data.
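
A rough sketch of that approach, using requests and BeautifulSoup, might look like the following (the URLs and visible field names are hypothetical; substitute whatever you find on the real form page):

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Hypothetical form and processing URLs -- replace with the real ones
formUrl = 'http://example.com/login'
processingUrl = 'http://example.com/login/process'

# Load the form page first so the server generates its hidden token(s)
formPage = session.get(formUrl)
bsObj = BeautifulSoup(formPage.text, 'html.parser')

# Collect every hidden field along with its prepopulated value
params = {field['name']: field.get('value', '')
          for field in bsObj.find_all('input', {'type': 'hidden'})}

# Add the visible fields and submit everything to the processing page
params.update({'username': 'myUsername', 'password': 'myPassword'})
req = session.post(processingUrl, data=params)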

The second method is a “honeypot” of sorts. If a form contains a hidden field with an innocuous name, such as “username” or “email address,” a poorly written bot might fill out the field and attempt to submit it, regardless of whether it is hidden from the user. When the server receives a submission in which these hidden fields contain values (or values different from their defaults on the form page), it can disregard the submission and may even block the user from the site.

In short: it is sometimes necessary to check the page that the form is on to see whether you missed anything the server might be expecting. If you see several hidden fields, often with large, randomly generated string values, it is likely that the web server will be checking for their presence on form submission. In addition, there might be other checks to ensure that the form variables have been used only once, were recently generated (which eliminates the possibility of simply storing them in a script and reusing them over time), or both.

Avoiding Honeypots

Although CSS for the most part makes life extremely easy when it comes to differentiating useful information from nonuseful information (e.g., by reading the id and class tags), it can occasionally be problematic for web scrapers. If a field on a web form is hidden from a user via CSS, it is reasonable to assume that the average user visiting the site will not be able to fill it out because it doesn’t show up in the browser. If the form is populated, there is likely a bot at work and the post will be discarded.

This applies not only to forms but to links, images, files, and any other item on the site that can be read by a bot but is hidden from the average user visiting the site through a browser. A page visit to a “hidden” link on a site can easily trigger a server-side script that will block the user’s IP address, log that user out of the site, or take some other action to prevent further access. In fact, many business models have been based on exactly this concept.

Take, for example, the page located at http://bit.ly/1GP4sbz (http://pythonscraping.com/pages/itsatrap.html). This page contains two links, one hidden by CSS and the other visible. In addition, it contains a form with two hidden fields:

<html>
<head> 
    <title>A bot-proof form</title>
</head>
<style>
    body { 
        overflow-x:hidden;
    }
    .customHidden { 
        position:absolute; 
        right:50000px;
    }
</style>
<body> 
    <h2>A bot-proof form</h2>
    <a href=
     "http://pythonscraping.com/dontgohere" style="display:none;">Go here!</a>
    <a href="http://pythonscraping.com">Click me!</a>
    <form>
        <input type="hidden" name="phone" value="valueShouldNotBeModified"/><p/>
        <input type="text" name="email" class="customHidden" 
                  value="intentionallyBlank"/><p/>
        <input type="text" name="firstName"/><p/>
        <input type="text" name="lastName"/><p/>
        <input type="submit" value="Submit"/><p/>
   </form>
</body>
</html>

These three elements are hidden from the user in three different ways:

  • The first link is hidden with a simple CSS display:none attribute
  • The phone field is a hidden input field
  • The email field is hidden by moving it 50,000 pixels to the right (presumably off the screen of everyone’s monitor) and hiding the telltale scroll bar

Fortunately, because Selenium actually renders the pages it visits, it is able to distinguish between elements that are visually present on the page and those that aren’t. Whether an element is visible on the rendered page can be determined with the is_displayed() function.

For example, the following code retrieves the previously described page and looks for hidden links and form input fields:

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com/pages/itsatrap.html")
links = driver.find_elements_by_tag_name("a")
for link in links:
    if not link.is_displayed():
        print("The link "+link.get_attribute("href")+" is a trap")

fields = driver.find_elements_by_tag_name("input")
for field in fields:
    if not field.is_displayed():
        print("Do not change value of "+field.get_attribute("name"))

Selenium catches each hidden link and field, producing the following output:

The link http://pythonscraping.com/dontgohere is a trap
Do not change value of phone
Do not change value of email

Although you probably don’t want to visit any hidden links you find, you will want to make sure that you submit any pre-populated hidden form values (or have Selenium submit them for you) with the rest of the form. To sum up, it is dangerous to simply ignore hidden fields, although you must be very careful when interacting with them.
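
One way to put this into practice is to walk the form with Selenium, leaving anything that isn’t displayed untouched and filling in only the visible fields before submitting (a sketch that reuses the trap page from above; the typed-in values are placeholders):

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com/pages/itsatrap.html")

for field in driver.find_elements_by_tag_name("input"):
    if not field.is_displayed():
        # Leave hidden and honeypot fields exactly as the server sent them
        print("Leaving hidden field '{}' untouched"
              .format(field.get_attribute("name")))
    elif field.get_attribute("type") == "text":
        # Placeholder values for the visible fields
        field.send_keys("placeholder")

# Submitting through the browser sends the untouched hidden values along
# with whatever was typed into the visible fields
driver.find_element_by_css_selector("input[type=submit]").click()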

The Human Checklist

There’s a lot of information in this chapter, and indeed in this book, about how to build a scraper that looks less like a scraper and more like a human. If you keep getting blocked by websites and you don’t know why, here’s a checklist you can use to remedy the problem:

  • First, if the page you are receiving from the web server is blank, missing information, or is otherwise not what you expect (or have seen in your own browser), it is likely caused by JavaScript being executed on the site to create the page. Review Chapter 10.
  • If you are submitting a form or making a POST request to a website, check the page to make sure that everything the website is expecting you to submit is being submitted and in the correct format. Use a tool such as Chrome’s Network inspector to view an actual POST command sent to the site to make sure you’ve got everything.
  • If you are trying to log into a site and can’t make the login “stick,” or the website is experiencing other strange “state” behavior, check your cookies. Make sure that cookies are being persisted correctly between each page load and that your cookies are sent to the site for every request.
  • If you are getting HTTP errors, especially 403 Forbidden errors, it might indicate that the website has identified your IP address as a bot and is unwilling to accept any more requests. You will need to either wait until your IP address is removed from the list, or obtain a new IP address (either move to a Starbucks or see Chapter 14). To make sure you don’t get blocked again, try the following:

    • Make sure that your scrapers aren’t moving through the site too quickly. Fast scraping is a bad practice that places a heavy burden on the web administrator’s servers, can land you in legal trouble, and is the number-one cause of scrapers getting blacklisted. Add delays to your scrapers and let them run overnight. Remember: Being in a rush to write programs or gather data is a sign of bad project management; plan ahead to avoid messes like this in the first place.
    • The obvious one: change your headers! Some sites will block anything that advertises itself as a scraper. Copy your own browser’s headers if you’re unsure about what some reasonable header values are.
    • Make sure you’re not clicking on or accessing anything that a human normally would not be able to (refer back to “Avoiding Honeypots” for more information).
    • If you find yourself jumping through a lot of difficult hoops to gain access, consider contacting the website administrator to let them know what you’re doing. Try emailing webmaster@<domain name> or admin@<domain name> for permission to use your scrapers. Admins are people, too!