Optical Character Recognition

Optical Character Recognition (OCR) is a process to extract text from images. In this section, we will use the open source Tesseract OCR engine, which was originally developed at HP and now primarily at Google. Installation instructions for Tesseract are available at https://code.google.com/p/tesseract-ocr/wiki/ReadMe. Then, the pytesseract Python wrapper can be installed with pip:

pip install pytesseract

If the original CAPTCHA image is passed to pytesseract, the results are terrible:

>>> import pytesseract
>>> img = get_captcha(html)
>>> pytesseract.image_to_string(img)
''

An empty string was returned, which means Tesseract failed to extract any characters from the input image. Tesseract was designed to extract more typical types of text, such as book pages with a consistent background. If we want to use Tesseract effectively, we will need to first modify the CAPTCHA images to remove the background noise and isolate the text. To better understand the CAPTCHA system we are dealing with, here are some more samples:

Optical Character Recognition

The samples in the preceding screenshot show that the CAPTCHA text is always black while the background is lighter, so this text can be isolated by checking each pixel and only keeping the black ones, a process known as thresholding. This process is straightforward to achieve with Pillow:

>>> img.save('captcha_original.png')
>>> gray = img.convert('L')
>>> gray.save('captcha_gray.png')
>>> bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
>>> bw.save('captcha_thresholded.png')

Here, a threshold of less than 1 was used, which means only keeping pixels that are completely black. This snippet saved three images—the original CAPTCHA image, when converted to grayscale, and after thresholding. Here are the images saved at each stage:

Optical Character Recognition

The text in the final thresholded image is much clearer and is ready to be passed to Tesseract:

>>> pytesseract.image_to_string(bw)
'strange'

Success! The CAPTCHA text has been successfully extracted. In my test of 100 images, this approach correctly interpreted the CAPTCHA image 84 times. Since the sample text is always lowercase ASCII, the performance can be improved further by restricting the result to these characters:

>>> import string
>>> word = pytesseract.image_to_string(bw)
>>> ascii_word = ''.join(c for c in word if c in string.letters).lower()

In my test on the same sample images, this improved performance to 88 times out of 100. Here is the full code of the registration script so far:

import cookielib
import urllib
import urllib2
import string
import pytesseract
REGISTER_URL = 'http://example.webscraping.com/user/register'

def register(first_name, last_name, email, password):
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(REGISTER_URL).read()
    form = parse_form(html)
    form['first_name'] = first_name
    form['last_name'] = last_name
    form['email'] = email
    form['password'] = form['password_two'] = password
    img = extract_image(html)
    captcha = ocr(img)
    form['recaptcha_response_field'] = captcha
    encoded_data = urllib.urlencode(form)
    request = urllib2.Request(REGISTER_URL, encoded_data)
    response = opener.open(request)
    success = '/user/register' not in response.geturl()
    return success


def ocr(img):
    gray = img.convert('L')
    bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
    word = pytesseract.image_to_string(bw)
    ascii_word = ''.join(c for c in word if c in string.letters).lower()
    return ascii_word

The register() function downloads the registration page and scrapes the form as usual, where the desired name, e-mail, and password for the new account are set. The CAPTCHA image is then extracted, passed to the OCR function, and the result is added to the form. This form data is then submitted and the response URL is checked to see if registration was successful. If it fails, this would still be the registration page, either because the CAPTCHA image was solved incorrectly or an account with this e-mail already existed. Now, to register an account, we simply need to call the register() function with the new account details:

>>> register(first_name, last_name, email, password)
True

Further improvements

To improve the CAPTCHA OCR performance further, there are a number of possibilities, as follows:

  • Experimenting with different threshold levels
  • Eroding the thresholded text to emphasize the shape of characters
  • Resizing the image—sometimes increasing the size helps
  • Training the OCR tool on the CAPCHA font
  • Restricting results to dictionary words

If you are interested in experimenting to improve performance, the sample data used is available at https://bitbucket.org/wswp/code/src/tip/chapter07/samples/. However the current 88 percent accuracy is sufficient for our purposes of registering an account, because actual users will also make mistakes when entering the CAPTCHA text. Even 1 percent accuracy would be sufficient, because the script could be run many times until successful, though this would be rather impolite to the server and may lead to your IP being blocked.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.240.249