Disk cache

To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:

Operating system    File system    Invalid filename characters        Maximum filename length
Linux               Ext3/Ext4      / and \0 (the null byte)           255 bytes
OS X                HFS Plus       : and \0 (the null byte)           255 UTF-16 code units
Windows             NTFS           \, /, ?, :, *, ", >, <, and |      255 characters

To keep our file path safe across these filesystems, we will restrict it to numbers, letters, and basic punctuation, and replace all other characters with an underscore, as shown in the following code:

>>> import re
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
>>> re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', url)
'http_//example.webscraping.com/default/view/Australia-1'

Additionally, the filename and its parent directories need to be restricted to 255 characters (as shown in the following code, which assumes the sanitized URL from the previous step is stored in filename) to meet the length limitations described in the preceding table:

>>> filename = '/'.join(segment[:255] for segment in filename.split('/'))

There is also an edge case that needs to be considered, where the URL path ends with a slash (/), and the empty string after this slash would be an invalid filename. However, removing this slash to use the parent for the filename would prevent saving other URLs. Consider the following URLs:

  • http://example.webscraping.com/index/
  • http://example.webscraping.com/index/1

If you need to save both of these, then index needs to be a directory so that the child page 1 can be saved inside it. The solution our disk cache will use is appending index.html to the filename when the URL path ends with a slash. The same applies when the URL path is empty. To parse the URL, we will use the urlparse.urlsplit() function, which splits a URL into its components:

>>> import urlparse
>>> components = urlparse.urlsplit('http://example.webscraping.com/index/')
>>> print components
SplitResult(scheme='http', netloc='example.webscraping.com', path='/index/', query='', fragment='')
>>> print components.path
/index/

This function provides a convenient interface to parse and manipulate URLs. Here is an example using this module to append index.html for this edge case:

>>> path = components.path
>>> if not path:
...     path = '/index.html'
... elif path.endswith('/'):
...     path += 'index.html'
...
>>> filename = components.netloc + path + components.query
>>> filename
'example.webscraping.com/index/index.html'

Implementation

In the preceding section, we covered the limitations of the filesystem that need to be considered when building a disk-based cache, namely the restriction on which characters can be used, the maximum filename length, and making sure a file and a directory are not created in the same location. This logic for mapping a URL to a filename forms the main part of the disk cache. Here is an initial implementation of the DiskCache class:

import os
import re
import urlparse

class DiskCache:
    def __init__(self, cache_dir='cache', max_length=255):
        self.cache_dir = cache_dir
        self.max_length = max_length

    def url_to_path(self, url):
        """Create file system path for this URL
        """
        components = urlparse.urlsplit(url)
        # append index.html to empty paths
        path = components.path
        if not path:
            path = '/index.html'
        elif path.endswith('/'):
            path += 'index.html'
        filename = components.netloc + path + components.query
        # replace invalid characters
        filename = re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', filename)
        # restrict the maximum number of characters per path segment
        filename = '/'.join(segment[:self.max_length] for segment in
            filename.split('/'))
        return os.path.join(self.cache_dir, filename)
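
Here is a quick interactive check of the URL-to-path mapping (the output shown assumes the default cache directory and a Linux or OS X path separator):

>>> cache = DiskCache()
>>> cache.url_to_path('http://example.webscraping.com/view/Australia-1')
'cache/example.webscraping.com/view/Australia-1'
>>> cache.url_to_path('http://example.webscraping.com/index/')
'cache/example.webscraping.com/index/index.html'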

The DiskCache constructor takes parameters to set the location of the cache and the maximum length of each path segment, and the url_to_path method applies the filename restrictions that have been discussed so far. Now we just need methods to load and save the data with this filename. Here is an implementation of these missing methods:

import pickle
class DiskCache:
    ...
    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                return pickle.load(fp)
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        with open(path, 'wb') as fp:
            fp.write(pickle.dumps(result))

In __setitem__(), the URL is mapped to a safe filename using url_to_path(), and then the parent directory is created if necessary. The pickle module is used to convert the input to a string, which is then saved to disk. Also, in __getitem__(), the URL is mapped to a safe filename. Then, if the filename exists, the content is loaded and unpickled to restore the original data type. If the filename does not exist, that is, there is no data in the cache for this URL, a KeyError exception is raised.
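
Before plugging the cache into the crawler, a quick interactive sanity check might look like the following (the URLs and data here are purely illustrative):

>>> cache = DiskCache()
>>> url = 'http://example.webscraping.com'
>>> cache[url] = {'html': '<html>...</html>'}
>>> cache[url]
{'html': '<html>...</html>'}
>>> cache['http://example.webscraping.com/uncached']
Traceback (most recent call last):
...
KeyError: 'http://example.webscraping.com/uncached does not exist'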

Testing the cache

Now we are ready to try DiskCache with our crawler by passing it to the cache callback. The source code for this class is available at https://bitbucket.org/wswp/code/src/tip/chapter03/disk_cache.py and the cache can be tested with the link crawler by running this script:

$ time python disk_cache.py
Downloading: http://example.webscraping.com
Downloading: http://example.webscraping.com/view/Afghanistan-1
...
Downloading: http://example.webscraping.com/view/Zimbabwe-252
23m38.289s

The first time this command is run, the cache is empty so that all the web pages are downloaded normally. However, when we run this script a second time, the pages will be loaded from the cache so that the crawl should be completed more quickly, as shown here:

$ time python disk_cache.py
0m0.186s

As expected, this time the crawl completed much faster. With an empty cache, the crawl took over 23 minutes on my computer, while the second run, with a full cache, took just 0.186 seconds (over 7,000 times faster!). The exact time on your computer will differ, depending on your hardware. However, the disk cache will undoubtedly be faster.
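
For reference, here is a minimal, hypothetical sketch of how a crawler's download step can consult a cache such as DiskCache before touching the network. The helper name cached_download() is an assumption for illustration, and the download argument stands in for whichever download function your crawler uses; the actual integration is in the chapter's crawler code linked earlier.

def cached_download(url, cache, download):
    """Return the cached result for url, calling download(url) only on a cache miss
    """
    try:
        # DiskCache raises KeyError when the URL is missing (or, later, expired)
        return cache[url]
    except KeyError:
        result = download(url)
        cache[url] = result
        return result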

Saving disk space

To minimize the amount of disk space required for our cache, we can compress the downloaded HTML file. This is straightforward to implement by compressing the pickled string with zlib before saving to disk, as follows:

fp.write(zlib.compress(pickle.dumps(result)))

Then, decompress the data loaded from the disk, as follows:

return pickle.loads(zlib.decompress(fp.read()))

With this addition of compressing each web page, the cache is reduced from 4.4 MB to 2.3 MB and takes 0.212 seconds to crawl the cached example website on my computer. This is marginally longer than 0.186 seconds with the uncompressed cache. So, if speed is important for your project, you may want to disable compression.
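
One hypothetical way to make this trade-off configurable is a constructor flag; the compress parameter below is an assumption for illustration and is not part of the class as published:

import zlib

class DiskCache:
    def __init__(self, cache_dir='cache', max_length=255, compress=True):
        self.cache_dir = cache_dir
        self.max_length = max_length
        # hypothetical flag: set to False to trade disk space for a little speed
        self.compress = compress

    def __setitem__(self, url, result):
        """Save data to disk for this url, compressing only when enabled
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        data = pickle.dumps(result)
        if self.compress:
            data = zlib.compress(data)
        with open(path, 'wb') as fp:
            fp.write(data)

__getitem__() would then call zlib.decompress() under the same flag before unpickling.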

Expiring stale data

Our current version of the disk cache will save a value to disk for a key, and then return it whenever this key is requested in future. This functionality may not be ideal when caching web pages because online content changes, so the data in our cache would become out of date. In this section, we will add an expiration time to our cached data so that the crawler knows when to redownload a web page. Storing the timestamp of when each web page was cached is straightforward to support. Here is an implementation:

from datetime import datetime, timedelta

class DiskCache:
    def __init__(self, ..., expires=timedelta(days=30)):
        ...
        self.expires = expires

    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        ...
            with open(path, 'rb') as fp:
                result, timestamp = pickle.loads(zlib.decompress(fp.read()))
                if self.has_expired(timestamp):
                    raise KeyError(url + ' has expired')
                return result
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')


    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        ...
        timestamp = datetime.utcnow()
        data = pickle.dumps((result, timestamp))
        with open(path, 'wb') as fp:
            fp.write(zlib.compress(data))

    def has_expired(self, timestamp):
        """Return whether this timestamp has expired
        """
        return datetime.utcnow() > timestamp + self.expires

In the constructor, the default expiration time is set to 30 days with a timedelta object. Then, the __setitem__() method saves the current timestamp in the pickled data, and the __getitem__() method compares this to the expiration time. To test this expiration, we can try a short timeout of 5 seconds, as shown here:

>>> cache = DiskCache(expires=timedelta(seconds=5))
>>> url = 'http://example.webscraping.com'
>>> result = {'html': '...'}
>>> cache[url] = result
>>> cache[url]
{'html': '...'}
>>> import time; time.sleep(5)
>>> cache[url]
Traceback (most recent call last):
...
KeyError: 'http://example.webscraping.com has expired'

As expected, the cached result is initially available, and then, after sleeping for 5 seconds, calling the same key raises a KeyError to show this cached download has expired.

Drawbacks

Our disk-based caching system was relatively simple to implement, does not depend on installing additional modules, and the results are viewable in our file manager. However, it has the drawback of depending on the limitations of the local filesystem. Earlier in this chapter, we applied various restrictions to map the URL to a safe filename, but an unfortunate consequence of this is that some URLs will map to the same filename. For example, replacing unsupported characters in the following URLs would map them all to the same filename, as the quick check after this list demonstrates:

  • http://example.com/?a+b
  • http://example.com/?a*b
  • http://example.com/?a=b
  • http://example.com/?a!b
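
Applying the character substitution from earlier to these URLs confirms the collision (url_to_path() would also append the query string, but the result is equally ambiguous):

>>> import re
>>> urls = ['http://example.com/?a+b', 'http://example.com/?a*b',
...         'http://example.com/?a=b', 'http://example.com/?a!b']
>>> set(re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', url) for url in urls)
set(['http_//example.com/_a_b'])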

This means that if one of these URLs were cached, it would look as though the other three URLs were cached too, because they map to the same filename. Alternatively, if some long URLs only differed after the 255th character, the truncated versions would also map to the same filename. This is a particularly important problem because there is no defined limit on the maximum length of a URL. In practice, though, URLs longer than 2,000 characters are rare, and older versions of Internet Explorer did not support URLs longer than 2,083 characters.

A potential solution to avoid these limitations is to take the hash of the URL and use it as the filename. This may be an improvement; however, we would then eventually face a larger problem that many filesystems have, that is, a limit on the number of files allowed per volume and per directory. If this cache is used on a FAT32 filesystem, the maximum number of files allowed per directory is just 65,535. This limitation could be avoided by splitting the cache across multiple directories; however, filesystems can also limit the total number of files. My current ext4 partition supports a little over 15 million files, whereas a large website may have in excess of 100 million web pages. Unfortunately, the DiskCache approach has too many limitations to be of general use. What we need instead is to combine multiple cached web pages into a single file and index them with a B+ tree or similar data structure. Instead of implementing our own, we will use an existing database in the next section.
