To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:
Operating system | File system | Invalid filename characters | Maximum filename length
---|---|---|---
Linux | Ext3/Ext4 | / and \0 | 255 bytes
OS X | HFS Plus | : and \0 | 255 UTF-16 code units
Windows | NTFS | \, /, ?, :, *, ", >, <, and | | 255 characters
To keep our file path safe across these filesystems, we will restrict it to numbers, letters, and basic punctuation, and replace all other characters with an underscore, as shown in the following code:
>>> import re
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
>>> re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', url)
'http_//example.webscraping.com/default/view/Australia-1'
Additionally, the filename and the parent directories need to be restricted to 255 characters (as shown in the following code) to meet the length limitations described in the preceding table:
>>> filename = '/'.join(segment[:255] for segment in filename.split('/'))
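This truncation is easy to sanity-check on an over-long path; the domain and segment below are made up for illustration:

```python
# Truncate each path segment to 255 characters so every segment
# satisfies the per-segment filename limits from the table above.
filename = 'example.webscraping.com/' + 'a' * 300 + '/index.html'
truncated = '/'.join(segment[:255] for segment in filename.split('/'))
# every segment now fits within the 255-character limit
assert all(len(segment) <= 255 for segment in truncated.split('/'))
```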
There is also an edge case that needs to be considered, where the URL path ends with a slash (/), and the empty string after this slash would be an invalid filename. However, removing this slash to use the parent for the filename would prevent saving other URLs. Consider the following URLs:

http://example.webscraping.com/index/
http://example.webscraping.com/index/1
If you need to save both of these, then index needs to be a directory so that the child page 1 can be saved inside it. The solution our disk cache will use is appending index.html to the filename when the URL path ends with a slash. The same applies when the URL path is empty. To parse the URL, we will use the urlparse.urlsplit() function, which splits a URL into its components:
>>> import urlparse
>>> components = urlparse.urlsplit('http://example.webscraping.com/index/')
>>> print components
SplitResult(scheme='http', netloc='example.webscraping.com', path='/index/', query='', fragment='')
>>> print components.path
/index/
This function provides a convenient interface to parse and manipulate URLs. Here is an example using this module to append index.html
for this edge case:
>>> path = components.path
>>> if not path:
...     path = '/index.html'
... elif path.endswith('/'):
...     path += 'index.html'
...
>>> filename = components.netloc + path + components.query
>>> filename
'example.webscraping.com/index/index.html'
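Note that this chapter targets Python 2; in Python 3, the urlparse module was merged into urllib.parse, so the same logic looks like the following sketch (not the book's code, but equivalent):

```python
from urllib.parse import urlsplit

# Python 3 equivalent of the urlparse.urlsplit() logic above
components = urlsplit('http://example.webscraping.com/index/')
path = components.path
if not path:
    path = '/index.html'          # empty path: use index.html directly
elif path.endswith('/'):
    path += 'index.html'          # trailing slash: append index.html
filename = components.netloc + path + components.query
```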
In the preceding section, we covered the limitations of the filesystem that need to be considered when building a disk-based cache, namely the restriction on which characters can be used, the filename length, and making sure a file and directory are not created in the same location. Together, using this logic to map a URL to a filename will form the main part of the disk cache. Here is an initial implementation of the DiskCache class:
import os
import re
import urlparse

class DiskCache:
    def __init__(self, cache_dir='cache', max_length=255):
        self.cache_dir = cache_dir
        self.max_length = max_length

    def url_to_path(self, url):
        """Create file system path for this URL
        """
        components = urlparse.urlsplit(url)
        # append index.html to empty paths
        path = components.path
        if not path:
            path = '/index.html'
        elif path.endswith('/'):
            path += 'index.html'
        filename = components.netloc + path + components.query
        # replace invalid characters
        filename = re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', filename)
        # restrict maximum number of characters
        filename = '/'.join(segment[:self.max_length] for segment in filename.split('/'))
        return os.path.join(self.cache_dir, filename)
The class constructor shown in the preceding code takes a parameter to set the location of the cache, and then the url_to_path
method applies the filename restrictions that have been discussed so far. Now we just need methods to load and save the data with this filename. Here is an implementation of these missing methods:
import pickle

class DiskCache:
    ...
    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                return pickle.load(fp)
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        with open(path, 'wb') as fp:
            fp.write(pickle.dumps(result))
In __setitem__()
, the URL is mapped to a safe filename using url_to_path()
, and then the parent directory is created if necessary. The pickle
module is used to convert the input to a string, which is then saved to disk. Also, in __getitem__()
, the URL is mapped to a safe filename. Then, if the filename exists, the content is loaded and unpickled to restore the original data type. If the filename does not exist, that is, there is no data in the cache for this URL, a KeyError
exception is raised.
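The core of these two methods is a pickle round-trip through the filesystem, which can be sketched in isolation (using a temporary directory rather than the real cache location):

```python
import os
import pickle
import tempfile

# Mimic __setitem__: create the parent directory and pickle the result to disk
result = {'html': '<html>...</html>'}
path = os.path.join(tempfile.mkdtemp(), 'example.webscraping.com', 'index.html')
os.makedirs(os.path.dirname(path))
with open(path, 'wb') as fp:
    fp.write(pickle.dumps(result))

# Mimic __getitem__: load and unpickle to restore the original data type
with open(path, 'rb') as fp:
    loaded = pickle.load(fp)
```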
Now we are ready to try DiskCache
with our crawler by passing it to the cache
callback. The source code for this class is available at https://bitbucket.org/wswp/code/src/tip/chapter03/disk_cache.py and the cache can be tested with the link crawler by running this script:
$ time python disk_cache.py
Downloading: http://example.webscraping.com
Downloading: http://example.webscraping.com/view/Afghanistan-1
...
Downloading: http://example.webscraping.com/view/Zimbabwe-252
23m38.289s
The first time this command is run, the cache is empty so that all the web pages are downloaded normally. However, when we run this script a second time, the pages will be loaded from the cache so that the crawl should be completed more quickly, as shown here:
$ time python disk_cache.py
0m0.186s
As expected, this time the crawl completed much faster. With an empty cache on my computer, the crawl took over 23 minutes, while the second run with a full cache took just 0.186 seconds (over 7,000 times faster!). The exact time on your computer will differ, depending on your hardware. However, the disk cache will undoubtedly be faster than downloading.
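The way a crawler consults the cache before downloading can be sketched with a simple wrapper; the class and parameter names here are hypothetical, not the book's actual link crawler interface:

```python
class CachedDownloader:
    """Hypothetical wrapper: serve from a dict-like cache, download on a miss."""
    def __init__(self, cache, download):
        self.cache = cache        # any mapping, e.g. a DiskCache instance
        self.download = download  # function taking a URL, returning a result

    def __call__(self, url):
        try:
            return self.cache[url]        # cache hit: no network access
        except KeyError:
            result = self.download(url)   # cache miss: download the page
            self.cache[url] = result      # store for subsequent crawls
            return result

# Usage with a plain dict standing in for DiskCache, counting real downloads
calls = []
downloader = CachedDownloader({}, lambda url: calls.append(url) or {'html': url})
downloader('http://example.webscraping.com')
downloader('http://example.webscraping.com')  # second call served from cache
```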
To minimize the amount of disk space required for our cache, we can compress the downloaded HTML before writing it to disk. This is straightforward to implement: import the zlib module and compress the pickled string before saving, as follows:
fp.write(zlib.compress(pickle.dumps(result)))
Then, decompress the data loaded from the disk, as follows:
return pickle.loads(zlib.decompress(fp.read()))
With this addition of compressing each web page, the cache is reduced from 4.4 MB to 2.3 MB and takes 0.212 seconds to crawl the cached example website on my computer. This is marginally longer than 0.186 seconds with the uncompressed cache. So, if speed is important for your project, you may want to disable compression.
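The space saving depends on how compressible the content is; HTML's repetitive markup typically compresses well, as this quick check (with made-up page content) illustrates:

```python
import pickle
import zlib

# Repetitive HTML, typical of template-generated pages
result = {'html': '<html><body>' + '<tr><td>row</td></tr>' * 500 + '</body></html>'}
raw = pickle.dumps(result)
compressed = zlib.compress(raw)
# the compressed pickle is smaller than the raw one ...
assert len(compressed) < len(raw)
# ... and decompressing restores the original object exactly
assert pickle.loads(zlib.decompress(compressed)) == result
```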
Our current version of the disk cache will save a value to disk for a key, and then return it whenever this key is requested in future. This may not be ideal when caching web pages, because online content changes, so the data in our cache would become out of date. In this section, we will add an expiration time to our cached data so that the crawler knows when to redownload a web page. Supporting this is straightforward: we store the timestamp of when each web page was cached alongside the pickled result. Here is an implementation:
from datetime import datetime, timedelta

class DiskCache:
    def __init__(self, ..., expires=timedelta(days=30)):
        ...
        self.expires = expires

    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                result, timestamp = pickle.loads(zlib.decompress(fp.read()))
                if self.has_expired(timestamp):
                    raise KeyError(url + ' has expired')
                return result
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        ...
        timestamp = datetime.utcnow()
        data = pickle.dumps((result, timestamp))
        with open(path, 'wb') as fp:
            fp.write(zlib.compress(data))

    def has_expired(self, timestamp):
        """Return whether this timestamp has expired
        """
        return datetime.utcnow() > timestamp + self.expires
In the constructor, the default expiration time is set to 30 days with a timedelta object. Then, the __setitem__ method saves the current timestamp in the pickled data, and the __getitem__ method compares this to the expiration time. To test this expiration, we can try a short timeout of 5 seconds, as shown here:
>>> cache = DiskCache(expires=timedelta(seconds=5))
>>> url = 'http://example.webscraping.com'
>>> result = {'html': '...'}
>>> cache[url] = result
>>> cache[url]
{'html': '...'}
>>> import time; time.sleep(5)
>>> cache[url]
Traceback (most recent call last):
...
KeyError: 'http://example.webscraping.com has expired'
As expected, the cached result is initially available, and then, after sleeping for 5 seconds, calling the same key raises a KeyError
to show this cached download has expired.
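The expiry comparison itself is simple enough to check in isolation; this standalone sketch mirrors the has_expired method:

```python
from datetime import datetime, timedelta

def has_expired(timestamp, expires=timedelta(days=30)):
    """Return whether this timestamp is older than the expiry window."""
    return datetime.utcnow() > timestamp + expires

# A timestamp from 31 days ago has expired; a fresh one has not
old = datetime.utcnow() - timedelta(days=31)
fresh = datetime.utcnow()
```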
Our disk-based caching system was relatively simple to implement, does not depend on installing additional modules, and the results are viewable in our file manager. However, it has the drawback of depending on the limitations of the local filesystem. Earlier in this chapter, we applied various restrictions to map the URL to a safe filename, but an unfortunate consequence of this is that some URLs will map to the same filename. For example, replacing unsupported characters in the following URLs would map them all to the same filename:
http://example.com/?a+b
http://example.com/?a*b
http://example.com/?a=b
http://example.com/?a!b
This means that if one of these URLs were cached, it would look like the other three URLs were cached too, because they map to the same filename. Alternatively, if some long URLs only differed after the 255th character, the truncated versions would also map to the same filename. This is a particularly important problem because there is no defined limit on the maximum length of a URL. In practice, though, URLs over 2,000 characters are rare, and older versions of Internet Explorer did not support URLs over 2,083 characters.
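This collision is easy to reproduce with the sanitizing regular expression used earlier:

```python
import re

urls = ['http://example.com/?a+b', 'http://example.com/?a*b',
        'http://example.com/?a=b', 'http://example.com/?a!b']
# apply the same character replacement used by url_to_path()
filenames = {re.sub('[^/0-9a-zA-Z-.,;_ ]', '_', url) for url in urls}
# all four distinct URLs collapse to a single filename
assert len(filenames) == 1
```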
A potential solution to avoid these limitations is taking the hash of the URL and using it as the filename. This would be an improvement; however, we would then eventually face a larger problem that many filesystems have: a limit on the number of files allowed per volume and per directory. If this cache is used on a FAT32 filesystem, the maximum number of files allowed per directory is just 65,535. This limitation could be avoided by splitting the cache across multiple directories; however, filesystems can also limit the total number of files. My current ext4 partition supports a little over 15 million files, whereas a large website may have in excess of 100 million web pages. Unfortunately, the DiskCache approach has too many limitations to be of general use. What we need instead is to combine the multiple cached web pages into a single file and index them with a B+ tree or similar. Instead of implementing our own, we will use an existing database in the next section.
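For reference, the hashing idea mentioned above could be sketched as follows; the function name is ours, and MD5 is chosen only for its short fixed-length digest, not for security:

```python
import hashlib

def url_to_hashed_filename(url):
    """Hypothetical helper: map any URL to a fixed-length, filesystem-safe name."""
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()  # 32 hex characters
    return digest + '.html'

# Arbitrarily long URLs, and the previously colliding URLs, now get distinct names
name_a = url_to_hashed_filename('http://example.com/?a+b')
name_b = url_to_hashed_filename('http://example.com/?a*b')
```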