Database cache

To avoid the anticipated limitations of our disk-based cache, we will now build our cache on top of an existing database system. When crawling, we may need to cache massive amounts of data and will not need any complex joins, so we will use a NoSQL database, which is easier to scale than a traditional relational database. Specifically, our cache will use MongoDB, which is currently the most popular NoSQL database.

What is NoSQL?

NoSQL stands for Not Only SQL and is a relatively new approach to database design. The traditional relational model uses a fixed schema and splits the data into tables. However, with large datasets, the data becomes too big for a single server and needs to be scaled across multiple servers. This does not fit well with the relational model because, when querying multiple tables, the data will not necessarily be available on the same server. NoSQL databases, on the other hand, are generally schemaless and designed from the start to shard seamlessly across servers. Multiple approaches to achieving this fit under the NoSQL umbrella: there are column data stores, such as HBase; key-value stores, such as Redis; document-oriented databases, such as MongoDB; and graph databases, such as Neo4j.

Installing MongoDB

MongoDB can be downloaded from https://www.mongodb.org/downloads. Then, the Python wrapper needs to be installed separately using this command:

pip install pymongo

To test whether the installation is working, start MongoDB locally using this command:

$ mongod --dbpath .

Then, try connecting to MongoDB from Python using the default MongoDB port:

>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
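
If the connection succeeds, the client should be able to query the server for its build information. This is just a quick sanity check; the exact contents of the returned document depend on your MongoDB version:

>>> client.server_info()  # raises an error if the server is unreachable
{u'version': u'...', ...}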

Overview of MongoDB

Here is an example of how to save some data to MongoDB and then load it:

>>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> html = '...'
>>> db = client.cache 
>>> db.webpage.insert({'url': url, 'html': html})
ObjectId('5518c0644e0c87444c12a577')
>>> db.webpage.find_one({'url': url})
{u'_id': ObjectId('5518c0644e0c87444c12a577'), 
u'html': u'...', 
u'url': u'http://example.webscraping.com/view/United-Kingdom-239'}

A problem with the preceding example is that if we now insert another document with the same URL, MongoDB will happily insert it for us, as follows:

>>> db.webpage.insert({'url': url, 'html': html})
>>> db.webpage.find({'url': url}).count()
2

Now we have multiple records for the same URL when we are only interested in storing the latest data. To prevent duplicates, we can set the ID to the URL and perform an upsert, which means updating the existing record if it exists and inserting a new one otherwise, as shown here:

>>> db.webpage.update({'_id': url}, {'$set': {'html': html}}, upsert=True)
>>> db.webpage.find_one({'_id': url})
{u'_id': u'http://example.webscraping.com/view/United-Kingdom-239', u'html': u'...'}

Now, when we try inserting a record with the same URL as shown in the following code, the content will be updated instead of creating duplicates:

>>> new_html = '<html></html>'
>>> db.webpage.update({'_id': url}, {'$set': {'html': new_html}}, upsert=True)
>>> db.webpage.find_one({'_id': url})
{u'_id': u'http://example.webscraping.com/view/United-Kingdom-239',
u'html': u'<html></html>'}
>>> db.webpage.find({'_id': url}).count()
1

We can see that after this update, the HTML has been replaced with the new content and the number of records for this URL is still 1.

Note

The official MongoDB documentation, which is available at http://docs.mongodb.org/manual/, covers these features and others in detail.

MongoDB cache implementation

Now we are ready to build our cache on MongoDB using the same class interface as the earlier DiskCache class:

from datetime import datetime, timedelta
from pymongo import MongoClient

class MongoCache:
    def __init__(self, client=None, expires=timedelta(days=30)):
        # if a client object is not passed then try
        # connecting to mongodb at the default localhost port
        self.client = MongoClient('localhost', 27017) \
            if client is None else client
        # create collection to store cached webpages,
        # which is the equivalent of a table
        # in a relational database
        self.db = self.client.cache
        # create index to expire cached webpages
        self.db.webpage.create_index('timestamp',
            expireAfterSeconds=expires.total_seconds())

    def __getitem__(self, url):
        """Load value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            return record['result']
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
        """
        record = {'result': result,
                  'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record},
            upsert=True)

The __getitem__ and __setitem__ methods here should be familiar to you from the discussion on how to prevent duplicates in the preceding section. You may have also noticed that a timestamp index was created in the constructor. This is a handy MongoDB feature that will automatically delete records a specified number of seconds after the given timestamp. This means that we do not need to manually check whether a record is still valid, as in the DiskCache class. Let's try it out with an empty timedelta object so that the record should be deleted immediately:

>>> cache = MongoCache(expires=timedelta())
>>> cache[url] = result
>>> cache[url]

The record is still there; it seems that our cache expiration is not working. The reason for this is that MongoDB runs a background task to check for expired records every minute, so this record has not yet been deleted. If we wait a minute, we will find that the cache expiration is working:

>>> import time; time.sleep(60)
>>> cache[url] 
Traceback (most recent call last):
...
KeyError: 'http://example.webscraping.com/view/United-Kingdom-239 does not exist'

This means that our MongoDB cache will not expire records at exactly the time given; there can be a delay of up to 1 minute. However, since a cache expiration of several weeks or months would typically be used, this relatively small additional delay should not be an issue.
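
If this 1 minute window ever matters, one workaround (not part of the MongoCache class shown here) is to ignore stale records when reading, so that a record past its expiry is treated as missing even before MongoDB's background task removes it. Here is a minimal sketch, assuming the constructor also stores the expiry as self.expires, which the version shown earlier does not do:

    def __getitem__(self, url):
        """Load the value at this URL, ignoring expired records
        """
        record = self.db.webpage.find_one({'_id': url})
        # treat records older than self.expires (an assumed attribute)
        # as expired, even if the TTL monitor has not yet deleted them
        if record and datetime.utcnow() - record['timestamp'] < self.expires:
            return record['result']
        raise KeyError(url + ' does not exist')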

Compression

To bring this cache to feature parity with the original disk cache, we need to add one final feature: compression. This can be achieved in a similar way to the disk cache, by pickling the data and then compressing it with zlib, as follows:

import pickle
import zlib
from datetime import datetime
from bson.binary import Binary

class MongoCache:
    # ... __init__ is unchanged from the previous version ...

    def __getitem__(self, url):
        """Load and decompress the value at this URL
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Compress and save the value for this URL
        """
        record = {
            'result': Binary(zlib.compress(pickle.dumps(result))),
            'timestamp': datetime.utcnow()
        }
        self.db.webpage.update(
            {'_id': url}, {'$set': record}, upsert=True)
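
Assuming these compressed methods are merged into the full MongoCache class from the previous section, a quick way to check that the compression round-trip works is to store a value and read it back; the result we get out should equal the result we put in (the values here are just placeholders):

>>> cache = MongoCache()
>>> result = {'html': '<html>...</html>', 'code': 200}
>>> cache[url] = result
>>> cache[url] == result
True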

Testing the cache

The source code for the MongoCache class is available at https://bitbucket.org/wswp/code/src/tip/chapter03/mongo_cache.py and, as with DiskCache, the cache can be tested with the link crawler by running this script:

$ time python mongo_cache.py
http://example.webscraping.com
http://example.webscraping.com/view/Afghanistan-1
...
http://example.webscraping.com/view/Zimbabwe-252
23m40.302s

$ time python mongo_cache.py
0.378s

The time taken here is double that for the disk cache. However, MongoDB does not suffer from filesystem limitations and will allow us to make a more efficient crawler in the next chapter, which deals with concurrency.
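
For reference, a rough sketch of what such a test script's entry point might look like is shown below. This assumes the link_crawler function developed in the earlier chapters accepts a cache object; it is not the exact contents of mongo_cache.py, which can be found at the repository linked above.

if __name__ == '__main__':
    # hypothetical wiring of the cache into the crawler from earlier chapters
    link_crawler('http://example.webscraping.com/', '/(index|view)',
                 cache=MongoCache())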
