Chapter 7. Persistence and Databases

Introduction

Credit: Aaron Watters, Software Consultant

There are three kinds of people in this world: those who can count and those who can’t.

However, there are only two kinds of computer programs: toy programs and programs that interact with some kind of persistent databases. That is to say, most real computer programs must retrieve stored information and record information for future use. These days, this description applies to almost every computer game, which can typically save and restore the state of the game at any time. So when I refer to toy programs, I mean programs written as exercises, or for the fun of programming. Nearly all real programs (such as programs that people get paid to write) have some persistent database storage/retrieval component.

When I was a Fortran programmer in the 1980s, I noticed that although almost every program had to retrieve and store information, they almost always did it using home-grown methods. Furthermore, since the storage and retrieval parts of the program were the least interesting components from the programmer’s point of view, these parts of the program were frequently implemented very sloppily and were hideous sources of intractable bugs. This repeated observation convinced me that the study and implementation of database systems sat at the core of programming pragmatics, and that the state of the art as I saw it then required much improvement.

Later, in graduate school, I was delighted to find an impressive and sophisticated body of work relating to the implementation of database systems. The literature of database systems covered issues of concurrency, fault tolerance, distribution, query optimization, database design, and transaction semantics, among others. In typical academic fashion, many of the concepts had been elaborated to the point of absurdity (such as the silly notion of conditional multivalued dependencies), but much of the work was directly related to the practical implementation of reliable and efficient storage and retrieval systems. The starting point for much of this work was E.F. Codd’s seminal paper, “A Relational Model of Data for Large Shared Data Banks.”[1]

Among my fellow graduate students, and even among most of the faculty, the same body of knowledge was either disregarded or regarded with some scorn. Everyone recognized that knowledge of conventional relational technology could be lucrative, but they generally considered such knowledge to be on the same level as knowing how to write (or more importantly, maintain) COBOL programs. This situation was not helped by the fact that the emerging database interface standard, SQL (which is now very well established), looked like an extension of COBOL and bore little obvious relationship to any modern programming language.

More than a decade later, there is little indication that anything will soon overtake SQL-based relational technology for the majority of data-based applications. In fact, relational-database technology seems more pervasive than ever. The largest software vendors—IBM, Microsoft, and Oracle—all provide various relational-database implementations as crucial components of their core offerings. Other large software firms, such as SAP and PeopleSoft, essentially provide layers of software built on top of a relational-database core.

Generally, relational databases have been augmented rather than replaced. Enterprise software-engineering dogma frequently espouses three-tier systems, in which the bottom tier is a carefully designed relational database, the middle tier defines a view of the database as business objects, and the top tier consists of applications or transactions that manipulate the business objects, with effects that ultimately translate to changes in the underlying relational tables.

Microsoft’s Open Database Connectivity (ODBC) standard provides a common programming API for SQL-based relational databases that permits programs to interact with many different database engines with few or no changes. For example, a Python program could first be implemented using Microsoft Jet[2] as a backend database for testing and debugging purposes. Once the program is stable, it can be put into production use, remotely accessing, say, a backend DB2 database on an IBM mainframe residing on another continent, by changing (at most) one line of code.
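To make this concrete, here is a sketch of how the backend choice can be isolated in a single spot; the module names and connection details here are purely illustrative, to be replaced with whatever DB API wrappers you actually have installed:

def get_connection(testing=True):
    # hypothetical sketch: substitute the DB API modules you actually use
    if testing:
        import odbc                       # e.g., the win32all ODBC module
        return odbc.odbc('testdsn')       # a local Jet database via an ODBC DSN
    else:
        import DB2                        # e.g., a DB2 interface module
        return DB2.connect(dsn='PRODDB', uid='app', pwd='secret')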

Relational databases are not appropriate for all applications. In particular, a computer game or engineering design tool that must save and restore sessions should probably use a more direct method of persisting the logical objects of the program than the flat tabular representation encouraged in relational-database design. However, even in domains such as engineering or scientific information, a hybrid approach that uses some relational methods is often advisable. For example, I have seen a complex relational-database schema for archiving genetic-sequencing information—in which the sequences show up as binary large objects (BLOBs)—but a tremendous amount of important ancillary information can fit nicely into relational tables. But, as the reader has probably surmised, I fear I speak as a relational zealot.

Within the Python world there are many ways of providing persistence and database functionality. My personal favorite is Gadfly, http://gadfly.sourceforge.net/, a simple and minimal SQL implementation that works primarily with in-memory databases. It is my favorite for no other reason than because it is mine, and its biggest advantage is that, if it becomes unworkable for you, it is easy to switch over to another, industrial-strength SQL engine. Many Gadfly users have started an application with Gadfly (because it was easy to use) and switched later (because they needed more).

However, many people may prefer to start by using other SQL implementations such as MySQL, Microsoft Access, Oracle, Sybase, Microsoft SQL Server, SQLite, or others that provide the advantages of an ODBC interface (which Gadfly does not do).

Python provides a standard interface for accessing relational databases: the Python DB Application Programming Interface (Py-DBAPI), originally designed by Greg Stein. Each underlying database API requires a wrapper implementation of the Py-DBAPI, and implementations are available for just about all underlying database interfaces, notably Oracle and ODBC.

When the relational approach is overkill, Python provides built-in facilities for storing and retrieving data. At the most basic level, the programmer can manipulate files directly, as covered in Chapter 2. A step up from files, the marshal module allows programs to serialize data structures constructed from simple Python types (not including, e.g., classes or class instances). marshal has the advantage of being able to retrieve large data structures with blinding speed. The pickle and cPickle modules allow general storage of objects, including classes, class instances, and circular structures. cPickle is so named because it is implemented in C and is consequently quite fast, but it remains slower than marshal. For access to structured data in a somewhat human-readable form, it is also worth considering storing and retrieving data in XML format (taking advantage of Python’s several XML parsing and generation modules), covered in Chapter 12—but this option works best for write once, read many-type applications. Serialized data or XML representations may be stored in SQL databases to create a hybrid approach as well.

While marshal and pickle provide basic serialization and deserialization of structures, the application programmer will frequently desire more functionality, such as transaction support and concurrency control. When the relational model doesn’t fit the application, a direct object database implementation such as the Z-Object Database (ZODB) might be appropriate—see http://zope.org/Products/ZODB3.2.

I must conclude with a plea to those who are dismissive of relational-database technology. Remember that it is successful for good reasons, and it might be worth considering. To paraphrase Churchill:

text = """ Indeed, it has been said that democracy is the worst form of
    government, except for all those others that have been tried
    from time to time. """
for a, b in [("democracy", "SQL"), ("government", "database")]:
    text = text.replace(a, b)
print text

7.1. Serializing Data Using the marshal Module

Credit: Luther Blissett

Problem

You want to serialize and reconstruct a Python data structure whose items are fundamental Python objects (e.g., lists, tuples, numbers, and strings, but not classes or instances) as fast as possible.

Solution

If you know that your data is composed entirely of fundamental Python objects (and you only need to support one version of Python, though possibly on several different platforms), the lowest-level, fastest approach to serializing your data (i.e., turning it into a string of bytes, and later reconstructing it from such a string) is via the marshal module. Suppose that data has only elementary Python data types as items, for example:

data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}

You can serialize data to a bytestring at top speed as follows:

import marshal
bytes = marshal.dumps(data)

You can now sling bytes around as you wish (e.g., send it across a network, put it as a BLOB in a database, etc.), as long as you keep its arbitrary binary bytes intact. Then you can reconstruct the data structure from the bytestring at any time:

redata = marshal.loads(bytes)

When you specifically want to write the data to a disk file (as long as the latter is open for binary—not the default text mode—input/output), you can also use the dump function of the marshal module, which lets you dump several data structures to the same file one after the other:

ouf = open('datafile.dat', 'wb')
marshal.dump(data, ouf)
marshal.dump('some string', ouf)
marshal.dump(range(19), ouf)
ouf.close()

You can later recover from datafile.dat the same data structures you dumped into it, in the same sequence:

inf = open('datafile.dat', 'rb')
a = marshal.load(inf)
b = marshal.load(inf)
c = marshal.load(inf)
inf.close()

Discussion

Python offers several ways to serialize data (meaning to turn the data into a string of bytes that you can save on disk, put in a database, send across the network, etc.) and corresponding ways to reconstruct the data from such serialized forms. The lowest-level approach is to use the marshal module, which Python uses to write its bytecode files. marshal supports only elementary data types (e.g., dictionaries, lists, tuples, numbers, and strings) and combinations thereof. marshal does not guarantee compatibility from one Python release to another, so data serialized with marshal may not be readable if you upgrade your Python release. However, marshal does guarantee independence from a specific machine’s architecture, so it is guaranteed to work if you’re sending serialized data between different machines, as long as they are all running the same version of Python—similar to how you can share compiled Python bytecode files in such a distributed setting.

marshal’s dumps function accepts any suitable Python data structure and returns a bytestring representing it. You can pass that bytestring to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In particular, the order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful and is thus preserved. In between the dumps and loads calls, you can subject the bytestring to any procedure you wish, such as sending it over the network, storing it into a database and retrieving it, or encrypting and decrypting it. As long as the string’s binary structure is correctly restored, loads will work fine on it (as stated previously, this is guaranteed only if you use loads under the same Python release with which you originally executed dumps).
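For instance, here is a minimal round trip (any combination of elementary types would do just as well):

import marshal
data = {'a': [1, 2, 3], (4, 5): u'tuple key'}
bytes = marshal.dumps(data)
assert marshal.loads(bytes) == data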

When you specifically need to save the data to a file, you can also use marshal’s dump function, which takes two arguments: the data structure you’re dumping and the open file object. Note that the file must be opened for binary I/O (not the default, which is text I/O) and can’t be a file-like object, as marshal is quite picky about it being a true file. The advantage of dump is that you can perform several calls to dump with various data structures and the same open file object: each data structure is then dumped together with information about how long the dumped bytestring is. As a consequence, when you later open the file for binary reading and then call marshal.load, passing the file as the argument, you can reload each previously dumped data structure sequentially, one after the other, at each call to load. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped. (Again, dump and load work within one Python release—no guarantee across releases.)

Those accustomed to other languages and libraries offering “serialization” facilities may be wondering if marshal imposes substantial practical limits on the size of objects you can serialize or deserialize. Answer: Nope. Your machine’s memory might, but as long as everything fits comfortably in memory, marshal imposes practically no further limit.

See Also

Recipe 7.2 for cPickle, the big brother of marshal; documentation on the marshal standard library module in the Library Reference and in Python in a Nutshell.

7.2. Serializing Data Using the pickle and cPickle Modules

Credit: Luther Blissett

Problem

You want to serialize and reconstruct, at a reasonable speed, a Python data structure, which may include fundamental Python objects as well as classes and instances.

Solution

If you don’t want to assume that your data is composed only of fundamental Python objects, or you need portability across versions of Python, or you need to transmit the serialized form as text, the best way of serializing your data is with the cPickle module. (The pickle module is a pure-Python equivalent and totally interchangeable, but it’s slower and not worth using unless you’re missing cPickle.) For example, say you have:

data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}

You can serialize data to a text string:

import cPickle
text = cPickle.dumps(data)

or to a binary string, a choice that is faster and takes up less space:

bytes = cPickle.dumps(data, 2)

You can now sling text or bytes around as you wish (e.g., send across a network, include as a BLOB in a database—see Recipe 7.10, Recipe 7.11, and Recipe 7.12) as long as you keep text or bytes intact. In the case of bytes, it means keeping the arbitrary binary bytes intact. In the case of text, it means keeping its textual structure intact, including newline characters. Then you can reconstruct the data at any time, regardless of machine architecture or Python release:

redata1 = cPickle.loads(text)
redata2 = cPickle.loads(bytes)

Either call reconstructs a data structure that compares equal to data. In particular, the order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful, and thus it is preserved. You don’t need to tell cPickle.loads whether the original dumps used text mode (the default, also readable by some very old versions of Python) or binary (faster and more compact)—loads figures it out by examining its argument’s contents.

When you specifically want to write the data to a file, you can also use the dump function of the cPickle module, which lets you dump several data structures to the same file one after the other:

ouf = open('datafile.txt', 'w')
cPickle.dump(data, ouf)
cPickle.dump('some string', ouf)
cPickle.dump(range(19), ouf)
ouf.close()

Once you have done this, you can recover from datafile.txt the same data structures you dumped into it, one after the other, in the same order:

inf = open('datafile.txt')
a = cPickle.load(inf)
b = cPickle.load(inf)
c = cPickle.load(inf)
inf.close()

You can also pass cPickle.dump a third argument with a value of 2 to tell cPickle.dump to serialize the data in binary form (faster and more compact), but the data file must then be opened for binary I/O, not in the default text mode, both when you originally dump to the file and when you later load from the file.
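For instance, a binary-format variant of the previous snippets might look like this (note the 'wb' and 'rb' file modes):

ouf = open('datafile.bin', 'wb')
cPickle.dump(data, ouf, 2)                # 2: fast, compact binary format
cPickle.dump('some string', ouf, 2)
ouf.close()
inf = open('datafile.bin', 'rb')
a = cPickle.load(inf)                     # load figures out the protocol itself
b = cPickle.load(inf)
inf.close()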

Discussion

Python offers several ways to serialize data (i.e., make the data into a string of bytes that you can save on disk, save in a database, send across the network, etc.) and corresponding ways to reconstruct the data from such serialized forms. Typically, the best approach is to use the cPickle module. A pure-Python equivalent, called pickle (the cPickle module is coded in C as a Python extension) is substantially slower, and the only reason to use it is if you don’t have cPickle (e.g., with a Python port onto a mobile phone with tiny storage space, where you saved every byte you possibly could by installing only an indispensable subset of Python’s large standard library). However, in cases where you do need to use pickle, rest assured that it is completely interchangeable with cPickle: you can pickle with either module and unpickle with the other one, without any problems whatsoever.

cPickle supports most elementary data types (e.g., dictionaries, lists, tuples, numbers, strings) and combinations thereof, as well as classes and instances. Pickling classes and instances saves only the data involved, not the code. (Code objects are not even among the types that cPickle knows how to serialize, basically because there would be no way to guarantee their portability across disparate versions of Python. See Recipe 7.6 for a way to serialize code objects, as long as you don’t need the cross-version guarantee.) See Recipe 7.4 for more about pickling classes and instances.

cPickle guarantees compatibility from one Python release to another, as well as independence from a specific machine’s architecture. Data serialized with cPickle will still be readable if you upgrade your Python release, and pickling is also guaranteed to work if you’re sending serialized data between different machines.

The dumps function of cPickle accepts any Python data structure and returns a text string representing it. If you call dumps with a second argument of 2, dumps returns an arbitrary bytestring instead: the operation is faster, and the resulting string takes up less space. You can pass either the text or the bytestring to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In between the dumps and loads calls, you can subject the text or bytestring to any procedure you wish, such as sending it over the network, storing it in a database and retrieving it, or encrypting and decrypting it. As long as the string’s textual or binary structure is correctly restored, loads will work fine on it (even across platforms and in future releases). If you need to produce data readable by old (pre-2.3) versions of Python, consider using 1 as the second argument: operation will be slower, and the resulting strings will not be as compact as those obtained by using 2, but the strings can be unpickled by old Python versions as well as by current and future ones.
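A quick way to see the space trade-off among the protocols on your own data is to compare the lengths of the strings that dumps produces; exact numbers will vary, but 2 is generally the most compact:

import cPickle
data = {12: 'twelve', 'feep': list('ciao'), 1.23: 4+5j, (1, 2, 3): u'wer'}
for proto in 0, 1, 2:
    print proto, len(cPickle.dumps(data, proto))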

When you specifically need to save the data into a file, you can also use cPickle’s dump function, which takes two arguments: the data structure you’re dumping and the open file or file-like object. If the file is opened for binary I/O, rather than the default (text I/O), then by giving dump a third argument of 2, you can ask for binary format, which is faster and takes up less space (again, you can also use 1 in this position to get a binary format that’s neither as compact nor as fast, but is understood by old, pre-2.3 Python versions too). The advantage of dump over dumps is that, with dump, you can perform several calls, one after the other, with various data structures and the same open file object. Each data structure is then dumped with information about how long the dumped string is. Consequently, when you later open the file for reading (binary reading, if you asked for binary format) and then repeatedly call cPickle.load, passing the file as the argument, each data structure previously dumped is reloaded sequentially, one after the other. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped.

Those accustomed to other languages and libraries offering “serialization” facilities may be wondering whether pickle imposes substantial practical limits on the size of objects you can serialize or deserialize. Answer: Nope. Your machine’s memory might, but as long as everything fits comfortably in memory, pickle practically imposes no further limit.

See Also

Recipe 7.1 and Recipe 7.4; documentation for the standard library module cPickle in the Library Reference and Python in a Nutshell.

7.3. Using Compression with Pickling

Credit: Bill McNeill, Andrew Dalke

Problem

You want to pickle generic Python objects to and from disk in a compressed form.

Solution

Standard library modules cPickle and gzip offer the needed functionality; you just need to glue them together appropriately:

import cPickle, gzip
def save(filename, *objects):
    ''' save objects into a compressed diskfile '''
    fil = gzip.open(filename, 'wb')
    for obj in objects: cPickle.dump(obj, fil, 2)
    fil.close()
def load(filename):
    ''' reload objects from a compressed diskfile '''
    fil = gzip.open(filename, 'rb')
    while True:
        try: yield cPickle.load(fil)
        except EOFError: break
    fil.close()
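Typical usage of these two functions might look like the following sketch (the filename is, of course, arbitrary):

save('/tmp/test.gz', {'a': 1}, range(5), 'hello')
for obj in load('/tmp/test.gz'):
    print obj
# or, to get all the objects back at once:
objects = list(load('/tmp/test.gz'))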

Discussion

Persistence and compression, as a general rule, go well together. cPickle protocol 2 saves Python objects quite compactly, but the resulting files can still compress quite well. For example, on my Linux box, open('/usr/share/dict/words').readlines() produces a list of over 45,000 strings. Pickling that list with the default protocol 0 makes a disk file of 972 KB, while protocol 2 takes only 716 KB. However, using both gzip and protocol 2, as shown in this recipe, requires only 268 KB, saving a significant amount of space. As it happens, protocol 0 produces a more compressible file in this case, so that using gzip and protocol 0 would save even more space, taking only 252 KB on disk. However, the difference between 268 and 252 isn’t all that meaningful, and protocol 2 has other advantages, particularly when used on instances of new-style classes, so I recommend the mix I use in the functions shown in this recipe.
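If you want to reproduce this kind of measurement on your own machine (the path to the words file, and the exact sizes, will of course differ), a small sketch along these lines will do:

import cPickle, gzip, os
words = open('/usr/share/dict/words').readlines()
for proto in 0, 2:
    raw = cPickle.dumps(words, proto)
    gzf = gzip.open('/tmp/words%d.gz' % proto, 'wb')
    gzf.write(raw)
    gzf.close()
    print proto, len(raw), os.path.getsize('/tmp/words%d.gz' % proto)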

Whatever protocol you choose to save your data, you don’t need to worry about it when you’re reloading the data. The protocol is recorded in the file together with the data, so cPickle.load can figure out by itself all it needs. Just pass it an instance of a file or pseudo-file object with a read method, and cPickle.load returns each object that was pickled to the file, one after the other, and raises EOFError when the file’s done. In this recipe, we wrap a generator around cPickle.load, so you can simply loop over all recovered objects with a for statement, or, depending on what you need, you can use some call such as list(load('somefile.gz')) to get a list with all recovered objects as its items.

See Also

Modules gzip and cPickle in the Library Reference.

7.4. Using the cPickle Module on Classes and Instances

Credit: Luther Blissett

Problem

You want to save and restore class and instance objects using the cPickle module.

Solution

You often need no special precautions to use cPickle on your classes and their instances. For example, the following works fine:

import cPickle
class ForExample(object):
    def __init__(self, *stuff):
        self.stuff = stuff
anInstance = ForExample('one', 2, 3)
saved = cPickle.dumps(anInstance)
reloaded = cPickle.loads(saved)
assert anInstance.stuff == reloaded.stuff

However, sometimes there are problems:

anotherInstance = ForExample(1, 2, open('three', 'w'))
wontWork = cPickle.dumps(anotherInstance)

This snippet causes a TypeError: “can’t pickle file objects” exception, because the state of anotherInstance includes a file object, and file objects cannot be pickled. You would get exactly the same exception if you tried to pickle any other container that includes a file object among its items.

However, in some cases, you may be able to do something about it:

class PrettyClever(object):
    def __init__(self, *stuff):
        self.stuff = stuff
    def __getstate__(self):
        def normalize(x):
            if isinstance(x, file):
                return 1, (x.name, x.mode, x.tell())
            return 0, x
        return [normalize(x) for x in self.stuff]
    def __setstate__(self, stuff):
        def reconstruct(x):
            if x[0] == 0:
                return x[1]
            name, mode, offs = x[1]
            openfile = open(name, mode)
            openfile.seek(offs)
            return openfile
        self.stuff = tuple([reconstruct(x) for x in stuff])

By defining the __getstate__ and __setstate__ special methods in your class, you gain fine-grained control over what, exactly, your class’s instances consider to be their state. As long as you can define such state in picklable terms, and reconstruct your instances from the unpickled state in some way that is sufficient for your application, you can make your instances themselves picklable and unpicklable in this way.
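For instance, here is a quick check of the approach (note that reopening with mode 'w' truncates the file on reload, one of the ways in which, as discussed below, this kind of reconstruction may prove unsatisfactory):

pc = PrettyClever('one', 2, open('three', 'w'))
saved = cPickle.dumps(pc)
reloaded = cPickle.loads(saved)       # the file gets reopened and repositioned
print reloaded.stuff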

Discussion

cPickle dumps class and function objects by name (i.e., through their module’s name and their name within the module). Thus, you can dump only classes defined at module level (not inside other classes and functions). Reloading such objects requires the respective modules to be available for import. Instances can be saved and reloaded only if they belong to such classes. In addition, the instance’s state must also be picklable.

By default, an instance’s state is the contents of the instance’s __dict__, plus whatever state the instance may get from the built-in type the instance’s class inherits from, if any. For example, an instance of a new-style class that subclasses list includes the list items as part of the instance’s state. cPickle also handles instances of new-style classes that define or inherit a class attribute named __slots__ (and therefore hold some or all per-instance state in those predefined slots, rather than in a per-instance __dict__). Overall, cPickle’s default approach is often quite sufficient and satisfactory.

Sometimes, however, you may have nonpicklable attributes or items as part of your instance’s state (as cPickle defines such state by default, as explained in the previous paragraph). In this recipe, for example, I show a class whose instances hold arbitrary stuff, which may include open file objects. To handle this case, your class can define the special method __getstate__. cPickle calls that method on your object, if your object’s class defines it or inherits it, instead of going directly for the object’s __dict__ (or possibly __slots__ and/or built-in type bases).

Normally, when you define the __getstate__ method, you define the __setstate__ method as well, as shown in this recipe’s Solution. __getstate__ can return any picklable object, and that object gets pickled, and later, at unpickling time, passed as __setstate__’s argument. In this recipe’s Solution, __getstate__ returns a list that’s similar to the instance’s default state (attribute self.stuff), except that each item is turned into a tuple of two items. The first item in the pair can be set to 0 to indicate that the second one will be taken verbatim, or 1 to indicate that the second item will be used to reconstruct an open file. (Of course, the reconstruction may fail or be unsatisfactory in several ways. There is no general way to save an open file’s state, which is why cPickle itself doesn’t even try. But in the context of our application, we can assume that the given approach will work.) When reloading the instance from pickled form, cPickle calls __setstate__ with the list of pairs, and __setstate__ can reconstruct self.stuff by processing each pair appropriately in its nested reconstruct function. This scheme can clearly generalize to getting and restoring state that may contain various kinds of normally unpicklable objects—just be sure to use different numbers to tag each of the various kinds of “nonverbatim” pairs you need to support.

In one particular case, you can define __getstate__ without defining __setstate__: __getstate__ must then return a dictionary, and reloading the instance from pickled form uses that dictionary just as the instance’s __dict__ would normally be used. Not running your own code at reloading time is a serious hindrance, but it may come in handy when you want to use __getstate__, not to save otherwise unpicklable state but rather as an optimization. Typically, this optimization opportunity occurs when your instance caches results that it can recompute if they’re absent, and you decide it’s best not to store the cache as a part of the instance’s state. In this case, you should define __getstate__ to return a dictionary that’s the indispensable subset of the instance’s __dict__. (See Recipe 4.13 for a simple and handy way to “subset a dictionary”.)
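For instance, a minimal sketch of this cache-dropping idiom (names are made up for illustration):

class CachedSquares(object):
    def __init__(self):
        self._cache = {}                  # recomputable, not worth pickling
    def square(self, x):
        if x not in self._cache:
            self._cache[x] = x * x
        return self._cache[x]
    def __getstate__(self):
        d = self.__dict__.copy()
        d['_cache'] = {}                  # pickle everything but the cache
        return d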

Defining __getstate__ (and then, normally, also __setstate__) also gives you a further important bonus, besides the pickling support: if a class offers these methods but doesn’t offer special methods __copy__ or __deepcopy__, then the methods are also used for copying, both shallowly and deeply, as well as for serializing. The state data returned by __getstate__ is deep-copied if and only if the object is being deep-copied, but, other than this distinction, shallow and deep copies work very similarly when they are implemented through __getstate__. See Recipe 4.1 for more information about how a class can control the way its instances are copied, shallowly or deeply.

With either the default pickling/unpickling approach, or your own __getstate__ and __setstate__, the instance’s special method __init__ is not called when the instance is getting unpickled. If the most convenient way for you to reconstruct an instance is to call the __init__ method with appropriate parameters, then you may want to define the special method __getinitargs__, instead of __getstate__. In this case, cPickle calls this method without arguments: the method must return a picklable tuple, and at unpickling time, cPickle calls __init__ with the arguments that are that tuple’s items. __getinitargs__, like __getstate__ and __setstate__, can also be used for copying.
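Here is a minimal sketch of the __getinitargs__ approach (note that, in Python 2, __getinitargs__ applies to old-style classes; a new-style class would use __getnewargs__ instead):

class Point:                              # old-style class, on purpose
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getinitargs__(self):
        return self.x, self.y             # __init__ gets called with these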

The Library Reference for the pickle and copy_reg modules details even subtler things you can do when pickling and unpickling, as well as the thorny security issues that are likely to arise if you ever stoop to unpickling data from untrusted sources. (Executive summary: don’t do that—there is no way Python can protect you if you do.) However, the techniques I’ve discussed here should suffice in almost all practical cases, as long as the security aspects of unpickling are not a problem (and if they are, the only practical suggestion is: forget pickling!).

See Also

Recipe 7.2; documentation for the standard library module cPickle in the Library Reference and Python in a Nutshell.

7.5. Holding Bound Methods in a Picklable Way

Credit: Peter Cogolo

Problem

You need to pickle an object, but that object holds (as an attribute or item) a bound method of another object, and bound methods are not picklable.

Solution

Say you have the following objects:

import cPickle
class Greeter(object):
    def __init__(self, name):
        self.name = name
    def greet(self):
        print 'hello', self.name
class Repeater(object):
    def __init__(self, greeter):
        self.greeter = greeter
    def greet(self):
        self.greeter()
        self.greeter()
r = Repeater(Greeter('world').greet)

Were it not for the fact that r holds a bound method as its greeter attribute, you could pickle r very simply:

s = cPickle.dumps(r)

However, upon encountering the bound method, this call to cPickle.dumps raises a TypeError. One simple solution is to have each instance of class Repeater hold, not a bound method directly, but rather a picklable wrapper to it. For example:

class picklable_boundmethod(object):
    def __init__(self, mt):
        self.mt = mt
    def __getstate__(self):
        return self.mt.im_self, self.mt.im_func.__name__
    def __setstate__(self, (s, fn)):
        self.mt = getattr(s, fn)
    def __call__(self, *a, **kw):
        return self.mt(*a, **kw)

Now, changing Repeater.__init__’s body to self.greeter = picklable_boundmethod(greeter) makes the previous snippet work.
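A quick check (within one session, so that classes Greeter and Repeater can be found by name at unpickling time):

r = Repeater(Greeter('world').greet)      # Repeater now wraps greeter itself
s = cPickle.dumps(r)
r2 = cPickle.loads(s)
r2.greet()                                # emits 'hello world' twice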

Discussion

The Python Standard Library pickle module (just like its faster equivalent cousin cPickle) pickles functions and classes by name—this implies, in particular, that only functions defined at the top level of a module can be pickled (the pickling of such a function, in practice, contains just the names of the module and function).

If you have a graph of objects that hold each other, not directly, but via one another’s bound methods (which is often a good idea in Python), this limitation can make the whole graph unpicklable. One solution might be to teach pickle how to serialize bound methods, along the same lines as described in Recipe 7.6. Another possible solution is to define appropriate __getstate__ and __setstate__ methods to turn bound methods into something picklable at dump time and rebuild them at load time, along the lines described in Recipe 7.4. However, this latter possibility is not a good factorization when you have several classes whose instances hold bound methods.

This recipe pursues a simpler idea, based on holding bound methods, not directly, but via the picklable_boundmethod wrapper class. picklable_boundmethod is written under the assumption that the only thing you usually do with a bound method is to call it, so it only delegates __call__ functionality specifically. (You could, in addition, also use __getattr__, in order to delegate other attribute accesses, as sketched below.)
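For instance, such delegation, as one more method to add inside the body of class picklable_boundmethod, might look like this sketch:

    def __getattr__(self, name):
        # delegate any other attribute access to the wrapped bound method
        return getattr(self.mt, name)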

In normal operation, the fact that you’re holding an instance of picklable_boundmethod rather than holding the bound method object directly is essentially transparent. When pickling time comes, special method __getstate__ of picklable_boundmethod comes into play, as previously covered in Recipe 7.4. In the case of picklable_boundmethod, __getstate__ returns the object to which the bound method belongs and the function name of the bound method. Later, at unpickling time, __setstate__ recovers an equivalent bound method from the reconstructed object by using the getattr built-in for that name. This approach isn’t infallible because an object might hold its methods under assumed names (different from the real function names of the methods). However, assuming you’re not specifically doing something weird for the specific purpose of breaking picklable_boundmethod’s functionality, you shouldn’t ever run into this kind of obscure problem!

See Also

Library Reference and Python in a Nutshell docs for modules pickle and cPickle, bound-method objects, and the getattr built-in.

7.6. Pickling Code Objects

Credit: Andres Tremols, Peter Cogolo

Problem

You want to be able to pickle code objects, but this functionality is not supported by the standard library’s pickling modules.

Solution

You can extend the abilities of the pickle (or cPickle) module by using module copy_reg. Just make sure the following module has been imported before you pickle code objects, and has been imported, or is available to be imported, when you’re unpickling them:

import new, types, copy_reg
def code_ctor(*args):
    # delegate to new.code the construction of a new code object
    return new.code(*args)
def reduce_code(co):
    # a reductor function must return a tuple with two items: first, the
    # constructor function to be called to rebuild the argument object
    # at a future de-serialization time; then, the tuple of arguments
    # that will need to be passed to the constructor function.
    if co.co_freevars or co.co_cellvars:
        raise ValueError, "Sorry, cannot pickle code objects from closures"
    return code_ctor, (co.co_argcount, co.co_nlocals, co.co_stacksize,
        co.co_flags, co.co_code, co.co_consts, co.co_names,
        co.co_varnames, co.co_filename, co.co_name, co.co_firstlineno,
        co.co_lnotab)
# register the reductor to be used for pickling objects of type 'CodeType'
copy_reg.pickle(types.CodeType, reduce_code)
if __name__ == '__main__':
    # example usage of our new ability to pickle code objects
    import cPickle
    # a function (which, inside, has a code object, of course)
    def f(x): print 'Hello,', x
    # serialize the function's code object to a string of bytes
    pickled_code = cPickle.dumps(f.func_code)
    # recover an equal code object from the string of bytes
    recovered_code = cPickle.loads(pickled_code)
    # build a new function around the rebuilt code object
    g = new.function(recovered_code, globals())
    # check what happens when the new function gets called
    g('world')

Discussion

The Python Standard Library pickle module (just like its faster equivalent cousin cPickle) pickles functions and classes by name. There is no pickling of the code objects containing the compiled bytecode that, when run, determines almost every aspect of functions’ (and methods') behavior. In some situations, you’d rather pickle everything by value, so that all the relevant stuff can later be retrieved from the pickle, rather than having to have module files around for some of it. Sometimes you can solve such problems by using marshaling rather than pickling, since marshal does let you serialize code objects, but marshal has limitations on many other issues. For example, you cannot marshal instances of classes you have coded. (Once you’re serializing code objects, which are specific to a given version of Python, pickle will share one key limitation of marshal: no guaranteed ability to save and later reload data across different versions of Python.)

An alternative approach is to take advantage of the possibility, which the Python Standard Library allows, to extend the set of types known to pickle. Basically, you can “teach” pickle how to save and reload code objects; this, in turn, lets you pickle by value, rather than “by name”, such objects as functions and classes. (The code in this recipe’s Solution under the if __name__ == '__main__' guard essentially shows how to extend pickle for a function.)

To teach pickle about some new type, use module copy_reg, which is also part of the Python Standard Library. Through function copy_reg.pickle, you register the reduction function to use for instances of a given type. A reduction function takes as its argument an instance to be pickled and returns a tuple with two items: a constructor function, which will be called to reconstruct the instance, and a tuple of arguments, which will be passed to the constructor function. (A reduction function may also return other kinds of results, but for this recipe’s purposes a two-item tuple suffices.)

The module in this recipe defines function reduce_code, then registers it as the reduction function for objects of type types.CodeType—that is, code objects. When reduce_code gets called, it first checks whether its code object co comes from a closure (functions nested inside each other), because it just can’t deal with this eventuality—I’ve been unable to find a way that works, so in this case, reduce_code just raises an exception to let the user know about the problem.

In normal cases, reduce_code returns code_ctor as the constructor and a tuple made up of all of co’s attributes as the arguments tuple for the constructor. When a code object is reloaded from a pickle, code_ctor gets called with those arguments and simply passes the call on to the new.code callable, which is the true constructor for code arguments. Unfortunately, reduce_code cannot return new.code itself as the first item in its result tuple, because new.code is a built-in (a C-coded callable) but is not available through a built-in name. So, basically, the role of code_ctor is to provide a name for the (by-name) pickling of new.code.

The if __name__ == '__main__' part of the recipe provides a typical toy usage example—it pickles a code object to a string, recovers a copy of it from the pickle string, and builds and calls a function around that code object. A more typical use case for this recipe’s functionality, of course, will do the pickling in one script and the unpickling in another. Assume that the module in this recipe has been saved as file reco.py somewhere on Python’s sys.path, so that it can be imported by Python scripts and other modules. You could then have a script that imports reco and thus becomes able to pickle code objects, such as:

import reco, pickle
def f(x):
    print 'Hello,', x
pickle.dump(f.func_code, open('saved.pickle','wb'))

To unpickle and use that code object, an example script might be:

import new, cPickle
c = cPickle.load(open('saved.pickle','rb'))
g = new.function(c, globals())
g('world')

Note that the second script does not need to import reco—the import will happen automatically when needed (part of the information that pickle saves in saved.pickle is that, in order to reconstruct the pickled object therein, it needs to call reco.code_ctor; so, it also knows it needs to import reco). I’m also showing that you can use modules pickle and cPickle interchangeably. cPickle is faster, but there are no other differences, and in particular, you can use one module to pickle objects and the other one to unpickle them, if you wish.

See Also

Modules pickle, cPickle, and copy_reg in the Library Reference and Python in a Nutshell.

7.7. Mutating Objects with shelve

Credit: Luther Blissett

Problem

You are using the standard module shelve. Some of the values you have shelved are mutable objects, and you need to mutate these objects.

Solution

The shelve module offers a kind of persistent dictionary—an important niche between the power of relational-database engines and the simplicity of marshal, pickle, dbm, and similar file formats. However, you should be aware of a typical trap you need to avoid when using shelve. Consider the following interactive Python session:

>>> import shelve
>>> # Build a simple sample shelf
>>> she = shelve.open('try.she', 'c')
>>> for c in 'spam': she[c] = {c: 23}
...
>>> for c in she.keys(): print c, she[c]
...
p {'p': 23}
s {'s': 23}
a {'a': 23}
m {'m': 23}
>>> she.close()

We’ve created the shelve file, added some data to it, and closed it. Good—now we can reopen it and work with it:

>>> she = shelve.open('try.she', 'c')
>>> she['p']
{'p': 23}
>>> she['p']['p'] = 42
>>> she['p']
{'p': 23}

What’s going on here? We just set the value to 42, but our setting didn’t take in the shelve object! The problem is that we were working with a temporary object that shelve gave us, not with the “real thing”. shelve, when we open it with default options, like here, doesn’t track changes to such temporary objects. One reasonable solution is to bind a name to this temporary object, do our mutation, and then assign the mutated object back to the appropriate item of shelve:

>>> a = she['p']
>>> a['p'] = 42
>>> she['p'] = a
>>> she['p']
{'p': 42}
>>> she.close()

We can verify that the change was properly persisted:

>>> she = shelve.open('try.she', 'c')
>>> for c in she.keys(): print c, she[c]
...
p {'p': 42}
s {'s': 23}
a {'a': 23}
m {'m': 23}

A simpler solution is to open the shelve object with the writeback option set to True:

>>> she = shelve.open('try.she', 'c', writeback=True)

The writeback option instructs shelve to keep track of all the objects it gets from the file and write them all back to the file before closing it, just in case they have been modified in the meantime. While simple, this approach can be quite expensive, particularly in terms of memory consumption. Specifically, if we read many objects from a shelve object opened with writeback=True, even if we only modify a few of them, shelve is going to keep them all in memory, since it can’t tell in advance which one we may be about to modify. The previous approach, where we explicitly take responsibility to notify shelve of any changes (by assigning the changed objects back to the place they came from), requires more care on our part, but repays that care by giving us much better performance.
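For instance, with writeback=True, the natural mutation idiom we first tried now does persist:

>>> she = shelve.open('try.she', 'c', writeback=True)
>>> she['p']['p'] = 42
>>> she.close()
>>> shelve.open('try.she', 'c')['p']
{'p': 42}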

Discussion

The standard Python module shelve can be quite convenient in many cases, but it hides a potentially nasty trap, admittedly well documented in Python’s online docs but still easy to miss. Suppose you’re shelving mutable objects, such as dictionaries or lists. Naturally, you are quite likely to want to mutate some of those objects—for example, by calling mutating methods (append on a list, update on a dictionary, etc.) or by assigning a new value to an item or attribute of the object. However, when you do this, the change doesn’t occur in the shelve object. This is because we actually mutate a temporary object that the shelve object has given us as the result of shelve’s own __getitem__ method, but the shelve object, by default, does not keep track of that temporary object, nor does it care about it once it returns it to us.

As shown in the recipe, one solution is to bind a name to the temporary object obtained by keying into the shelf, doing whatever mutations are needed to the object via the name, then assigning the newly mutated object back to the appropriate item of the shelve object. When you assign to a shelve object’s item, the shelve object’s __setitem__ method gets invoked, and it appropriately updates the shelve object itself, so that the change does occur.

Alternatively, you can add the flag writeback=True at the time you open the shelve object, and then shelve keeps track of every object it hands you, saving them all back to disk at the end. This approach may save you quite a bit of fussy and laborious coding, but take care: if you read many items of the shelve object and only modify a few of them, the writeback approach can be exceedingly costly, particularly in terms of memory consumption. When opened with writeback=True, shelve will keep in memory any item it has ever handed you, and save them all to disk at the end, since it doesn’t have a reliable way to tell which items you may be about to modify, nor, in general, even which items you have actually modified by the time you close the shelve object. The recommended approach, unless you’re going to modify just about every item you read (or unless the shelve object in question is small enough compared with your available memory that you don’t really care), is the previous one: bind a name to the items that you get from a shelve object with intent to modify them, and assign each item back into the shelve object once you’re done mutating that item.

See Also

Recipe 7.1 and Recipe 7.2 for alternative serialization approaches; documentation for the shelve standard library module in the Library Reference and Python in a Nutshell.

7.8. Using the Berkeley DB Database

Credit: Farhad Fouladi

Problem

You want to persist some data, exploiting the simplicity and good performance of the Berkeley DB database library.

Solution

If you have previously installed Berkeley DB on your machine, the Python Standard Library comes with package bsddb (and optionally bsddb3, to access Berkeley DB release 3.2 databases) to interface your Python code with Berkeley DB. To get either bsddb or, lacking it, bsddb3, use a try/except on import:

try:
    from bsddb import db                  # first try release 4
except ImportError:
    from bsddb3 import db                 # not there, try release 3 instead
print db.DB_VERSION_STRING
# emits, e.g.: Sleepycat Software: Berkeley DB 4.1.25: (December 19, 2002)

To create a database, instantiate a db.DB object, then call its method open with appropriate parameters, such as:

adb = db.DB()
adb.open('db_filename', dbtype=db.DB_HASH, flags=db.DB_CREATE)

db.DB_HASH is just one of several access methods you may choose when you create a database—a popular alternative is db.DB_BTREE, to use B+tree access (handy if you need to get records in sorted order). You may make an in-memory database, without an underlying file for persistence, by passing None instead of a filename as the first argument to the open method.
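For instance, to create a B+tree database instead, a sketch (everything else in this recipe stays the same):

btdb = db.DB()
btdb.open('bt_filename', dbtype=db.DB_BTREE, flags=db.DB_CREATE)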

Once you have an open instance of db.DB, you can add records, each composed of two strings, key and data:

for i, w in enumerate('some words for example'.split()):
    adb.put(w, str(i))

You can access records via a cursor on the database:

def irecords(curs):
    record = curs.first()
    while record:
        yield record
        record = curs.next()
for key, data in irecords(adb.cursor()):
    print 'key=%r, data=%r' % (key, data)
# emits (the order may vary):
# key='some', data='0'
# key='example', data='3'
# key='words', data='1'
# key='for', data='2'

When you’re done, you close the database:

adb.close()

At any future time, in the same or another Python program, you can reopen the database by giving just its filename as the argument to the open method of a newly created db.DB instance:

the_same_db = db.DB()
the_same_db.open('db_filename')

and work on it again in the same ways:

the_same_db.put('skidoo', '23')          # add a record
the_same_db.put('words', 'sweet')        # replace a record
for key, data in irecords(the_same_db.cursor()):
    print 'key=%r, data=%r' % (key, data)
# emits (the order may vary):
# key='some', data='0'
# key='example', data='3'
# key='words', data='sweet'
# key='for', data='2'
# key='skidoo', data='23'

Again, remember to close the database when you’re done:

the_same_db.close()

Discussion

The Berkeley DB is a popular open source database. It does not support SQL, but it’s simple to use, offers excellent performance, and gives you a lot of control over exactly what happens, if you care to exert it, through a huge array of options, flags, and methods. Berkeley DB is just as accessible from many other languages as from Python: for example, you can perform some changes or queries with a Python program, and others with a separate C program, on the same database file, using the same underlying open source library that you can freely download from Sleepycat.

The Python Standard Library shelve module can use the Berkeley DB as its underlying database engine, just as it uses cPickle for serialization. However, shelve does not let you take advantage of the ability to access a Berkeley DB database file from several different languages, exactly because the records are strings produced by pickle.dumps, and languages other than Python can’t easily deal with them. Accessing the Berkeley DB directly with bsddb also gives you access to many advanced functionalities of the database engine that shelve simply doesn’t support.

For example, creating a database with an access method of db.DB_HASH, as shown in the recipe, may give maximum performance, but, as you’ll have noticed when listing all records with the generator irecords that is also presented in the recipe, hashing puts records in apparently random, unpredictable order. If you need to access records in sorted order, you can use an access method of db.DB_BTREE instead. Berkeley DB also supports more advanced functionality, such as transactions, which you can enable through direct access but not via anydbm or shelve.

For detailed documentation about all functionality of the Python Standard Library bsddb package, see http://pybsddb.sourceforge.net/bsddb3.html. For documentation, downloads, and more of the Berkeley DB itself, see http://www.sleepycat.com/.

See Also

Library Reference and Python in a Nutshell docs for modules anydbm, shelve, and bsddb; http://pybsddb.sourceforge.net/bsddb3.html for many more details about bsddb and bsddb3; http://www.sleepycat.com/ for downloads of, and very detailed documentation on, the Berkeley DB itself.

7.9. Accessing a MySQL Database

Credit: Mark Nenadov

Problem

You need to access a MySQL database.

Solution

The MySQLdb module makes this task extremely easy:

import MySQLdb
# Create a connection object, then use it to create a cursor
con = MySQLdb.connect(host="127.0.0.1", port=3306,
                      user="joe", passwd="egf42", db="tst")
cursor = con.cursor()
# Execute an SQL string
sql = "SELECT * FROM Users"
cursor.execute(sql)
# Fetch all results from the cursor into a sequence and close the connection
results = cursor.fetchall()
con.close()

Discussion

MySQLdb is at http://sourceforge.net/projects/mysql-python. It is a plain and simple implementation of the Python DB API 2.0 that is suitable for Python 2.3 (and some older versions, too) and MySQL versions 3.22 to 4.0. MySQLdb, at the time of this writing, did not yet officially support Python 2.4. However, if you have the right C compiler installation to build Python extensions (as should be the case for all Linux, Mac OS X, and other Unix users, and many Windows developers), the current version of MySQLdb does in fact build from sources, install, and work just fine, with Python 2.4. A newer version of MySQLdb is in the works, with official support for Python 2.3 or later and MySQL 4.0 or later.

As with all other Python DB API implementations (once you have downloaded and installed the needed Python extension and have the database engine it needs up and running), you start by importing the module and calling the connect function with suitable parameters. The keyword parameters you can pass when calling connect depend on the database involved: host (defaulting to the local host), user, passwd (password), and db (name of the database) are typical. In the recipe, I explicitly pass the default local host’s IP address and the default MySQL port (3306), just to show that you can specify parameters explicitly even when you’re passing their default values (e.g., to make your source code clearer and more readable and maintainable).

The connect function returns a connection object, and you can proceed to call methods on this object; when you are done, call the close method. The method you most often call on a connection object is cursor, which returns a cursor object, which is what you use to send SQL commands to the database and fetch the commands’ results. The underlying MySQL database engine does not in fact support SQL cursors, but that’s no problem—the MySQLdb module emulates them on your behalf, quite transparently, for the limited cursor needs of the Python DB API 2.0. Of course, this doesn’t mean that you can use SQL phrases like WHERE CURRENT OF CURSOR with a database that does not offer cursors! Once you have a cursor object in hand, you can call methods on it. The recipe uses the execute method to execute an SQL statement, and then the fetchall method to obtain all results as a sequence of tuples—one tuple per row in the result. You can use many refinements, but these basic elements of the Python DB API’s functionality already suffice for many tasks.
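For instance, one such refinement is a parameterized query with fetchone (the table and column names here are hypothetical); you pass the parameters as a separate sequence, rather than formatting them into the SQL string yourself:

cursor.execute("SELECT name, email FROM Users WHERE id = %s", (42,))
row = cursor.fetchone()                   # a single tuple, or None if no match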

See Also

The Python-MySQL interface module (http://sourceforge.net/projects/mysql-python); the Python DB API (http://www.python.org/topics/database/DatabaseAPI-2.0.html); DB API documentation in Python in a Nutshell.

7.10. Storing a BLOB in a MySQL Database

Credit: Luther Blissett

Problem

You need to store a binary large object (BLOB) in a MySQL database.

Solution

The MySQLdb module does not support full-fledged placeholders, but you can make do with the module’s escape_string function:

import MySQLdb, cPickle
# Connect to a DB, e.g., the test DB on your localhost, and get a cursor
connection = MySQLdb.connect(db="test")
cursor = connection.cursor()
# Make a new table for experimentation
cursor.execute("CREATE TABLE justatest (name TEXT, ablob BLOB)")
try:
    # Prepare some BLOBs to insert in the table
    names = 'aramis', 'athos', 'porthos'
    data = {}
    for name in names:
        datum = list(name)
        datum.sort()
        data[name] = cPickle.dumps(datum, 2)
    # Perform the insertions
    sql = "INSERT INTO justatest VALUES(%s, %s)"
    for name in names:
        cursor.execute(sql, (name, MySQLdb.escape_string(data[name])) )
    # Recover the data so you can check back
    sql = "SELECT name, ablob FROM justatest ORDER BY name"
    cursor.execute(sql)
    for name, blob in cursor.fetchall( ):
        print name, cPickle.loads(blob), cPickle.loads(data[name])
finally:
    # Done. Remove the table and close the connection.
    cursor.execute("DROP TABLE justatest")
    connection.close( )

Discussion

MySQL supports binary data (BLOBs and variations thereof), but you should be careful when communicating such data via SQL. Specifically, when you use a normal INSERT SQL statement and need to have binary strings among the VALUES you’re inserting, you have to escape some characters in the binary string according to MySQL’s own rules. Fortunately, you don’t have to figure out those rules for yourself: MySQL supplies a function that does the needed escaping, and MySQLdb exposes it to your Python programs as the escape_string function.

This recipe shows a typical case: the BLOBs you’re inserting come from cPickle.dumps, so they may represent almost arbitrary Python objects (although, in this case, we’re just using them for a few lists of characters). The recipe is purely demonstrative and works by creating a table and dropping it at the end (using a try/finally statement to ensure that finalization is performed even if the program should terminate because of an uncaught exception). With recent versions of MySQL and MySQLdb, you don’t even need to call the escape_string function anymore, so you can change the relevant statement to the simpler:

cursor.execute(sql, (name, data[name]))

See Also

Recipe 7.11 and Recipe 7.12 for PostgreSQL-oriented and SQLite-oriented solutions to the same problem; the MySQL home page (http://www.mysql.org); the Python/MySQL interface module (http://sourceforge.net/projects/mysql-python).

7.11. Storing a BLOB in a PostgreSQL Database

Credit: Luther Blissett

Problem

You need to store a BLOB in a PostgreSQL database.

Solution

PostgreSQL 7.2 and later support large objects, and the psycopg module supplies a Binary escaping function:

import psycopg, cPickle
# Connect to a DB, e.g., the test DB on your localhost, and get a cursor
connection = psycopg.connect("dbname=test")
cursor = connection.cursor( )
# Make a new table for experimentation
cursor.execute("CREATE TABLE justatest (name TEXT, ablob BYTEA)")
try:
    # Prepare some BLOBs to insert in the table
    names = 'aramis', 'athos', 'porthos'
    data = {  }
    for name in names:
        datum = list(name)
        datum.sort( )
        data[name] = cPickle.dumps(datum, 2)
    # Perform the insertions
    sql = "INSERT INTO justatest VALUES(%s, %s)"
    for name in names:
        cursor.execute(sql, (name, psycopg.Binary(data[name])) )
    # Recover the data so you can check back
    sql = "SELECT name, ablob FROM justatest ORDER BY name"
    cursor.execute(sql)
    for name, blob in cursor.fetchall( ):
        print name, cPickle.loads(blob), cPickle.loads(data[name])
finally:
    # Done. Remove the table and close the connection.
    cursor.execute("DROP TABLE justatest")
    connection.close( )

Discussion

PostgreSQL supports binary data (BYTEA and variations thereof), but you should be careful when communicating such data via SQL. Specifically, when you use a normal INSERT SQL statement and need to have binary strings among the VALUES you’re inserting, you have to escape some characters in the binary string according to PostgreSQL’s own rules. Fortunately, you don’t have to figure out those rules for yourself: PostgreSQL supplies functions that do all the needed escaping, and psycopg exposes such a function to your Python programs as the Binary function. This recipe shows a typical case: the BYTEAs you’re inserting come from cPickle.dumps, so they may represent almost arbitrary Python objects (although, in this case, we’re just using them for a few lists of characters). The recipe is purely demonstrative and works by creating a table and dropping it at the end (using a try/finally statement to ensure finalization is performed even if the program should terminate because of an uncaught exception).

Earlier PostgreSQL releases limited the amount of data you could store in a normal field of the database to a few kilobytes. To store really large objects, you had to use roundabout techniques to load the data into the database (such as PostgreSQL’s nonstandard SQL function LO_IMPORT to load a data file as an object, which requires superuser privileges and data files that reside on the machine running the PostgreSQL server) and store a field of type OID in the table to be used later for indirect recovery of the data. Fortunately, none of these techniques are necessary anymore: since Release 7.1 (the current release at the time of this writing is 8.0), PostgreSQL embodies the results of project TOAST, which removed the limitations on field-storage size and therefore the need for such peculiar indirection. Module psycopg supplies the handy Binary function to let you escape any binary string of bytes into a form acceptable for placeholder substitution in INSERT and UPDATE SQL statements.

See Also

Recipe 7.10 and Recipe 7.12 for MySQL-oriented and SQLite-oriented solutions to the same problem; PostgreSQL’s home page (http://www.postgresql.org/); the Python/PostgreSQL module (http://initd.org/software/psycopg).

7.12. Storing a BLOB in a SQLite Database

Credit: John Barham

Problem

You need to store a BLOB in an SQLite database.

Solution

The PySQLite Python extension offers function sqlite.encode to let you insert binary strings in SQLite databases. You can also build a small adapter class based on that function:

import sqlite, cPickle
class Blob(object):
    ''' automatic converter for binary strings '''
    def __init__(self, s): self.s = s
    def _quote(self): return "'%s'" % sqlite.encode(self.s)
# make a test database in memory, get a cursor on it, and make a table
connection = sqlite.connect(':memory:')
cursor = connection.cursor( )
cursor.execute("CREATE TABLE justatest (name TEXT, ablob BLOB)")
# Prepare some BLOBs to insert in the table
names = 'aramis', 'athos', 'porthos'
data = {  }
for name in names:
    datum = list(name)
    datum.sort( )
    data[name] = cPickle.dumps(datum, 2)
# Perform the insertions
sql = 'INSERT INTO justatest VALUES(%s, %s)'
for name in names:
    cursor.execute(sql, (name, Blob(data[name])) )
# Recover the data so you can check back
sql = 'SELECT name, ablob FROM justatest ORDER BY name'
cursor.execute(sql)
for name, blob in cursor.fetchall( ):
    print name, cPickle.loads(blob), cPickle.loads(data[name])
# Done, close the connection (would be no big deal if you didn't, but...)
connection.close( )

Discussion

SQLite does not directly support binary data, but it still lets you declare such types for fields in a CREATE TABLE DDL statement. The PySQLite Python extension uses the declared types of fields to convert field values appropriately to Python values when you fetch data after an SQL SELECT from an SQLite database. However, you still need to be careful when communicating binary string data via SQL.

Specifically, when you use INSERT or UPDATE SQL statements, and need to have binary strings among the VALUES you’re passing, you need to escape some characters in the binary string according to SQLite’s own rules. Fortunately, you don’t have to figure out those rules for yourself: SQLite supplies the function to do the needed escaping, and PySQLite exposes that function to your Python programs as the sqlite.encode function. This recipe shows a typical case: the BLOBs you’re inserting come from cPickle.dumps, so they may represent almost arbitrary Python objects (although, in this case, we’re just using them for a few lists of characters). The recipe is purely demonstrative and works by creating a database in memory, so that the database is implicitly lost at the end of the script.

While you could perfectly well call sqlite.encode directly on your binary strings at the time you pass them as parameters to a cursor’s execute method, this recipe takes a slightly different tack, defining a Blob class to wrap binary strings and passing instances of that. When PySQLite receives, as a parameter to substitute into an SQL statement, an instance of a class that defines a method named _quote, it calls that method on the instance and expects the method to return a string fully ready for insertion into the SQL statement. When you use this approach for more complicated classes of your own, you’ll probably want to pass a decoders keyword argument to the connect method, to associate appropriate decoding functions with specific SQL types. By default, however, the BLOB SQL type is associated with the decoding function sqlite.decode, which is exactly the inverse of sqlite.encode; for the simple Blob class in this recipe, therefore, we do not need to specify any custom decoder, since the default one suits us perfectly well.

See Also

Recipe 7.10 and Recipe 7.11 for MySQL-oriented and PostgreSQL-oriented solutions to the same problem; SQLite’s home page (http://www.sqlite.org/); the PySQLite manual (http://pysqlite.sourceforge.net/manual.html); the SQLite FAQ (“Does SQLite support a BLOB type?”) at http://www.hwaci.com/sw/sqlite/faq.html#q12.

7.13. Generating a Dictionary Mapping Field Names to Column Numbers

Credit: Thomas T. Jenkins

Problem

You want to access data fetched from a DB API cursor object, but you want to access the columns by field name, not by number.

Solution

Accessing columns within a set of database-fetched rows by column index is not very readable, nor is it robust should columns ever get reordered in a rework of the database’s schema (a rare event, but it does occasionally happen). This recipe exploits the description attribute of Python DB API’s cursor objects to build a dictionary that maps column names to index values, so you can use cursor_row[field_dict[fieldname]] to get the value of a named column:

def fields(cursor):   
    """ Given a DB API 2.0 cursor object that has been executed, returns
    a dictionary that maps each field name to a column index, 0 and up. """
    results = {  }
    for column, desc in enumerate(cursor.description):
        results[desc[0]] = column
    return results

Discussion

When you get a set of rows from a call to any of a cursor’s various fetch... methods (fetchone, fetchmany, fetchall), it is often helpful to be able to access a specific column in a row by field name and not by column number. This recipe shows a function that takes a DB API 2.0 cursor object and returns a dictionary with column numbers keyed by field names.

Here’s a usage example (assuming you put this recipe’s code in a module that you call dbutils.py somewhere on your Python sys.path). You must start with conn being a connection object for any DB API 2-compliant Python module.

>>> c = conn.cursor( )
>>> c.execute('''select * from country_region_goal
...              where crg_region_code is null''')
>>> import pprint
>>> pp = pprint.pprint
>>> pp(c.description)
(('CRG_ID', 4, None, None, 10, 0, 0),
 ('CRG_PROGRAM_ID', 4, None, None, 10, 0, 1),
 ('CRG_FISCAL_YEAR', 12, None, None, 4, 0, 1),
 ('CRG_REGION_CODE', 12, None, None, 3, 0, 1),
 ('CRG_COUNTRY_CODE', 12, None, None, 2, 0, 1),
 ('CRG_GOAL_CODE', 12, None, None, 2, 0, 1),
 ('CRG_FUNDING_AMOUNT', 8, None, None, 15, 0, 1))
>>> import dbutils
>>> field_dict = dbutils.fields(c)
>>> pp(field_dict)
{'CRG_COUNTRY_CODE': 4,
 'CRG_FISCAL_YEAR': 2,
 'CRG_FUNDING_AMOUNT': 6,
 'CRG_GOAL_CODE': 5,
 'CRG_ID': 0,
 'CRG_PROGRAM_ID': 1,
 'CRG_REGION_CODE': 3}
>>> row = c.fetchone( )
>>> pp(row)
(45, 3, '2000', None, 'HR', '26', 48509.0)
>>> ctry_code = row[field_dict['CRG_COUNTRY_CODE']]
>>> print ctry_code
HR
>>> fund = row[field_dict['CRG_FUNDING_AMOUNT']]
>>> print fund
48509.0

If you still find accesses such as row[field_dict['CRG_COUNTRY_CODE']] inelegant, you may want to get fancier and wrap the row, as well as the dictionary of fields, into an object allowing more elegant access. A simple example might be:

class neater(object):
    def __init__(self, row, field_dict):
        self.r = row
        self.d = field_dict
    def __getattr__(self, name):
        try:
            return self.r[self.d[name]]
        except LookupError:
            raise AttributeError

If this neater class were also in your dbutils module, you could then continue the preceding interactive snippet with, for example:

>>> row = dbutils.neater(row, field_dict)
>>> print row.CRG_FUNDING_AMOUNT
48509.0

However, if you’re tempted by such fancier approaches, I suggest that, rather than rolling your own, you have a look at the dtuple module showcased in Recipe 7.14. Reusing good, solid, proven code is a much smarter approach than writing your own infrastructure.

See Also

Recipe 7.14 for a slicker and more elaborate approach to a very similar task, facilitated by reusing the third-party dtuple module.

7.14. Using dtuple for Flexible Access to Query Results

Credit: Steve Holden, Hamish Lawson, Kevin Jacobs

Problem

You want flexible access to sequences, such as the rows in a database query, by either name or column number.

Solution

Rather than coding your own solution, it’s often smarter to reuse a good existing one. For this recipe’s task, a good existing solution is packaged in Greg Stein’s dtuple module:

import dtuple
import mx.ODBC.Windows as odbc
flist = ["Name", "Num", "LinkText"]
descr = dtuple.TupleDescriptor([[n] for n in flist])
conn = odbc.connect("HoldenWebSQL")   # Connect to a database
curs = conn.cursor( )                  # Create a cursor
sql = """SELECT %s FROM StdPage
            WHERE PageSet='Std' AND Num<25
            ORDER BY PageSet, Num""" % ", ".join(flist)
print sql
curs.execute(sql)
rows = curs.fetchall( )
for row in rows:
    row = dtuple.DatabaseTuple(descr, row)
    print "Attribute: Name: %s Number: %d" % (row.Name, row.Num or 0)
    print "Subscript: Name: %s Number: %d" % (row[0], row[1] or 0)
    print "Mapping:   Name: %s Number: %d" % (row["Name"], row["Num"] or 0)
conn.close( )

Discussion

Novice Python programmers are sometimes deterred from using databases because query results are presented by DB API-compliant modules as a list of tuples. Since tuples can only be numerically subscripted, code that uses the query results becomes opaque and difficult to maintain. Greg Stein’s dtuple module, available from http://www.lyra.org/greg/python/dtuple.py, helps by defining two useful classes: TupleDescriptor and DatabaseTuple. To access an arbitrary SQL database, this recipe uses the ODBC protocol through the mxODBC module, http://www.egenix.com/files/python/mxODBC.html, but nothing relevant to the recipe’s task would change if any other standard DB API-compliant module was used instead.

The TupleDescriptor class creates a description of tuples from a list of sequences, the first element of each subsequence being a column name. It is often convenient to describe data with such sequences. For example, in an interactive forms-based application, each column name might be followed by validation parameters such as data type and allowable length. TupleDescriptor’s purpose is to allow the creation of DatabaseTuple objects. In this particular application, no other information about the columns is needed beyond the names, so the required list of sequences is a list of singleton lists (meaning lists that have just one element each), constructed from a list of field names using a list comprehension.
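
For instance, the singleton-list construction that the recipe builds with a list comprehension expands like this:

flist = ["Name", "Num", "LinkText"]
print [[n] for n in flist]
# emits: [['Name'], ['Num'], ['LinkText']]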

Created from TupleDescriptor and a tuple such as a database row, DatabaseTuple is an object whose elements can be accessed by numeric subscript (like a tuple) or column-name subscript (like a dictionary). If column names are legal Python names, you can also access the columns in your DatabaseTuple as attributes. A purist might object to this crossover between items and attributes, but it’s a highly pragmatic choice in this case. Python is nothing if not a highly pragmatic language, so I see nothing wrong with this convenience.

To demonstrate the utility of DatabaseTuple, the simple test program in this recipe creates a TupleDescriptor and uses it to convert each row retrieved from an SQL query into DatabaseTuple. Because the sample uses the same field list to build both TupleDescriptor and the SQL SELECT statement, it demonstrates how database code can be parameterized relatively easily.

Alternatively, if you wish to get all the fields (an SQL SELECT * query), and dynamically get the field names from the cursor, as previously described in Recipe 7.13, you can do so. Just remove variable flist, which you don’t need any more, and move the construction of variable descr to right after the call to the cursor’s execute method, as follows:

curs.execute(sql)
descr = dtuple.TupleDescriptor(curs.description)

The rest of the recipe can remain unchanged.

A more sophisticated approach, with functionality similar to dtuple’s and even better performance, is offered by the Python Database Row Module (also known as db_row) made freely available by the OPAL Group. For downloads and information, visit http://opensource.theopalgroup.com/.

Module pysqlite, which handles relational databases in memory or in files by wrapping the SQLite library, does not return real tuples from such methods as fetchall: rather, it returns instances of a convenience class that wraps tuple and also allows field access with attribute-access syntax, much like the approaches mentioned in this recipe.

See Also

Recipe 7.13 for a simpler, less functionally rich way to convert field names to column numbers; the dtuple module is at http://www.lyra.org/greg/python/dtuple.py; OPAL’s db_row is at http://opensource.theopalgroup.com/; SQLite, a fast, lightweight, embedded relational database (http://www.sqlite.org/), and its Python DB API interface module pysqlite (http://pysqlite.sourceforge.net/).

7.15. Pretty-Printing the Contents of Database Cursors

Credit: Steve Holden, Farhad Fouladi, Rosendo Martinez, David Berry, Kevin Ryan

Problem

You want to present a query’s result with appropriate column headers (and optionally widths), but you do not want to hard-code this information in your program. Indeed, you may not even know the column headers and widths at the time you’re writing the code.

Solution

Discovering the column headers and widths dynamically is the most flexible approach, and it gives you code that’s highly reusable over many such presentation tasks:

def pp(cursor, data=None, check_row_lengths=False):
    if not data:
        data = cursor.fetchall( )
    names = [  ]
    lengths = [  ]
    rules = [  ]
    for col, field_description in enumerate(cursor.description):
        field_name = field_description[0]
        names.append(field_name)
        field_length = field_description[2] or 12
        field_length = max(field_length, len(field_name))
        if check_row_lengths:
            # double-check field length, if it's unreliable
            data_length = max([ len(str(row[col])) for row in data ])
            field_length = max(field_length, data_length)
        lengths.append(field_length)
        rules.append('-' * field_length)
    format = " ".join(["%%-%ss" % l for l in lengths])
    result = [ format % tuple(names), format % tuple(rules) ]
    for row in data:
        result.append(format % tuple(row))
    return "
".join(result)

Discussion

Relational databases are often perceived as difficult to use. The Python DB API can make them much easier to use, but if your programs work with several different DB engines, it’s sometimes tedious to reconcile the implementation differences between the various modules, and, even more, between the engines they connect to. One of the problems of dealing with databases is presenting the result of a query when you may not know much about the data. This recipe uses the cursor’s description attribute to try to provide appropriate headings. The recipe optionally examines each output row to ensure that column widths are adequate.

In some cases, a cursor can yield a solid description of the data it returns, but not all database modules are kind enough to supply cursors that do so. The pretty printer function pp shown in this recipe’s Solution takes as its first argument a cursor, on which you have just executed a retrieval operation (generally the execute of an SQL SELECT statement). It also takes an optional argument for the returned data; to use the data for other purposes, retrieve the data from the cursor, typically with fetchall, and pass it in as pp’s data argument. The second optional argument tells the pretty printer to determine the column lengths from the data, rather than trusting the cursor’s description; checking the data for column lengths can be time-consuming, but is helpful with some RDBMS engines and DB API module combinations, where the widths given by the cursor’s description attribute can be inaccurate.

A simple test program shows the value of the second optional argument when a Microsoft Jet database is used through the mxODBC module:

import mx.ODBC.Windows as odbc
import dbcp # contains pp function
conn = odbc.connect("MyDSN")
curs = conn.cursor( )
curs.execute("""SELECT Name, LinkText, Pageset FROM StdPage
                ORDER BY PageSet, Name""")
rows = curs.fetchall( )
print "

Without rowlens:"
print dbcp.pp(curs, rows)
print "

With rowlens:"
print dbcp.pp(curs, rows, rowlens=1)
conn.close( )

In this case, the cursor’s description does not include column lengths. The first output shows that the default column length of 12 is too short. The second output corrects this by examining the data:

Without rowlens:
Name         LinkText     Pageset
------------ ------------ ------------
ERROR        ERROR: Cannot Locate Page None
home         Home None
consult      Consulting Activity Std
ffx          FactFaxer    Std
hardware     Hardware Platforms Std
python       Python       Std
rates        Rates        Std
technol      Technologies Std
wcb          WebCallback  Std
With rowlens:
Name         LinkText                  Pageset
------------ ------------------------- ------------
ERROR        ERROR: Cannot Locate Page None
home         Home                      None
consult      Consulting Activity       Std
ffx          FactFaxer                 Std
hardware     Hardware Platforms        Std
python       Python                    Std
rates        Rates                     Std
technol      Technologies              Std
wcb          WebCallback               Std

Module pysqlite, which handles relational databases in memory or in files by wrapping the SQLite library, is another example of a DB API module whose cursors’ descriptions do not contain reliable values for field lengths. Moreover, pysqlite does not return real tuples from such methods as fetchall: rather, it returns instances of a convenience class which wraps tuple and also allows field access with attribute-access syntax, much like the approaches presented in Recipe 7.14. To deal with such small variations from the DB API specifications, this recipe carefully uses tuple(row), not just row, as the right-hand operand of operator % in the statement result.append(format % tuple(row)). Python’s semantics specify that if the right-hand operand is not a tuple, then the left-hand (format string) operand may contain only one format specifier. This recipe uses a tuple as the right-hand operand because the whole point of the recipe is to build and use a format string with many format specifiers, one per field.
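
A tiny interactive example (with made-up values) shows Python’s rule at work:

>>> format = "%-6s %-6s"
>>> row = ['HR', 48509.0]        # a list, as some wrappers may return
>>> print format % tuple(row)
HR     48509.0
>>> print format % row
Traceback (most recent call last):
  ...
TypeError: not enough arguments for format string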

This recipe’s function is useful during testing, since it lets you easily verify that you are indeed retrieving what you expect from the database. The output is pretty enough to display ad hoc query outputs to users. The function currently makes no attempt to represent null values other than the None the DB API returns, though it could easily be modified to show a null string or some other significant value.

See Also

The mxODBC package, a DB API-compatible interface to ODBC (http://www.egenix.com/files/python/mxODBC.html); SQLite, a fast, lightweight embedded relational database (http://www.sqlite.org/), and its Python DB API interface module pysqlite (http://pysqlite.sourceforge.net/).

7.16. Using a Single Parameter-Passing Style Across Various DB API Modules

Credit: Denis S. Otkidach

Problem

You want to write Python code that runs under any DB API-compliant module, but such modules are allowed to use any of several different styles of parameter passing.

Solution

We need a set of supporting functions to convert SQL queries and parameters to any of the five possible parameter-passing styles:

class Param(object):
    ''' a class to wrap any single parameter '''
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return 'Param(%r)' % (self.value,)
def to_qmark(chunks):
    ''' prepare SQL query in '?' style '''
    query_parts = [  ]
    params = [  ]
    for chunk in chunks:
        if isinstance(chunk, Param):
            params.append(chunk.value)
            query_parts.append('?')
        else:
            query_parts.append(chunk)
    return ''.join(query_parts), params
def to_numeric(chunks):
    ''' prepare SQL query in ':1' style '''
    query_parts = [  ]
    params = [  ]
    for chunk in chunks:
        if isinstance(chunk, Param):
            params.append(chunk.value)
            query_parts.append(':%d' % len(params))
        else:
            query_parts.append(chunk)
    # DCOracle2 needs, specifically, a _tuple_ of parameters:
    return ''.join(query_parts), tuple(params)
def to_named(chunks):
    ''' prepare SQL query in ':name' style '''
    query_parts = [  ]
    params = {  }
    for chunk in chunks:
        if isinstance(chunk, Param):
            name = 'p%d' % len(params)
            params[name] = chunk.value
            query_parts.append(':%s' % name)
        else:
            query_parts.append(chunk)
    return ''.join(query_parts), params
def to_format(chunks):
    ''' prepare SQL query in '%s' style '''
    query_parts = [  ]
    params = [  ]
    for chunk in chunks:
        if isinstance(chunk, Param):
            params.append(chunk.value)
            query_parts.append('%s')
        else:
            query_parts.append(chunk.replace('%', '%%'))
    return ''.join(query_parts), params
def to_pyformat(chunks):
    ''' prepare SQL query in '%(name)s' style '''
    query_parts = [  ]
    params = {  }
    for chunk in chunks:
        if isinstance(chunk, Param):
            name = 'p%d' % len(params)
            params[name] = chunk.value
            query_parts.append('%%(%s)s' % name)
        else:
            query_parts.append(chunk.replace('%', '%%'))
    return ''.join(query_parts), params
converter = {  }
for paramstyle in ('qmark', 'numeric', 'named', 'format', 'pyformat'):
    converter[paramstyle] = globals()['to_' + paramstyle]
def execute(cursor, converter, chunked_query):
    query, params = converter(chunked_query)
    return cursor.execute(query, params)
if __name__ == '__main__':
    query = ('SELECT * FROM test WHERE field1>', Param(10),
             ' AND field2 LIKE ', Param('%value%'))
    print 'Query:', query
    for paramstyle in ('qmark', 'numeric', 'named', 'format', 'pyformat'):
        print '%s: %r' % (paramstyle, converter[paramstyle](query))

Discussion

The DB API specification is quite handy, but it has one most annoying problem: it allows compliant modules to use any of five parameter styles. So you cannot necessarily switch to another database just by changing the database module: if the parameter-passing styles of two such modules differ, you need to rewrite all SQL queries that use parameter substitution. Using this recipe, you can improve this situation a little. Pick the appropriate converter from the converter dictionary (indexing it with the paramstyle attribute of your current DB API module), write your queries as mixed chunks of SQL strings and instances of the provided Param class (as exemplified in the if __name__ == '__main__' part of the recipe), and execute your queries through the execute function in this recipe. Not a perfectly satisfactory solution, by any means, but way better than nothing!
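
For instance, here is how the pieces might fit together with one specific DB API module (using MySQLdb is just an assumption for this sketch; any compliant module exposes the paramstyle attribute used to index converter):

import MySQLdb
connection = MySQLdb.connect(db='test')
cursor = connection.cursor( )
chunked_query = ('SELECT * FROM test WHERE field1>', Param(10),
                 ' AND field2 LIKE ', Param('%value%'))
execute(cursor, converter[MySQLdb.paramstyle], chunked_query)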

See Also

The DB API docs at http://www.python.org/peps/pep-0249.html; the list of DB API-compliant modules at http://www.python.org/topics/database/modules.html.

7.17. Using Microsoft Jet via ADO

Credit: Souman Deb

Problem

You need to access a Microsoft Jet database via Microsoft’s ADO, for example from a Python-coded CGI script for the Apache web server.

Solution

The CGI script must live in Apache’s cgi-bin directory and can use the PyWin32 extensions to connect, via COM, to ADO and hence to Microsoft Jet. For example:

#!C:\Python23\python
print "Content-type:text/html\n\n"
import win32com.client
db = r'C:\Program Files\Microsoft Office\Office\Samples\Northwind.mdb'
MAX_ROWS=2155
def connect(query):
    con = win32com.client.Dispatch('ADODB.Connection')
    con.Open("Provider=Microsoft.Jet.OLEDB.4.0; Data Source="+db)
    result_set = con.Execute(query + ';')
    con.Close( )
    return result_set
def display(columns, MAX_ROWS):
    print "<table border=1>"
    print "<th>Order ID</th>"
    print "<th>Product</th>"
    print "<th>Unit Price</th>"
    print "<th>Quantity</th>"
    print "<th>Discount</th>"
    for k in range(MAX_ROWS):
        print "<tr>"
        for field in columns:
                print "<td>", field[k], "</td>"
        print "</tr>"
    print "</table>"
result_set = connect("select * from [Order details]")
columns = result_set[0].GetRows(MAX_ROWS)
display(columns, MAX_ROWS)
result_set[0].Close( )

Discussion

This recipe uses the “Northwind Database” example that Microsoft distributes with several of its products, such as Microsoft Access. To run this recipe, you need a machine running Microsoft Windows with working installations of other Microsoft add-ons such as OLEDB, ADO, and the Jet database driver, which is often (though not correctly) known as “the Access database”. (Microsoft Access is a product to build database frontend applications, and it can work with other database drivers, such as Microsoft SQL Server, not just with the freely distributable and downloadable Microsoft Jet database drivers.) Moreover, you need an installation of Mark Hammond’s PyWin32 package (formerly known as win32all); the Python distribution known as ActivePython, from ActiveState, comes with (among other things) PyWin32 already installed.

If you want to run this recipe specifically as an Apache CGI script, of course, you also need to install Apache and to place this script in the cgi-bin directory where Apache expects to find CGI scripts (the location of the cgi-bin directory depends on how you have installed Apache on your machine).

Make sure that the paths in the script are correct, depending on where, on your machine, you have installed the python.exe file you want to use, and the Northwind.mdb database you want to query. The paths indicated in the recipe correspond to default installations of Python 2.3 and the “Northwind” example database. If the script doesn’t work correctly, check the Apache error.log file, where you will find error messages that may help you find out what kind of error you’re dealing with.

To try the script, assuming that, for example, you have saved it as cgi-bin/adoexample.py and that your Apache server is running correctly, visit with any browser the URL http://localhost/cgi-bin/adoexample.py. One known limitation of the interface between Python and Jet databases with ADO is on fields of type currency: such fields are returned as some strange tuples, rather than as plain numbers. This recipe does not deal with that limitation.

See Also

Documentation for the Win32 API in PyWin32 (http://starship.python.net/crew/mhammond/win32/Downloads.html) or ActivePython (http://www.activestate.com/ActivePython/); Windows API documentation available from Microsoft (http://msdn.microsoft.com); Mark Hammond and Andy Robinson, Python Programming on Win32 (O’Reilly).

7.18. Accessing a JDBC Database from a Jython Servlet

Credit: Brian Zhou

Problem

You’re writing a servlet in Jython, and you need to connect to a database server (such as Oracle, Sybase, Microsoft SQL Server, or MySQL) via JDBC.

Solution

The technique is basically the same for any kind of database, give or take a couple of statements. Here’s the code for when your database is Oracle:

import java, javax
class emp(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        ''' a Servlet answers a Get query by writing to the response's
            output stream.  In this case we ignore the request, though
            in normal, non-toy cases that's where we get form input from.
        '''
        # we answer in plain text, so set the content type accordingly
        response.setContentType("text/plain")
        # get the output stream, use it for the query, then close it
        out = response.getOutputStream( )
        self.dbQuery(out)
        out.close( )
    def dbQuery(self, out):
        # connect to the Oracle driver, building an instance of it
        driver = "oracle.jdbc.driver.OracleDriver"
        java.lang.Class.forName(driver).newInstance( )
        # get a connection to the Oracle driver w/given user and password
        server, db = "server", "ORCL"
        url = "jdbc:oracle:thin:@" + server + ":" + db
        usr, passwd = "scott", "tiger"
        conn = java.sql.DriverManager.getConnection(url, usr, passwd)
        # send an SQL query to the connection
        query = "SELECT EMPNO, ENAME, JOB FROM EMP"
        stmt = conn.createStatement( )
        if stmt.execute(query):
            # get query results and print them to the out stream
            rs = stmt.getResultSet( )
            while rs and rs.next( ):
                out.println(rs.getString("EMPNO"))
                out.println(rs.getString("ENAME"))
                out.println(rs.getString("JOB"))
                out.println( )
        stmt.close( )
        conn.close( )

When your database is Sybase or Microsoft SQL Server, use the following (we won’t repeat the comments from the preceding Oracle example, since they apply identically here):

import java, javax
class titles(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        response.setContentType("text/plain")
        out = response.getOutputStream( )
        self.dbQuery(out)
        out.close( )
    def dbQuery(self, out):
        driver = "sun.jdbc.odbc.JdbcOdbcDriver"
        java.lang.Class.forName(driver).newInstance( )
        # Use "pubs" DB for mssql and "pubs2" for Sybase
        url = "jdbc:odbc:myDataSource"
        usr, passwd = "sa", "password"
        conn = java.sql.DriverManager.getConnection(url, usr, passwd)
        query = "select title, price, ytd_sales, pubdate from titles"
        stmt = conn.createStatement( )
        if stmt.execute(query):
            rs = stmt.getResultSet( )
            while rs and rs.next( ):
                out.println(rs.getString("title"))
                if rs.getObject("price"):
                    out.println("%2.2f" % rs.getFloat("price"))
                else:
                    out.println("null")
                if rs.getObject("ytd_sales"):
                    out.println(rs.getInt("ytd_sales"))
                else:
                    out.println("null")
                out.println(rs.getTimestamp("pubdate").toString( ))
                out.println( )
        stmt.close( )
        conn.close( )

And here’s the code for when your database is MySQL:

import java, javax
class goosebumps(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        response.setContentType("text/plain")
        out = response.getOutputStream( )
        self.dbQuery(out)
        out.close( )
    def dbQuery(self, out):
        driver = "org.gjt.mm.mysql.Driver"
        java.lang.Class.forName(driver).newInstance( )
        server, db = "server", "test"
        usr, passwd = "root", "password"
        url = "jdbc:mysql://%s/%s?user=%s&password=%s" % (
            server, db, usr, passwd)
        conn = java.sql.DriverManager.getConnection(url)
        query = "select country, monster from goosebumps"
        stmt = conn.createStatement( )
        if stmt.execute(query):
            rs = stmt.getResultSet( )
            while rs and rs.next( ):
                out.println(rs.getString("country"))
                out.println(rs.getString("monster"))
                out.println( )
        stmt.close( )
        conn.close( )

Discussion

You might want to use different JDBC drivers and URLs, but you can see that the basic technique is quite simple and straightforward. This recipe’s code uses a content type of text/plain because the recipe is about accessing the database, not about formatting the data you get from it. Obviously, you can change this content type to whichever is appropriate for your application.

In each case, the basic technique is first to instantiate the needed driver (whose package name, as a string, we place in variable driver) via the Java dynamic loading facility. The forName method of the java.lang.Class class loads and provides the relevant Java class, and that class’ newInstance method ensures that the driver we need is instantiated. Then, we can call the getConnection method of java.sql.DriverManager with the appropriate URL (or username and password, where needed) and thus obtain a connection object to place in the conn variable. From the connection object, we can create a statement object with the createStatement method and use it to execute a query that we have in the query string variable with the execute method. If the query succeeds, we can obtain the results with the getResultSet method. Finally, Oracle and MySQL allow easy sequential navigation of the result set to present all results, while Sybase and Microsoft SQL Server need a bit more care. Overall, the procedure is similar in all cases.

See Also

The Jython site (http://www.jython.org); JDBC’s home page (http://java.sun.com/products/jdbc).

7.19. Using ODBC to Get Excel Data with Jython

Credit: Zabil CM

Problem

Your Jython script needs to extract data from a Microsoft Excel file.

Solution

Jython, just like Java, can access ODBC through the JDBC-ODBC Bridge, and Microsoft Excel can in turn be queried via ODBC:

from java import lang, sql
lang.Class.forName('sun.jdbc.odbc.JdbcOdbcDriver')
excel_file = 'values.xls'
connection = sql.DriverManager.getConnection(
    'jdbc:odbc:Driver={Microsoft Excel Driver (*.xls)};DBQ=%s;READONLY=true' %
    excel_file, '', '')
# Sheet1 is the name of the worksheet we want.  The field names for the
# query are implicitly set by the values for each column in the first row.
record_set = connection.createStatement( ).executeQuery(
             'SELECT * FROM [Sheet1$]')
# print the first-column field of every record (==row)
while record_set.next( ):
    print record_set.getString(1)
# we're done, close the connection and recordset
record_set.close( )
connection.close( )

Discussion

This recipe is most easily used on Microsoft Windows, where installing and configuring ODBC, and the Microsoft Excel ODBC driver in particular, is best supported. However, with suitable commercial products, you can equally well use the recipe on an Apple Macintosh or just about any other Unix version on the planet.

Using ODBC rather than alternate ways to access Microsoft Excel has one substantial advantage that is not displayed in this recipe: with ODBC, you can use a broad subset of SQL. For example, you can easily extract a subset of a worksheet’s rows by adding a WHERE clause, such as:

SELECT * FROM [Sheet1$] WHERE DEPARTMENT=9

Since all of the selection logic can be easily expressed in the SQL string you pass to the executeQuery method, this approach lends itself particularly well to being encapsulated in a simple reusable function.
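
Such an encapsulation might look like the following sketch (the function name query_excel and the choice to return only the first column of each row are arbitrary simplifications for illustration):

from java import lang, sql

def query_excel(excel_file, sql_query):
    ''' run any SQL query against an Excel file via the JDBC-ODBC
        Bridge, returning the first column of each result row '''
    lang.Class.forName('sun.jdbc.odbc.JdbcOdbcDriver')
    connection = sql.DriverManager.getConnection(
        'jdbc:odbc:Driver={Microsoft Excel Driver (*.xls)};DBQ=%s;READONLY=true'
        % excel_file, '', '')
    record_set = connection.createStatement( ).executeQuery(sql_query)
    results = [  ]
    while record_set.next( ):
        results.append(record_set.getString(1))
    record_set.close( )
    connection.close( )
    return results

rows = query_excel('values.xls', 'SELECT * FROM [Sheet1$] WHERE DEPARTMENT=9')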

If you’re coding in Classic Python (CPython) rather than Jython, you can’t use JDBC, but you can use ODBC directly (typically in the DB API-compliant way supported by mxODBC, http://www.egenix.com/files/python/mxODBC.html) to perform this recipe’s task in a similar way.
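
A CPython equivalent through mxODBC might look like this sketch (the exact connection string is an assumption, following standard ODBC conventions; adjust the driver name and file path for your installation):

import mx.ODBC.Windows as odbc
conn = odbc.DriverConnect(
    'Driver={Microsoft Excel Driver (*.xls)};DBQ=values.xls;READONLY=true')
curs = conn.cursor( )
curs.execute('SELECT * FROM [Sheet1$]')
for row in curs.fetchall( ):
    print row[0]
conn.close( )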

See Also

The Jython site (http://www.jython.org); JDBC’s home page (http://java.sun.com/products/jdbc); Recipe 12.7, for another way to access Excel data (by parsing the XML file that Excel can be asked to output).



[1] E.F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, 13, no. 6 (1970), pp. 377-87, http://www.acm.org/classics/nov95/toc.html.

[2] Microsoft Jet is commonly but erroneously known as the “Microsoft Access database.” Access is a product that Microsoft sells for designing and implementing database frontends; Jet is a backend that you may download for free from Microsoft’s web site.
