Just about every large enterprise system uses a database for storing data. For example, Amazon.com, the online retailer, needs a database to store information on each product for sale. For Python to prove capable of handling these types of enterprise applications, the language must be able to access databases.
Luckily, Python provides a database API (Application Programming Interface — how you program for the database), which is a generic API that enables you to access most databases, in spite of the databases' different native APIs. The database, or DB, API doesn't define all aspects of working with databases, so some minor differences exist. For the most part, though, you can access databases such as Oracle or MySQL from your Python scripts without worrying too much about the details of the specific databases.
Having a generic database API is very useful because you may need to switch databases or have your application work with multiple databases, and you won't want to recode major parts of your program to allow this. Normally, you can do all of this in Python without a lot of programming changes.
Even if you aren't writing the next amazon.com online site, databases provide a convenient means to persist data for longer than the program is running (so that you don't lose the data that a user has entered if you want to restart your program), query for items, and modify your data in a safe manner.
This chapter covers the two main database systems supported by Python: dbm persistent dictionaries and relational databases with the DB API. In addition, this chapter describes how to set up a database, in case you don't have one handy.
In this chapter you learn:
Using the dbm libraries to create persistent dictionaries
About relational databases
Setting up the Sqlite database
Setting up the MySQL database
Working with the Python DB API
Creating connections
Accessing data with cursors
Connecting to databases
Querying and modifying data
Working with transactions
Handling errors
Using other database tools
In many cases, you don't require a full-blown relational database. In such cases, creating a persistent dictionary using dbm files is enough.
A persistent dictionary acts exactly like you'd expect. You can store name/value pairs in the dictionary, which are saved to a disk, and so their data will endure between various times that your program is run. So if you save data to a dictionary that's backed by a dbm, the next time you start your program you can read the value stored under a given key again, once you've loaded the dbm file. These dictionaries work like normal Python dictionaries, which are covered in Chapter 3. The main difference is that the data is written to and read from disk.
An additional difference is that the keys and the values must both be strings.
DBM, short for database manager, acts as a generic name for a number of C language libraries originally created on UNIX systems. These libraries sport names such as dbm, gdbm, ndbm, sdbm, and so on. These names correspond closely to the available modules in Python that provide the requisite functionality.
Python supports a number of dbm modules. Each dbm module supports a similar interface and uses a particular C library to store the data to disk. The main difference lies in the underlying binary format of the data files on disk. Each dbm module, unfortunately, creates incompatible files. That is, if you create a dbm persistent dictionary with one dbm module, you must use the same module to read the data. None of the other modules will work with that data file.
The following table lists the dbm modules.
Module | Description |
---|---|
dbm | Chooses the best dbm module |
dbm.dumb | Uses a simple, but portable, implementation of the dbm library |
dbm.gnu | Uses the GNU dbm library |
All of these libraries exist because of the history of the dbm library. Originally, this library was available only on commercial versions of UNIX. Free versions of UNIX, and later Linux, Windows, and so on, could not use the dbm library. This led to the creation of alternative libraries, such as the Berkeley UNIX library and the GNU gdbm library.
With all the incompatible file formats, this plethora of libraries can be a real pain. The dbm module, though, offers a handy alternative to choosing a specific dbm module. With the dbm module, you can let it choose for you. In general, the dbm module will choose the best implementation available on your system when creating a new persistent dictionary. When reading a file, the dbm module uses the function whichdb
to make an informed guess as to which library created the data file.
Unless you need a specific advanced feature of one of the dbm libraries, use the dbm module.
All of the dbm modules support an open
function to create a new dbm
object. Once opened, you can store data in the dictionary, read data, close the dbm
object (and the associated data file or files), remove items, and test for the existence of a key in the dictionary.
To open a dbm persistent dictionary, use the open
function on the module you choose. For example, you can create a persistent dictionary with the dbm module.
With the dbm modules, you can treat the object you get back from the open
function as a dictionary object. Get and set values using code like the following:
db['key'] = 'value' value = db['key']
Remember that the key and the value must both be text strings.
You can delete a value in the dictionary using del
:
del db['key
']
The keys
method returns a list of all the keys, in the same way it would with a normal dictionary:
for key in db.keys(): # do something...
The keys
method may take a long time to execute if there are a huge number of keys in the file. In addition, this method may require a lot of memory to store the potentially large list that it would create with a large file.
You can use the following script as a guide for how to program with dbm persistent dictionaries.
The dbm modules work when your data needs can be stored as key/value pairs. You can store more complicated data within key/value pairs with some imagination — for instance, by creating formatted strings that use a comma or some other character to delimit items in the strings, both on the key and the value part of the dictionary. This can be useful, but it can also be very difficult to maintain, and it can restrict you because your data is stored in an inflexible manner. Another way that you can be limited is technical: Note that some dbm libraries limit the amount of space you can use for the values (sometimes to a maximum of 1024 bytes, which is very, very little).
You can use the following guidelines to help determine which of these two types of data storage is appropriate for your needs:
If your data needs are simple, use a dbm persistent dictionary.
If you plan to store only a small amount of data, use a dbm persistent dictionary.
If you require support for transactions, use a relational database. (Transactions are when more than one thing happens at once — they let you keep your data from getting changed in one place but not in another; you get to define what happens concurrently with transactions.)
If you require complex data structures or multiple tables of linked data, use a relational database.
If you need to interface to an existing system, use that system, obviously. Chances are good this type of system will be a relational database.
Unlike the simple dbm modules, relational databases provide a far richer and more complex API.
On a side note, I should mention that there is a third type of database you can work with, though it is beyond the scope of this chapter. That third type is known as an ORM or object-relational database, and it allows for the conversion of data between type systems that are incompatible in relational databases.
Python has several options available if you wish to work with an ORM, such as SQL Object, SQLAlchemy, and even the Django ORM. For more information, visit the Python Wiki at http://wiki.python.org/moin/HigherLevelDatabaseProgramming
.
Relational databases have been around for decades so they are a mature and well-known technology. People who work with relational databases know what they are supposed to do, and how they are supposed to work, so relational databases are the technology of choice for complex data storage.
In a relational database, data is stored in tables that can be viewed as two-dimensional data structures. The columns, or vertical part of the two-dimensional matrix, are all of the same type of data; like strings, numbers, dates, and so on. Each horizontal component of the table is made up of rows, also called records. Each row in turn is made up of columns. Typically, each record holds the information pertaining to one item, such as an audio CD, a person, a purchase order, an automobile, and so on.
For example, the following shows a simple employee table.
Empid | firstname | lastname | department | manager | phone |
---|---|---|---|---|---|
105 | Peter | Tosh | 2 | 45 | 555-5555 |
201 | Bob | Marley | 1 | 36 | 555-5551 |
This table holds six columns:
empid: Holds the employee ID number. Relational databases make extensive use of ID numbers where the database manages the assignment of unique numbers so that each row can be referenced with these numbers to make each row unique (even if they have identical data). We can then refer to each employee by the ID number. The ID alone provides enough information to look up the employee.
firstname: Holds the person's first name.
lastname: Holds the person's last name.
department: Holds the ID of the department in which the employee works. This would likely be a numeric ID of the department, where departments are defined in a separate table that has a unique ID for each department.
manager: Holds the employee ID of the manager of the given employee. This is sort of self-referential, because in this example, a manager is actually an employee.
phone: Holds the office phone number.
In real life, a company would likely store a lot more information about an employee, such as a taxation authority identification number (Social Security number in the U.S.), home address, and more, but not anything that's really different in principle to what you've already seen.
In this example, the column empid
, the employee ID, would be used as the primary key. A primary key is a unique index for a table, where each element has to be unique because the database will use that element as the key to the given row and as the way to refer to the data in that row, in a manner similar to dictionary keys and values in Python. So, each employee needs to have a unique ID number, and once you have an ID number, you can look up any employee. So, the empid
will act as the key into this table's contents.
The department
column holds an ID of a department — that is, an ID of a row in another table. This ID could be considered a foreign key, because the ID acts as a key into another table. (In databases, a foreign key has a much more strict definition, so it's okay to think of it this way.)
For example, the following table shows a possible layout for the department table.
department id | name | manager |
---|---|---|
1 | Development | 47 |
2 | QA | 32 |
In these examples, the employee Peter Tosh works for department 2, the QA, or Quality Assurance, department in a dynamic world-class high-quality software development firm. Bob Marley works for department 1, the Development department.
In a large enterprise, there may be hundreds of tables in the database, with thousands or even millions of records in some tables.
The Structured Query Language, or SQL, defines a standard language for querying and modifying databases.
You can pronounce SQL as "sequel" or "s-q-l."
SQL supports the basic operations listed in the following table.
Operation | Usage |
---|---|
Select | Perform a query to search the database for specific data. |
Update | Modify a row or rows, usually based on a certain condition. |
Insert | Create new rows in the database. |
Delete | Remove a row or rows from the database. |
In general, these basic operations are called QUID, short for Query, Update, Insert, and Delete, or CRUD, short for Create, Read, Update, and Delete. SQL offers more than these basic operations, but for the most part, these are the majority of what you're going to use to write applications.
If you are not familiar with SQL, look at a SQL book or search on the Internet. You will find a huge amount of tutorial material. You can also look at the website for this book for more references to SQL resources.
SQL is important because when you access databases with the Python DB API, you must first create SQL statements and then execute these statements by having the database evaluate them. You then retrieve the results and use them. Thus, you will find yourself in the awkward position of using one language, Python, to create commands in another language, SQL.
The basic SQL syntax for the CRUD operations follows:
SELECT columns FROM tables WHERE condition ORDER BY columns ascending_or_ descending UPDATE table SET new values WHERE condition INSERT INTO table (columns) VALUES (values) DELETE FROM table WHERE condition
In addition to this basic look at the available syntax, many more parameters and specifiers for each operation are optional. You can still use them with Python's DB API if you're familiar with SQL.
To insert a new row in the employee table, using the previous employee example, you can use a SQL query like the following (even though it's adding data and not getting data, the convention is that all SQL commands or statements can also be called queries):
insert into employee (empid, firstname, lastname, manager, dept, phone) values (3, 'Bunny', 'Wailer', 2, 2, '555-5553')
In this example, the first tuple (it's useful to think of these in Python terms, even though SQL will give these different names) holds the names of the columns in the order you are using for inserting your data. The second tuple, after the keyword values
, holds the data items in the same order. Notice how SQL uses single quotes to delimit strings, and no quotes around numbers. (The phone number is different — it's actually a string because it has to be able to contain non-numbers, like dashes, periods, and plus signs, depending on how the data is entered.)
With queries, you can use shortcuts such as *
to say that you want an operation to be performed using all of the columns in a table. For example, to query all of the rows in the department table, showing all of the columns for each row, you can use a query like the following:
select * from department
Note that SQL is not case-sensitive for its keywords, such as
SELECT
andFROM
. But, some databases require table and column names to be all uppercase. It is common, therefore, to see people useSELECT
andFROM
and other operations in all capital letters to make them easily distinguished from other parts of the query.
This SQL statement omits the names of the columns to read and any conditions that would otherwise narrow down the data that would be returned. Thus the query will return all of the columns (from the *
) and all of the rows (because there is no where
clause).
You can perform a join with the select
command, to query data from more than one table, but present it all in a single response. It's called a join because the data from both tables will be returned as though it was queried from a single table. For example, to extract the department name with each employee, you could perform a query like the following (all of which would need to be in one string to be a single query):
select employee.firstname, employee.lastname, department.name from employee, department where employee.dept = department.departmentid order by lastname desc
In this example, the select
statement requests two columns from the employee table (the firstname
and the lastname
, but these are specified as coming from employee
by the convention of specifying the table name and the column name in the table) and one from the department table (department.name
). The order by
section of the statement tells the database to order the results by the value in the lastname
column, in descending order.
To simplify these queries, you can use aliases for the table names, which make them easier to type and to read (but don't change the logic or the syntax of your queries). For example, to use the alias e with the employee table, you can start a query as follows:
select e.firstname, e.lastname from employee e ...
In this case, you must place the alias, e, after the table name in the from
clause. You can also use the following format with the optional keyword as
, which could be easier for you to read:
select e.firstname, e.lastname from employee as e ...
To modify (or update) a row, use a SQL statement like the following:
update employee set manager=55 where empid=3
This example modifies the employee with an ID of 3 by setting that employee's manager to the employee with an ID of 55. As with other queries, numbers don't need to have quotes around them; however, strings would need to be quoted with single quotes.
To delete a row, use a SQL statement like the following:
delete employee where empid=42
This example deletes the employee with an ID of 42 but doesn't affect anything else in the database.
When you first set up a database, you need to define the tables and the relations between them. To do this, you use the part of the SQL called the DDL, or Data Definition Language. (It defines the structure of your tables — get it?) DDL basics are pretty simple, where you use one operation to create tables, and another one to remove them:
CREATE TABLE tablename (column, type column type, ... ) DROP TABLE tablename
There is also an ALTER TABLE
command to modify an existing table, but you won't need to do that for now. When you want to use this, a dedicated SQL book or web page will have more about this command.
Unfortunately, SQL is not an entirely standard language, and there are parts of it that each database handles differently. The DDL remains a part of SQL that has not been standardized. Thus, when defining tables you will find differences between the SQL dialects supported by the different databases, though the basics concepts are the same.
In most cases when you're the programmer, you will already have a database that's up and running, perhaps even a database chosen by some other organization that you're going to have to use. For example, if you host your website with a website hosting company that provides bells and whistles, like a database, your hosting package may include access to the MySQL database. If you work for a large organization, your IT department may have already standardized on a particular database such as Oracle, DB/2, Sybase, or Informix. These latter packages are likely present in your workplace if you create enterprise applications with Python.
If you have no database at all, yet still want to work on the examples in this chapter, a good starting database is Sqlite. The main virtues of Sqlite are that it comes installed with Python, and it is simple and small, but functional. This makes it a great candidate for experimentation while you're learning, even if you have another database available to you. Just keep in mind that each database has its own quirks.
The examples in this chapter were written to work with Sqlite so that you can follow them without any external infrastructure being needed. You can easily modify these examples, though, to work with a different database. That's one of the great aspects of the Python DB API.
Using Sqlite is as simple as importing the module. The following example shows you all you need to create a database.
If you are working with another database, such as SQL Server, chances are good that a database has already been created. If not, follow the instructions from your database vendor. (A lot of the time, you can get help on tasks like this from your Database Administrator, or DBA, who would really rather have you working on a test database instead of on a production database.)
With Sqlite, creating a database is rather easy.
First, some history about Python and relational databases. Python's support for relational databases started out with ad hoc solutions, with one solution written to interface with each particular database, such as Oracle. Each database module created its own API, which was highly specific to that database because each database vendor evolved its own API based on its own needs. This is hard to support, because coding for one database and trying to move it to the other gives a programmer severe heartburn, as everything needs to be completely rewritten and retested.
Over the years, though, Python has matured to support a common database, or DB, API, that's called the DB API. Specific modules enable your Python scripts to communicate with different databases, such as DB/2, PostgreSQL, and so on. All of these modules, however, support the common API, making your job a lot easier when you write scripts to access databases. This section covers this common DB API.
The DB API provides a minimal standard for working with databases, using Python structures and syntax wherever possible. This API includes the following:
Connections, which cover guidelines for how to connect to databases
Executing statements and stored procedures to query, update, insert, and delete data with cursors
Transactions, with support for committing or rolling back a transaction
Examining metadata on the database module as well as on database and table structure
Defining the types of errors
The following sections take you step by step through the Python database APIs.
You must download a separate DB API module for each database you need to access. For example, if you need to access an Oracle database as well as a MySQL database, you must download both the Oracle and the MySQL database modules.
See
http://wiki.python.org/moin/DatabaseInterfaces
for a listing of databases.
Modules exist for most major databases with the notable exception of Microsoft's SQL Server. You can access SQL Server using an ODBC module, though. In fact, the mxODBC module can communicate with most databases using ODBC on Windows or an ODBC bridge on UNIX (including Mac OS X) or Linux. If you need to do this, you can search for more information on these terms online to find out how other people are doing it.
Download the modules you need. Follow the instructions that come with the modules to install them.
You may need a C compiler and build environment to install some of the database modules. If you do, this will be described in the module's own documentation, which you'll need to read.
For some databases, such as Oracle, you can choose among a number of slightly different modules. You should choose the module that seems to best fit your needs or go to the website for this book and ask the authors for any recommendations if you're not sure.
Once you have verified that the necessary modules are installed, you can start working with Connections.
A Connection
object provides the means to communicate from your script to a database program. Note the major assumption here that the database is running in a separate process (or processes). The Python database modules connect to the database. They do not include the database application itself.
Each database module needs to provide a connect function that returns a connection
object. The parameters that are passed to connect vary by the module and what is required to communicate with the database. The following table lists the most common parameters.
Parameter | Usage |
---|---|
Dsn | Data source name, from ODBC terminology. This usually includes the name of your database and the server where it's running. |
Host | Host, or network system name, on which the database runs. |
Database | Name of the database. |
User | User name for connecting to the database. |
Password | Password for the given user name. |
For example, you can use the following code as a guide:
conn = dbmodule
.connect(dsn='localhost:MYDB',user='tiger',password='scott')
Use your database module documentation to determine which parameters are needed.
With a Connection
object, you can work with transactions, covered later in this chapter; close the connection to free system resources, especially on the database; and get a cursor.
A cursor
is a Python object that enables you to work with the database. In database terms, the cursor is positioned at a particular location within a table or tables in the database, sort of like the cursor on your screen when you're editing a document, which is positioned at a pixel location.
To get a cursor, you need to call the cursor
method on the connection
object:
cursor = conn.cursor()
Once you have a cursor, you can perform operations on the database, such as inserting records.
Each connection
, while it is engaged in an action, manages a transaction. With SQL, data is not modified unless you commit a transaction. The database then guarantees that it will perform all of the modifications in the transaction or none. Thus, you will not leave your database in an uncertain and potentially erroneous state.
To commit a transaction, call the commit
method of a connection:
conn.commit()
Note that the transaction methods are part of the connection
class, not the cursor
class.
If something goes wrong, like an exception is thrown that you can handle, you should call the rollback
method to undo the effects of the incomplete transaction; this will restore the database to the state it was in before you started the transaction, guaranteed:
conn.rollback()
The capability to roll back a transaction is very important, because you can handle errors by ensuring that the database does not get changed. In addition, rollbacks are very useful for testing. You can insert, modify, and delete a number of rows as part of a unit test and then roll back the transaction to undo the effects of all the changes. This enables your unit tests to run without making any permanent changes to the database. It also enables your unit tests to be run repeatedly, because each run resets the data.
See Chapter 12 for more on testing.
The DB API defines several globals that need to be defined at the module level. You can use these globals to determine information about the database module and the features it supports. The following table lists these globals.
Global | Holds |
---|---|
Apilevel | Should hold '2.0' for the DB API 2.0, or '1.0' for the 1.0 API. |
Paramstyle | Defines how you can indicate the placeholders for dynamic data in your SQL statements. The values include the following:
|
In addition, remember that pydoc is your friend. You can use pydoc to display information on modules, such as the database modules.
With a cursor
object, you can check the definition attribute to see information about the data returned. This information should be a set of seven-element sequences, one for each column of result data. These sequences include the following items:
(name, type_code, display_size, internal_size, precision, scale, null_ok
None
can be used for all but the first two items, as shown in this example:
(('FIRSTNAME', None, None, None, None, None, None), ('LASTNAME', None, None, None, None, None, None), ('NAME', None, None, None, None, None, None))
Errors happen. With databases, errors happen a lot. The DB API defines a number of errors that must exist in each database module. The following table lists these exceptions.
Exception | Usage |
---|---|
Warning | Used for non-fatal issues. Must subclass StandardError. |
Error | Base class for errors. Must subclass StandardError. |
InterfaceError | Used for errors in the database module, not the database itself. Must subclass Error. |
DatabaseError | Used for errors in the database. Must subclass Error. |
DataError | Subclass of DatabaseError that refers to errors in the data. |
OperationalError | Subclass of DatabaseError that refers to errors such as the loss of a connection to the database. These errors are generally outside of the control of the Python scripter. |
IntegrityError | Subclass of DatabaseError for situations that would damage the relational integrity, such as uniqueness constraints or foreign keys. |
InternalError | Subclass of DatabaseError that refers to errors internal to the database module, such as a cursor no longer being active. |
ProgrammingError | Subclass of DatabaseError that refers to errors such as a bad table name and other things that can safely be blamed on you. |
NotSupportedError | Subclass of DatabaseError that refers to trying to call unsupported functionality. |
Your Python scripts should handle these errors. You can get more information about them by reading the DB API specification. See www.python.org/topics/database/
and www.python.org/peps/pep-0249.html
for more information.
Databases provide a handy means for storing data. You can write Python scripts that can access all the popular databases using add-on modules. This chapter provided a whirlwind tour of SQL, the Structured Query Language, and covered Python's database APIs.
You also learned about the dbm modules that enable you to persist a dictionary using a variety of dbm libraries. These modules enable you to use dictionaries and transparently persist the data.
In addition, this chapter covered the Python database APIs, which define a standard set of methods and functions that you should expect from all database modules. This includes the following:
A connection
object encapsulates a connection to the database. Use the connect
function on the database module to get a new connection. The parameters you pass to the connect
function may differ for each module.
A cursor provides the main object for interacting with a database. Use the connection
object to get a cursor. The cursor enables you to execute SQL statements.
You can pass dynamic data as a tuple of values to the cursor execute
method. These values are placed into your SQL statements, enabling you to create reusable SQL statements.
After performing a query operation, the cursor
object holds the data. Use the fetchone
or fetchall
methods to extract the data.
After modifying the database, call commit
on the connection to commit the transaction and save the changes. Use the rollback
method to undo the changes.
Call close
on each cursor when done. Call close
on the connection when done.
The DB APIs include a defined set of exceptions. Your Python scripts should check for these exceptions to handle the variety of problems that may arise.
Chapter 15 covers XML, HTML, and XSL style sheets, technologies frequently used for web development.
Suppose you need to write a Python script to store the pizza preferences for the workers in your department. You need to store each person's name along with that person's favorite pizza toppings. Which technologies are most appropriate to implement this script?
Set up a relational database such as MySQL or Sqlite.
Use a dbm module such as dbm.
Implement a web-service-backed rich web application to create a buzzword-compliant application.
Rewrite the following example query using table name aliases:
select employee.firstname, employee.lastname, department.name from employee, department where employee.dept = department.departmentid order by employee.lastname desc
The terminate.py
script, shown previously, removes an employee row from the employee table; but this script is not complete. There remains a row in the user table for the same person. Modify the terminate.py
script to delete both the employee and the user table rows for that user.
18.191.150.231