First Steps

A database schema revolves around tables, so the first step in creating a schema is to determine the tables that are needed and then determine how they relate to each other. To do this, you work forward from the requirements, data modeling, and DTD design work you've already completed. Let's return to our simple e-mail application. As in all stages of application development, we must review the requirements (we wrote these in Chapter 3; refer to that chapter if you need to review the process of gathering requirements):

  • A user must be able to write an e-mail message.

  • A user must be able to send e-mail to another user.

  • A user must be able to read e-mail sent to him from another user.

  • An e-mail will consist of a subject line and a message body.

And because we have a data model diagram (from Chapter 4), let's review that as well (see Figure 6-1).

Figure 6-1. Data modeling diagram for e-mail system


Finally, we have to consider the DTD we built in Chapter 5. Refer to Appendix A for the complete CyberCinema DTD.

Looking at our data model diagram in Figure 6-1, we're going to need, at the bare minimum, a user table and an e-mail message table: one table for each of the data entities represented in the diagram. Because the DTD we built in Chapter 5 revolves around the e-mail messages, the XML instances (the messages themselves) are stored in the e-mail message table.

XML Storage

I tend to store things like XML and images in the database itself. Another approach is to store so-called “large objects” like these on the file system and reference them from the database; that is, instead of including the actual XML in the database, include a path name to the location of the XML file in the file system.

I don't recommend this second approach to XML storage for a number of reasons:

  1. It isn't as scalable as storing the XML in the database. To access an XML file under the file system reference solution, you first have to make a database call and then actually go to the file system to get the file. This operation can be quite expensive, and you haven't even started parsing your XML file yet.

  2. The file system reference solution doesn't lend itself well to replication. Creating a “replicant” database (for disaster recovery purposes, for instance) is more difficult.

  3. You can't write XML logic into the database. It's sometimes desirable to add an XML processing layer to the database (for instance, a trigger that performs XML decomposition, which we discuss in the section Enter Partial Decomposition later in this chapter). When your XML lives outside the database, such a layer either isn't an option or is more difficult to build and less scalable.

  4. Backups are more reliable when the XML is part of the database. Your backup strategy is simpler and more foolproof if all your data is in the same place, instead of having some data located in one place (the database) and the rest located somewhere else (the file system). Storing XML in the database means your backups are more likely to work, and when you have to recover your database from backups, one less procedure is required.

The database schema designs presented in this book assume large object storage in the database. If, for whatever misguided reason, you decide to put your large objects on the file system, the principles I'm going to present are still quite valid. Just don't come crying to me.


So what does the database schema look like? As mentioned previously, we need a table for our e-mail messages and a table for our users. Both tables will contain a numerical ID as their primary key field. That leaves us with the following database schema, represented here by SQL CREATE TABLE statements (the statements you would use as the database administrator to create the tables):

CREATE TABLE message (
       message_id       INTEGER(16)      NOT NULL,
       message_xml      CLOB             NOT NULL
);
CREATE TABLE user (
       user_id          INTEGER(16)      NOT NULL,
       user_name        VARCHAR(64)      NOT NULL
);
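
To make that concrete, here's roughly what storing a message might look like (hypothetical pseudo-SQL; the user name and the XML content shown are stand-ins, not the exact element names from our Chapter 5 DTD):

INSERT INTO user (user_id, user_name)
VALUES (3, 'dylan');

INSERT INTO message (message_id, message_xml)
VALUES (1, '<message><subject>Hi</subject><body>Hello!</body></message>');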

SQL Examples

I'm using “pseudo-SQL” for the printed examples throughout this book. You can't expect to type in these statements and see actual tables materialize in your database. “Standard” SQL (which I sometimes pronounce “squeal,” as in “squeal like a pig”) varies considerably depending on which database product you're using. I've used Sybase, Oracle, and SQL Server, and I have a smattering of experience with Informix and MySQL. The same principles should apply to all these systems. Database schema definition files for Oracle are provided on the CD that accompanies this book.

For all examples in which I'm storing XML in the database, I'm using the CLOB datatype. CLOB stands for Character Large OBject (nobody ever said software engineers could spell) and is one of SQL:1999's new datatypes. If your database doesn't support CLOBs, you can use any character-oriented datatype your database does support. For instance, some databases have a TEXT datatype that serves this purpose. Note that using these datatypes (as opposed to CHAR or VARCHAR) often means you take a performance hit. The advantage of CLOB datatypes is that you aren't limited to a small fixed size (although there is usually still an upper bound on CLOB size, which varies by database vendor).


I'm also using unique numerical ID numbers quite a lot (as in the e-mail example, with message_id). The underlying assumption is that something is creating these unique IDs. This capacity is built in to some databases; others require that you write a trigger to assign the ID when a new row is inserted. Either way, the ID usually comes from a sequence produced within the database, thereby guaranteeing uniqueness.
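
For example, in Oracle you might create a sequence and draw from it at insert time (an Oracle-flavored sketch; other databases spell this differently):

CREATE SEQUENCE message_id_seq;

INSERT INTO message (message_id, message_xml)
VALUES (message_id_seq.NEXTVAL, '<message>...</message>');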

Great! We're done!

Wait a second, hold the phone! We're not leveraging the power of our RDBMS. What fools these mortals be! What we have from the previous example is great as long as the database doesn't have to be able to answer the request “list the e-mail messages I've received”—an essential request in any e-mail system. Databases are smart, but they can't tell you what they don't know. If you want your relational database to answer requests like this—to “know” about your data instead of simply storing it—you need to tell it about your data in the first place. In XML terminology, you need to decompose the XML into your relational tables.

SQL and XML: The Joys of Partial Decomposition

XML is great at structuring data. Relational databases are great at storing and relating data. How do you get the best out of these two different ways of looking at data? You can create a table that contains a row for each part of your XML document. For example, take the following XML fragment:

The <b>quick</b> brown <i>fox</i> jumped over the <u>lazy</u> dog.

Some database packages implement a “persistent parse tree” database schema to store XML. What this means is that the entire XML document is decomposed, element by element, into individual nodes (see Figure 2-2 from Chapter 2), and each node is stored as a separate row in a database table, a table not specific to any particular XML language but able to accept any XML data. With this scheme, you need up to 13 rows of a table just to represent the preceding line of XML code. Each node of the tree is put in its own row, with relations built to the other nodes to reflect the original structure and order of the XML. The original XML document is essentially disintegrated, broken into its constituent parts and stored atomically. Yes, you could answer the question “Which words in all of my documents are bold?” But in order to do so, your database has to select across all these rows. In addition, such database schemas don't lend themselves well to indexing. And what about relational integrity? In the following fragment, how do you relate the reference to Charade back to a relational table of movie titles kept elsewhere in your database?

<movie id="3234">Charade</movie> is one of Cary Grant's later and least well
known films.
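
To picture what a persistent parse tree schema actually stores, here's a minimal sketch of a generic node table (the column names are hypothetical; real products vary):

CREATE TABLE node (
       node_id      INTEGER(16)     NOT NULL,
       parent_id    INTEGER(16),                   -- NULL for the root node
       node_type    VARCHAR(16)     NOT NULL,      -- element, attribute, text
       node_name    VARCHAR(64),                   -- e.g., movie, b, i, u
       node_value   VARCHAR(255),                  -- text content, if any
       node_order   INTEGER(16)     NOT NULL       -- position among siblings
);

Every element, attribute, and run of text becomes a row in this table, which is exactly why our one-line fragment balloons into a dozen or so rows.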

Another method for integrating XML with a SQL database is XML decomposition. An XML decomposition database schema takes the opposite approach to persistent parse trees by creating a schema that entirely encompasses your XML documents. Every element, as well as every single attribute of each element, has a separate table. Rigid relational integrity is maintained, but at the price of flexibility. Also, unstructured data (such as a flow of text with tags embedded at various points—like our CyberCinema reviews) is very difficult to support. With full decomposition, you throw away the XML when you decompose it into the SQL database because you don't need it any more. You've fully encoded your information in the database, and you should be able to recreate your XML instance if needed from that stored information. However, XML schema evolution (such as adding to your DTD) means SQL schema evolution, which can then lead to changing application code. Therefore decomposition is useful only when you have a rigidly defined XML schema that you know will never change, which is never.
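
For contrast, consider a hedged sketch of what full decomposition might generate for the movie fragment above, with one table for the element and one for its attribute (the names are hypothetical):

CREATE TABLE movie_element (
       element_id   INTEGER(16)     NOT NULL,
       review_id    INTEGER(16)     NOT NULL,      -- the document it came from
       title_text   VARCHAR(255)    NOT NULL       -- "Charade"
);
CREATE TABLE movie_id_attribute (
       element_id   INTEGER(16)     REFERENCES movie_element(element_id) NOT NULL,
       id_value     INTEGER(16)     NOT NULL       -- 3234
);

Multiply this by every element and attribute in your DTD, and you can see how adding a single element ripples through the schema and then through the application code.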

Both persistent parse trees and decomposition are misguided attempts to integrate XML and SQL tightly.

Enter Partial Decomposition

In contrast to the methods for XML/SQL integration described previously, partial decomposition is a simple method that can help manage some of the complexity inherent in a mixed XML/SQL-based system. The components of partial decomposition are the following:

  • Store the entire XML instance in the database.

  • Keep the XML instance as the primary source of data.

  • Extract selected data from the XML, and place it in relational tables for quick access and relational constraint management.

Partial decomposition really struts its stuff in the application of these rules. The approach is different from the methods described earlier: let XML do what it's good at (structuring content), and let SQL do what it's good at (enforcing relational constraints and organizing data).

Partial decomposition is a loose integration of XML and SQL. You can still use persistent parse trees with partial decomposition; in fact, it doesn't matter how you store the XML itself. The key to partial decomposition is deciding which data to extract and store relationally, and that depends on the purpose of your application and your data.

Let's go back to two strengths of SQL databases.

Enforcing Relational Constraints

With e-mail messages, you want your database to ensure that the To and From fields refer to real users. Relational databases can do this for you if you design your partial decomposition schema to include a user table and constraints that the To and From fields must refer to rows in this user table. Hence, we need To and From fields (that is, columns) in our e-mail table.

Organizing Data for Easy Recall

Organizing data really means answering questions about your data. The question you need to ask yourself is “What questions is my application going to ask the database?” In this case, we want to be able to answer the request: “List the e-mail messages a particular user has received.” The To and From columns already solve this problem for e-mail messages.

We end up with the following schema, represented by SQL create statements:

CREATE TABLE message (
       message_id     INTEGER(16)    NOT NULL       PRIMARY KEY,
       message_xml    CLOB           NOT NULL,
       message_from   INTEGER(16)    REFERENCES user(user_id)      NOT NULL,
       message_to     INTEGER(16)    REFERENCES user(user_id)      NOT NULL
);
CREATE TABLE user (
       user_id        INTEGER(16)    NOT NULL       PRIMARY KEY,
       user_name      VARCHAR(64)    NOT NULL
);

First, for our e-mail messages we're building a simple table that has a unique numerical key (which is important) and stores the whole of each XML instance in a CLOB field. The other two columns in the e-mail message table, To and From, are numerical and refer to rows in the other table we're creating—a user table.

CLOBs Part Two

In using CLOBs, it is often desirable (depending on your database server) to isolate the CLOBs into their own table for performance reasons. Applying this to the previous example, we get:

CREATE TABLE message (
   message_id    INTEGER(16)   PRIMARY KEY NOT NULL,
   message_from  INTEGER(16)   REFERENCES user(user_id) NOT NULL,
   message_to    INTEGER(16)   REFERENCES user(user_id) NOT NULL
);

CREATE TABLE message_content (
   message_id    INTEGER(16)   REFERENCES message(message_id) NOT NULL,
   message_xml   CLOB          NOT NULL
);

In this case, we've stored everything about the message in the message table itself, but we put the content of the message to the side in a message_content table. This approach is sometimes faster; the database engine can cache more of the message table because it doesn't have to keep the CLOBs in memory with the other data. Hence, queries against the message table are faster than queries against a combined table. In practice, a well-tuned application wouldn't even load the XML into memory until required. The efficacy of all this depends on your database vendor and the application logic you're connecting to it. A skilled database administrator can help you determine which approach makes the most sense for your data and your application.


The mechanism of partial decomposition now kicks in: Whenever you insert a new e-mail message into the message table or update a message already there, you have to examine the XML instance, decompose the information you're interested in, and insert or update the corresponding rows in your relational schema. This can happen in your application code as part of a transaction or (in databases that allow for complex coding within the database itself) as a trigger on the table. If you violate an integrity constraint (for example, your e-mail message XML instance contains a From field with a user ID that doesn't correspond to a user in the relational user table), you roll back the whole transaction and kick back an error. Relational integrity is maintained. If you want a list of a user's e-mail messages, a simple SQL query is all that is required:

select * from message where message_to = 3

To read the actual message, you pull out the XML and transform it as necessary for the display medium you're using.
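
To make the mechanics concrete, here's a minimal pseudo-SQL sketch of that insert path, assuming application code has already parsed the XML instance and extracted hypothetical user IDs 3 (From) and 7 (To):

BEGIN TRANSACTION;

INSERT INTO message (message_id, message_xml, message_from, message_to)
VALUES (42, '<message>...</message>', 3, 7);

-- if user 3 or user 7 doesn't exist in the user table, the REFERENCES
-- constraint fails and the whole transaction rolls back

COMMIT;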

Using Partial Decomposition as a Caching Strategy

Another way to use the tool of partial decomposition is to extract bits of data you want close at hand in your relational table. For instance, it is preferable to get the subject line of each message from the database, instead of having to retrieve each XML instance (that is, the e-mail message), parse it, and pull out each subject every time you have to display a list of messages.

Add the message subject line to our relational schema, like so:

CREATE TABLE message (
      message_id      INTEGER(16)    NOT NULL       PRIMARY KEY,
      message_xml     CLOB           NOT NULL,
      message_subject VARCHAR(255)   NOT NULL,
      message_from    INTEGER(16)    REFERENCES user(user_id)    NOT NULL,
      message_to      INTEGER(16)    REFERENCES user(user_id)    NOT NULL
);

Then all we have to do is add subject line extraction to our partial decomposition routine, and the subject is always close at hand.
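
Now listing messages, subject lines and all, is a single query against the relational table, with no XML parsing in sight:

select message_id, message_from, message_subject from message
where message_to = 3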

Data Synchronization with Partial Decomposition

Whenever you have data mirrored in two places at once, the two copies can fall out of sync; that is, the data in your relational tables might not match the data in your XML instance. This can happen for any number of reasons; for instance, the routines you're using for partial decomposition could fail halfway through, leaving your data only partially decomposed. Following are some strategies for keeping your data in sync:

  • Remember to keep XML in the driver's seat. Don't be tempted to update the rows of your table directly from your application code. Let your partial decomposition routines continue to do this based on your XML. By compartmentalizing the population of these tables in one segment of your code, the decomposition, you ensure that the data in the tables is always a mirror of the data in the XML itself. The system is also easier to maintain; when you want to add something to your decomposition, all the code is in one place (either as a part of your application code or in the database itself).

  • Use database transactions. If your code first inserts or updates an XML instance in your database and then inserts or updates rows associated with this XML instance, make this part of one database transaction. You can start a transaction, do all the inserts and updates you want, and then end the transaction, committing all changes at once. That way, if anything goes wrong halfway through, the database automatically rolls back all your changes and kicks back an exception, which you can catch and do something intelligent with (such as writing an error log that can alert support staff of a potential problem before your users start complaining about it).

  • Build data management utilities. You should have a way to rebuild all the partial decomposition information from your XML, and you should do this on a regular basis (depending on how much data you have). Rebuilding the partial decomposition information from your XML continually maintains consistent data and alerts you to potential problems before they balloon into big cleanup jobs. Your rebuild should be resilient enough to flag errors and then move on so that you can run it overnight and fix problems in the morning. You also should build small utilities for bulk deletes and bulk updates.

In general, think of the data stored in the partial decomposition tables as an index, not as the actual data. The data really is stored in only one place: the XML.
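
As a sketch of one such utility, a nightly consistency check might flag message rows that never got their XML content stored (a half-completed insert, for instance); this one assumes the split message/message_content layout from the CLOBs sidebar:

select message_id from message
where message_id not in (select message_id from message_content)

Any IDs this query returns point to half-written messages that need attention.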


Problems with Partial Decomposition

Partial decomposition is especially geared toward write-sometimes-read-mostly applications. Whenever you make changes to your data, the data has to go through the partial decomposition step, which (depending on the complexity of your data) can be costly in terms of performance. It can also be wasteful: even if you changed only one field, every characteristic you decompose is re-extracted from the XML instance. Document management systems, such as the one we're describing in the CyberCinema example, are particularly suited to this approach because once a document is published, it basically sits there and is read repeatedly.

Another problem with partial decomposition is with schema evolution (using the word “schema” in a general sense—anything relating to changing your data model). If your data must be changed and, therefore, your DTD must change, your schema for partial decomposition may also have to change. The good news is that it doesn't always have to change, only when you want to decompose new elements. And even then, it's an additive process.

It's important to keep these issues in mind when you're analyzing the performance characteristics of your application. Of course, the subject of application performance is a more complex issue, and entire books are dedicated to the subject of performance analysis and writing “performant” code (like “performant” is a real word—they're not fooling anyone).
