THE RELATIONAL MODEL DEFINED

Now I’d like to give a precise definition of just what it is that constitutes the relational model. The trouble is, the definition I’ll give is indeed reasonably precise: so much so, in fact, that I think it would have been pretty hard to understand if I’d given it in Chapter 1. (As Bertrand Russell once memorably said: Writing can be either readable or precise, but not at the same time.) Now, I did give a definition in Chapter 1—a definition, that is, of what I there called “the original model”—but I frankly don’t think that definition is even close to being good enough, for the following reasons among others:

  • For starters, it was much too long and rambling. (Well, that was fair enough, given the intent of that preliminary chapter; but now I want a definition that’s reasonably succinct, as well as being precise.)

  • I don’t really much care for the idea that the model should be thought of as consisting of “structure plus integrity plus manipulation”; in some ways, in fact, I think it’s actively misleading to think of it in such terms. The truth is, those three aspects of the model are inextricably intertwined. For example, the relvars in any given database (structural piece) will be subject to a variety of integrity constraints (integrity piece), and those constraints will be expressed using a variety of relational operators (manipulative piece). Moreover, those relvars will (or should) have been designed in accordance with relational design theory, which likewise involves all three pieces. And of course they’ll be subject to update, which again involves all three pieces.

  • “The original model” included a few things I’m not too comfortable with: for instance, divide, nulls, the entity integrity rule, the idea of being forced to choose one key and make it primary, and the idea (still argued on occasion) that domains and types might somehow be different things. Regarding nulls, incidentally, I note that Codd invented the relational model in 1969 and didn’t introduce nulls until 1979; in other words, the model managed perfectly well—in my opinion, better—for some ten years without any notion of nulls at all. What’s more, early languages and implementations managed perfectly well without them, too.

  • The original model also omitted a few things I now consider vital. For example, it excluded any mention—at least, any explicit mention—of all of the following: predicates, constraints (other than key and foreign key constraints), relation variables, relational comparisons, relation type inference and associated features, image relations, certain algebraic operators (especially rename, extend, summarize (?), semijoin, and semidifference), and the important relations TABLE_DUM and TABLE_DEE. (On the other hand, I think it could fairly be argued that these features at least weren’t precluded by the original model; it might even be argued in some cases that they were in fact included, in a kind of embryonic form. For example, it was certainly always intended that implementations should include support for constraints other than just key and foreign key constraints. Relational comparisons too were at least implicitly required, even in Codd’s very first paper.)

    Without further ado, then, let me give my own preferred definition.

    Definition: The relational model consists of five components:

    1. An open ended collection of scalar types, including type BOOLEAN in particular[179]

    2. A relation type generator and an intended interpretation for relations of types generated thereby

    3. Facilities for defining relation variables of such generated relation types

    4. A relational assignment operator for assigning relation values to such relation variables

    5. A relationally complete (but otherwise open ended) collection of generic relational operators for deriving relation values from other relation values

The following subsections elaborate on each of these components in turn. First, however, a word of caution. As John Muir once said, when we try to pick out anything by itself, we find it hitched to everything else in the universe (often quoted in the form “Everything is connected to everything else”). John Muir was referring to the natural world, of course, but he might just as well have been talking about the relational model.[180] The fact is, the various features of the relational model are highly interconnected—remove just one of them, and the whole edifice crumbles. Translated into concrete terms, this metaphor means that if we build a “relational” DBMS that fails to support some aspect of the model, the resulting system (which shouldn’t really be called relational, anyway) will be bound to display behavior on occasion that’s certainly undesirable and possibly unforeseeable. I can’t stress the point too strongly: Every feature of the model is there for solid practical reasons; if we choose to ignore some detail, then we do so at our own peril.

Scalar Types

Scalar types can be either system defined or user defined, in general; thus, a means must be available for users to define their own scalar types (this requirement is implicit in the fact that the set of scalar types is open ended). A means must therefore also be available for users to define their own operators, since types without operators are useless. The set of system defined scalar types is required to include type BOOLEAN—the most fundamental type of all, containing precisely two values (viz., the truth values TRUE and FALSE)—but a real system will surely support others as well (INTEGER, CHAR, and so on). Support for type BOOLEAN implies support for the usual logical operators (NOT, AND, OR, and so on) as well as other operators, system or user defined, that return boolean values. In particular, the equality comparison operator “=” (which is a boolean operator by definition) must be available in connection with every type, nonscalar as well as scalar, for without it we couldn’t even say what the values are that constitute the type in question. What’s more, the model prescribes the semantics of that operator, too. To be specific, if v1 and v2 are values of the same type, then v1 = v2 returns TRUE if v1 and v2 are the very same value and FALSE otherwise.

Aside: The following is a logical consequence of the foregoing definition of equality that can be very helpful in practice. Let Op be an operator with a parameter P; let P be of type T, so that the argument corresponding to P in any given invocation of Op is also of type T; and let v1 and v2 be values of type T. If two successful invocations of Op that are identical in all respects except that the argument corresponding to P is v1 in one invocation and v2 in the other are somehow distinguishable in their effect, then v1 = v2 will (in fact, must) evaluate to FALSE. End of aside.

Let T be some scalar type. Associated with type T, then, there’s at least one selector operator, with the properties that (a) every invocation of that operator returns a value of type T and (b) every value of type T is returned by some invocation of that operator (more specifically, by some corresponding literal—recall that a literal is a special case of a selector invocation).

Relation Types

The relation type generator allows users to specify individual relation types as desired: for example, as the type for some relation variable or some relation valued attribute. The intended interpretation for a given relation of a given type, in a given context, is as a set of propositions; each such proposition (a) constitutes an instantiation of some predicate that corresponds to the relation heading, (b) is represented by a tuple in the relation body, and (c) is assumed to be true. If the context in question is some relvar—that is, if we’re talking about the relation that happens to be the current value of some relvar—then the predicate in question is the relvar predicate for that relvar. Relvars in particular are interpreted in accordance with The Closed World Assumption (see later in this appendix, also Chapter 5).

Let RT be some relation type. Associated with RT, then, there’s a relation selector operator, with the properties that (a) every invocation of that operator returns a relation of type RT and (b) every relation of type RT is returned by some invocation of that operator (more specifically, by some relation literal). Also, since the equality comparison operator “=” is available in connection with every type, it’s available in connection with type RT in particular. So too is the relational inclusion operator “⊆”; if relations r1 and r2 are of the same type, then r1 is included in r2 if and only if the body of r1 is a subset of that of r2.

Relation Variables

As noted above, one use—a particularly important one—for the relation type generator is in specifying the type of a relation variable, or relvar, when that relvar is defined. Such a variable is the only kind permitted in a relational database; all other kinds of variables, scalar variables or tuple variables or any other kind, are prohibited. (In programs that access such a database, by contrast, they’re not prohibited—in fact, they’re probably required.)

The statement that the database contains nothing but relvars is one possible formulation of what Codd originally called The Information Principle, though I don’t think it’s a formulation he ever used himself. Instead, he usually stated the principle like this:

The entire information content of the database at any given time is represented in one and only one way: namely, as explicit values in attribute positions in tuples in relations.

I heard Codd refer to this principle on more than one occasion as the fundamental principle underlying the relational model. Any violation of it thus has to be seen as serious.[181] Database tables that involve top to bottom row ordering or left to right column ordering, or contain duplicate rows, or pointers, or nulls, or have anonymous columns or duplicate column names, all constitute such violations. But why is the principle so important? The answer is bound up with the observations I made in Chapter 5 to the effect that (along with types) relations are both necessary and sufficient to represent any data whatsoever at the logical level. In other words, the relational model gives us everything we need in this respect, and it doesn’t give us anything we don’t need.

I’d like to pursue this point a moment longer. In general, it’s axiomatic that if we have n different ways of representing data, then we need n different sets of operators. For example, if we had arrays as well as relations, we’d need a full complement of array operators as well as a full complement of relational ones.[182] If n is greater than one, therefore, we have more operators to implement, document, teach, learn, remember, and use (and choose among). But those extra operators add complexity, not power! There’s nothing useful that can be done if n is greater than one that can’t be done if n equals one (and in the relational model, of course, n does equal one).

What’s more, not only does the relational model give us just one construct, the relation itself, for representing data, but that construct is—to quote Codd himself (see the section OBJECTIVES OF THE RELATIONAL MODEL, later in this appendix)—of spartan simplicity: It has no ordering to its tuples, it has no ordering to its attributes, it has no duplicate tuples, it has no pointers, and (at least as far as I’m concerned) it has no nulls. Any contravention of these properties is tantamount to introducing another way of representing data, and therefore to introducing more operators as well. In fact, SQL is living proof of this observation. For example, SQL has nine different union operators (and ought by rights to have 18, if not 27), while the relational model has just one.

Aside: Perhaps I should explain these last remarks. First of all, SQL supports six different unions for tables as such—UNION DISTINCT, UNION DISTINCT CORRESPONDING, UNION DISTINCT CORRESPONDING BY, and three variants on these in which DISTINCT is replaced by ALL. The funny thing is, the one kind of union it doesn’t support for tables as such is true bag union! For the record, here’s the definition: Let b1 and b2 be bags; let x appear exactly n1 times in b1 and exactly n2 times in b2; and let b be the bag union of b1 and b2. Then x appears exactly n times in b, where—to adopt a self-explanatory notation—n = MAX(n1,n2). So SQL ought by rights to support BAG (or some such keyword) as an alternative to DISTINCT and ALL, giving us three more unions. Nine so far. Next, SQL supports two different unions for what it calls “multiset values” (as opposed to tables), viz., MULTISET UNION DISTINCT and MULTISET UNION ALL; but it ought really to support seven more possibilities here too (involving BAG, CORRESPONDING, and so on), at least if those “multiset values” are multisets of rows specifically. Now we’re up to 18. Finally, SQL also supports a union “set function” (for use in summarization), though it calls it not UNION but FUSION. FUSION has no variants, but by rights the same possibilities that apply to tables should apply here too (again, if the “multiset values” in question are multisets of rows specifically). Total: 27.[183] End of aside.

As you can see, then, The Information Principle is certainly important—but it has to be said that its name hardly does it justice. Other names that have been proposed, mainly by Hugh Darwen or myself or both, include The Principle of Uniform Representation and The Principle of Uniformity of Representation. (This latter is clumsy, I admit, but at least it’s accurate.)

There’s one more point I should mention under the heading of “Relation Variables.” As Darwen and I demonstrate in our book on The Third Manifesto, the database isn’t really just “a container for relvars,” even though we usually talk about it as if it were. Rather, it’s a variable. After all, it can certainly be updated—and that means it’s a variable by definition! Logically speaking, in other words, the database in its entirety is one (typically rather large) variable in itself, which we might call a dbvar. I’ll elaborate on this concept in the section DATABASE VARIABLES later in this appendix.

Relational Assignment

Like the equality comparison operator “=”, the assignment operator “:=” must be available in connection with every type (for without it we would have no way of assigning values to a variable of the type in question), and again relation types are no exception to this rule. The operators INSERT, DELETE, and UPDATE (likewise D_INSERT and I_DELETE) are permitted and indeed useful, but strictly speaking they’re only shorthands. What’s more, support for relational assignment (a) must include support for multiple relational assignment in particular and (b) must abide by both The Assignment Principle and The Golden Rule.

Relational Operators

The “generic relational operators” are the operators that make up the relational algebra (or something logically equivalent to the algebra), and they’re therefore built in—though there’s no inherent reason why users shouldn’t be able to define additional operators of their own, if desired. Precisely which operators are included isn’t specified, but they’re required to provide, in their totality, at least the expressive power of the relational calculus. (In other words, they’re required to be relationally complete—see further discussion below.)

Now, there seems to a widespread misconception concerning the purpose of the algebra. To be specific, many people seem to think it’s meant just for writing queries—but it’s not; rather, it’s for writing relational expressions. Those expressions in turn serve many purposes, including query but certainly not limited to query alone. Here are some other important ones (this isn’t an exhaustive list):

  • Defining views and snapshots

  • Defining the set of tuples to be inserted into, deleted from, or updated in, some relvar (or, more generally, defining the set of tuples to be assigned to some relvar)

  • Defining constraints (though here the relational expression in question will be just a subexpression of some boolean expression, typically but not invariably an IS_EMPTY invocation)

  • Serving as a basis for investigations into other areas, such as optimization and database design

The algebra also serves as a kind of yardstick against which the expressive power of database languages can be measured. Essentially, a language is said to be relationally complete if and only if it’s at least as powerful as the algebra (or the calculus—it comes to the same thing), meaning its expressions permit the definition of every relation that can be defined by means of expressions of the algebra (or the calculus). As noted in Chapter 10, relational completeness is a basic measure of the expressive capability of a language; if a language is relationally complete, it means (among other things, and speaking a trifle loosely) that queries of arbitrary complexity can be formulated without having to resort to branching or iterative loops. In other words, as I also said in that earlier chapter, it’s relational completeness that allows end users—at least in principle, though possibly not in practice—to access the database directly, without having to go through the potential bottleneck of the IT department.



[179] As explained in Chapter 2, the relational model doesn’t rely on the scalar vs. nonscalar distinction in any formal sense. I appeal to it here (as elsewhere in this book) merely as an aid to intuition.

[180] I’ve already said it’s misleading to think of the relational model as structure plus integrity plus manipulation because all three aspects are inextricably intertwined; well, of course, the same goes for the five components described in the following subsections, to some extent.

[181] It goes without saying that object databases, XML databases, and more generally nonrelational databases of any kind, do all violate it, necessarily.

[182] We’d also have to choose which data we wanted to represent as relations and which as arrays, probably without any good guidelines to help us in making such choices. And what about the catalog? Would it contain relations, or arrays? Or a mixture?

[183] To all of the foregoing I’d like to add a comment Hugh Darwen once made to me (in a private communication): “UNION CORRESPONDING was added to SQL in 1992, presumably to fill some perceived gap in functionality. Suppose it had been part of the language as originally defined; when if ever would the need have emerged to introduce a UNION based on left to right column ordering instead?” I note too that questions like this one apply to a whole host of constructs that have been added to SQL since it was first defined.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.89.82