COLLATIONS IN SQL

SQL’s rules regarding type checking and coercion, in the case of character strings in particular, are (sadly) rather more complex than I’ve been pretending so far, and I need to elaborate somewhat. Actually it’s impossible in a book of this nature to do more than just scratch the surface of the matter, but the basic idea is this: Any given character string (a) consists of characters from one associated character set and (b) has one associated collation. A collation—also known as a collating sequence—is a rule that’s associated with a specific character set and governs the comparison of strings of characters from that character set. Let C be a collation for character set S, and let a and b be any two characters from S. Then C must be such that exactly one of the comparisons a < b, a = b, and a > b evaluates to TRUE and the other two to FALSE (under C). Note: In early versions of SQL there was just one character set, that character set had just one collation, and that collation was based on the numerical order of the binary codes used to represent the characters in that character set. But there’s no intrinsic reason why collating sequences should have to depend on internal coding schemes, and there are good practical reasons why they shouldn’t.

So much for the basic idea. However, there are complications. One arises from the fact that any given collation can have either PAD SPACE or NO PAD defined for it. Suppose the character strings ‘AB’ and ‘AB’ (note the trailing space in the second of these) have the same character set and the same collation. Then those two strings are clearly distinct, and yet they’re considered to “compare equal” if PAD SPACE applies.Recommendation: Don’t use PAD SPACE—always use NO PAD instead, if possible. Note, however, that the choice between PAD SPACE and NO PAD affects comparisons only—it makes no difference to assignments.[35]

Another complication arises from the fact that the comparison a = b might evaluate to TRUE under a given collation, even if the characters a and b are distinct. For example, we might define a collation called CASE_INSENSITIVE in which each lowercase letter is defined to compare equal to its uppercase counterpart. As a consequence, again, strings that are clearly distinct will sometimes compare equal.

We see, therefore, that certain comparisons of the form v1 = v2 can give TRUE in SQL even if v1 and v2 are distinct (and possibly even if they’re of different types, thanks to SQL’s support for coercion). I’ll use the term “equal but distinguishable” to refer to such pairs of values. Now, equality comparisons are performed, often implicitly, in numerous contexts—examples include MATCH, LIKE, UNIQUE, UNION, and JOIN—and the kind of equality involved in all such cases is indeed “equal even if distinguishable.” For example, let collation CASE_INSENSITIVE be as defined above, and let PAD SPACE apply to that collation. Then, if the PNO columns of tables P and SP both use that collation, and if ‘P2’ and ‘p2’ are PNO values in, respectively, some row of P and some row of SP, those two rows will be regarded as satisfying the foreign key constraint from SP to P, despite the lowercase ‘p’ and trailing spaces in the foreign key value.

What’s more, when evaluating expressions involving operators such as UNION, INTERSECT, EXCEPT, JOIN, GROUP BY, DISTINCT (and so on), the system sometimes has to decide which of several equal but distinguishable values is to be chosen as the value of some column in some result row. Unfortunately, SQL itself fails to give complete guidance in such situations. As a consequence, certain table expressions are indeterminate—the SQL term is possibly nondeterministic—in the sense that SQL doesn’t fully specify how they should be evaluated; indeed, they might quite legitimately give different results on different occasions. For example, if collation CASE_INSENSITIVE applies to column C in table T, then SELECT MAX(C) FROM T might return ‘ZZZ’ on one occasion and ‘zzz’ on another, even if T hasn’t changed in the interim.

I won’t give SQL’s rules here for when a given expression is “possibly nondeterministic” (see Chapter 12 for further discussion). It’s important to note, however, that such expressions aren’t allowed in integrity constraints (see Chapter 8), because they could cause updates to succeed or fail unpredictably. Observe in particular, therefore, that this rule implies among other things that many table expressions—even simple SELECT expressions, sometimes—aren’t allowed in constraints if they involve a column of some character string type! Strong recommendation: Avoid possibly nondeterministic expressions as much as you can.



[35] As a historical note, I remark that in the original (i.e., IBM) version of SQL, the only available collation—which was based on the internal coding scheme, of course—supported PAD SPACE only, and did that only tacitly. The reason for this state of affairs was a desire to conform to the corresponding rules for PL/I.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.215.1