Chapter 6
IN THIS CHAPTER
Executing SQL statements
Using (and misusing) reserved words
Working with SQL’s data types
Handling null values
Applying constraints
In this chapter, I get into the nitty-gritty of SQL. This is knowledge you need to master before you embark on actually writing SQL statements. SQL has some similarities to computer languages you may already be familiar with, and some important differences. I touch on some of these similarities and differences right here in this chapter, but will discuss others later when I get to the appropriate points in a complete discussion of SQL.
SQL is not a complete language, but a data sublanguage. As such, you cannot write a program in the SQL language like you can with C or Java. That doesn’t mean SQL is useless, though. There are several ways that you can use SQL. Say you have a query editor up on your screen and all you want is the answer to a simple question. Just type an SQL query, and the answer, in the form of one or more lines of data, appears on your screen. This mode of operation is called interactive SQL.
If your needs are more complex, you have two additional ways of making SQL queries: embedded SQL and module language, both of which I describe later in this chapter.
Interactive SQL consists of entering SQL statements into a database management system such as SQL Server, Oracle, or DB2. The DBMS then performs the commands specified by the statements. You could build a database from scratch this way, starting with a CREATE DATABASE
statement, and building everything from there. You could fill it with data, and then type queries to selectively pull information out of it.
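For example, a minimal interactive session might look like the following sketch. The table and data here are invented for illustration; you type each statement into your DBMS’s query editor and see the results immediately:

```sql
CREATE TABLE CUSTOMER (
    CustomerID INTEGER PRIMARY KEY,
    LastName   VARCHAR (25)
) ;

INSERT INTO CUSTOMER VALUES (1001, 'Wong') ;

-- The answer appears on your screen as a line of data
SELECT LastName
FROM CUSTOMER
WHERE CustomerID = 1001 ;
```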
Although it’s possible to do everything you need to do to a database with interactive SQL, this approach has a couple of disadvantages, which is why most database applications hide the SQL from their users.
SQL is the only language that most relational databases understand, so there is no getting around using it. However, the people who interact with databases the most — those folks that ask questions of the data — do not need to be exposed to naked SQL. They can be protected from that intimidating prospect by wrapping the SQL in a blanket of code written in another language. With that other language, a programmer can generate screens, forms, menus, and other familiar objects for the user to interact with. Ultimately, those things translate the user’s actions to SQL code that the DBMS understands. The desired information is retrieved, and the user sees the result.
SQL has some fundamental differences from the host languages that you might want to combine it with. Chief among them is that SQL’s data types do not match the data types of any host language, which is why SQL provides the CAST statement for translating one data type into another.

Until recently, the most common form of SQL has been embedded SQL. This method uses a general-purpose computer language such as C, C++, or COBOL to write the bulk of an application. Such languages are great for creating an application’s user interface. They can create forms with buttons and menus, format reports, perform calculations, and basically do all the things that SQL cannot do. In a database application, however, sooner or later, the database must be accessed. That’s a job for SQL.
It makes sense to write the application in a host language and, when needed, drop in SQL statements to interact with the data. It is the best of both worlds. The host language does what it’s best at, and the embedded SQL does what it’s best at. The only downside to the cooperative arrangement is that the host language compiler will not recognize the SQL code when it encounters it and will issue an error message. To avoid this problem, a precompiler processes the SQL before the host language compiler takes over. When everything works, this is a great arrangement. Before everything works, however, debugging can be tough because a host language debugger doesn’t know how to handle any SQL that it encounters. Nevertheless, embedded SQL remains the most popular way to create database applications.
For example, look at a fragment of C code that contains embedded SQL statements. This particular fragment is written in Oracle’s Pro*C dialect of the C language and is code that might be found in an organization’s human resources department. This particular code block is designed to authenticate and log on a user, and then enable the user to change the salary and commission information for an employee.
EXEC SQL BEGIN DECLARE SECTION;
    VARCHAR uid[20];
    VARCHAR pwd[20];
    VARCHAR ename[10];
    float salary, comm;
    short salary_ind, comm_ind;
EXEC SQL END DECLARE SECTION;
main()
{
    int sret;                          /* scanf return code */
    /* Log in */
    strcpy(uid.arr,"Mary");            /* copy the user name */
    uid.len=strlen(uid.arr);
    strcpy(pwd.arr,"Bennett");         /* copy the password */
    pwd.len=strlen(pwd.arr);
    EXEC SQL WHENEVER SQLERROR STOP;
    EXEC SQL WHENEVER NOT FOUND STOP;
    EXEC SQL CONNECT :uid;
    printf("Connected to user: %s\n", uid.arr);
    printf("Enter employee name to update: ");
    scanf("%s", ename.arr);
    ename.len=strlen(ename.arr);
    EXEC SQL SELECT SALARY,COMM INTO :salary,:comm
             FROM EMPLOY
             WHERE ENAME=:ename;
    printf("Employee: %s salary: %6.2f comm: %6.2f\n",
           ename.arr, salary, comm);
    printf("Enter new salary: ");
    sret=scanf("%f",&salary);
    salary_ind = 0;
    if (sret == EOF || sret == 0)      /* set indicator */
        salary_ind = -1;               /* Set indicator for NULL */
    printf("Enter new commission: ");
    sret=scanf("%f",&comm);
    comm_ind = 0;                      /* set indicator */
    if (sret == EOF || sret == 0)
        comm_ind = -1;                 /* Set indicator for NULL */
    EXEC SQL UPDATE EMPLOY
             SET SALARY=:salary:salary_ind,
                 COMM=:comm:comm_ind
             WHERE ENAME=:ename;
    printf("Employee %s updated.\n", ename.arr);
    EXEC SQL COMMIT WORK;
    exit(0);
}
Here’s a closer look at what the code does:
In this implementation, every SQL statement is introduced with an EXEC SQL
directive. This is a clue to the compiler not to try to compile what follows, but instead to pass it directly to the DBMS’s database engine.
Module language is similar to embedded SQL in that it combines the strengths of SQL with those of a host language. However, it does it in a slightly different way. All the SQL code is stored — as procedures — in a module separate from the host language program. Whenever the host language program needs to perform a database operation, it calls a procedure from the SQL module to do the job. With this arrangement, all your SQL is kept out of the main program, so the host language compiler has no problem, and neither does the debugger. All they see is host language code, including the procedure calls. The procedures themselves cause no difficulty because they are in a separate module, and the compiler and debugger just skip over them.
Another advantage of module language over embedded SQL is that the SQL code is separated from the host language code. Because high skill in both SQL and any given host language is rare, it is difficult to find good people to program embedded SQL applications. Because a module language implementation separates the languages, you can hire the best SQL programmer to write the SQL, and the best host language programmer to write the host language code. Neither one has to be an expert in the other language.
To see how this would work, check out the following module definition, which shows you the syntax you’d use to create a module that contains SQL procedures:
MODULE [module-name]
[NAMES ARE character-set-name]
LANGUAGE {ADA|C|COBOL|FORTRAN|MUMPS|PASCAL|PLI|SQL}
[SCHEMA schema-name]
[AUTHORIZATION authorization-id]
[temporary-table-declarations…]
[cursor-declarations…]
[dynamic-cursor-declarations…]
procedures…
The MODULE
declaration is mandatory, but the module name is not. (It’s a good idea to name your modules anyway, just to reduce the confusion.) With the optional NAMES ARE
clause, you can specify a character set — Hebrew, for example, or Cyrillic. The default character set will be used if you don’t include a NAMES ARE
clause.
The next line lets you specify a host language — something you definitely have to do. Each language has different expectations about what the procedure will look like, so the LANGUAGE
clause determines the format of the procedures in the module.
Although the SCHEMA
clause and the AUTHORIZATION
clause are both optional, you must specify at least one of them. The AUTHORIZATION
clause is a security feature. If your authorization ID does not carry sufficient privileges, you won’t be allowed to use the procedures in the module.
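To make the skeleton concrete, here is a sketch of a minimal module containing one procedure. The module, schema, table, and procedure names are all invented for illustration, and the exact parameter syntax varies somewhat among implementations:

```sql
MODULE Inventory_access
    LANGUAGE C
    SCHEMA Inventory
    AUTHORIZATION warehouse_mgr

    PROCEDURE Get_quantity
        (SQLSTATE, :item_id INTEGER, :qty INTEGER) ;
        SELECT Quantity INTO :qty
            FROM ITEM
            WHERE ItemID = :item_id ;
```

A C program could then call Get_quantity like any other external function, and the C compiler would never see a line of SQL.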
Given the fact that SQL makes constant use of command words such as CREATE
and ALTER
, it stands to reason that it would probably be unwise to use these same words as the names of tables or variables. To do so is a guaranteed way to confuse your DBMS. In addition to such command words, a number of other words also have a special meaning in SQL. These reserved words should also not be used for any purpose other than the one for which they are designed. Consider the following SQL statement:
SELECT CustomerID, FirstName, LastName
FROM Customer
WHERE CustomerID < 1000;
SELECT
is a command word, and FROM
and WHERE
are reserved words. SQL has hundreds of reserved words, and you must be careful not to inadvertently use any of them as the names of objects or variables. Appendix A of this book contains a list of reserved words in ISO/IEC SQL:2016.
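If you absolutely must use a reserved word as a name, ISO standard SQL lets you enclose it in double quotes to form a delimited identifier, although doing so invites confusion. The table here is invented for illustration:

```sql
-- ORDER is a reserved word, so this fails in most DBMSs:
--   CREATE TABLE Order (OrderID INTEGER) ;
-- A delimited identifier sidesteps the problem, but you must then
-- quote the name every time you refer to it:
CREATE TABLE "Order" (
    OrderID   INTEGER PRIMARY KEY,
    OrderDate DATE
) ;
```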
SQL is capable of dealing with data of many different types — as this aptly named section will soon make clear. From the beginning, SQL has been able to handle the common types of numeric and character data, but more recently, new types have been added that enable SQL to deal with nontraditional data types, such as BLOB, CLOB, and BINARY. At present, there are eleven major categories of data types: exact numerics, approximate numerics, character strings, binary strings, Booleans, datetimes, intervals, XML type, collection types, REF types, and user-defined types. Within each category, one or more specific types may exist.
With those categories in mind, read on to find brief descriptions of each one, as well as enumerations of the standard types they include.
Because computers store numbers in registers of finite size, there is a limit to how large or small a number can be and still be represented exactly. There is a range of numbers centered on zero that can be represented exactly. The size of that range depends on the size of the registers that the numbers are stored in. Thus a machine with 64-bit registers can exactly represent a range of numbers that is wider than the range that can be exactly represented on a machine with 32-bit registers.
After doing all the complex math, you’re left with six standard exact numeric data types. They are
INTEGER
SMALLINT
BIGINT
NUMERIC
DECIMAL
DECFLOAT
The next few sections drill down deeper into each type.
Data of the INTEGER
type is numeric data that has no fractional part. Any given implementation of SQL will have a limit to the number of digits that an integer can have. If, for some reason, you want to specify a maximum size for an integer that is less than the default maximum, you can restrict the maximum number of digits by specifying a precision argument. By declaring a variable as having type INTEGER
(10)
, you are saying numbers of this type can have no more than ten digits, even if the system you are running on is capable of handling more digits. Of course, if you specify a precision that exceeds the maximum capacity of the system, you’re not gonna get it no matter how much you whine. You cannot magically expand the sizes of the hardware registers in a machine with an SQL declaration.
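For example, assuming your implementation accepts a precision argument for integers, column declarations might look like this (the table and column names are invented for illustration):

```sql
CREATE TABLE FLEET (
    VehicleID INTEGER,        -- the implementation's default precision
    Odometer  INTEGER (10)    -- no more than ten digits
) ;
```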
The SMALLINT
data type is similar to the INTEGER
type, but how it differs from the INTEGER
type is implementation-dependent. It may not differ from the INTEGER
type at all. The only constraint on the SMALLINT
type is that its precision may be no larger than the precision of the INTEGER
type.
For systems where the precision of the SMALLINT
type actually is less than the precision of the INTEGER
type, it may be advantageous to specify variables as being of the SMALLINT
type if you can be sure that the values of those variables will never exceed the precision of the SMALLINT
type. This saves you some storage space. If storage space is not an issue, or if you cannot be absolutely sure that the value of a variable will never exceed the precision of the SMALLINT
type, you may be better off specifying it as being of the INTEGER
type.
The BIGINT
type is similar to the SMALLINT
type. The only difference is that the precision of the BIGINT
type can be no smaller than the precision of the INTEGER
type. As is the case with SMALLINT
, the precision of the BIGINT
type could be the same as the precision of the INTEGER
type.
If the precision of the BIGINT
type for any given implementation is actually larger than the precision of the INTEGER
type, a variable of the BIGINT
type will take up more storage space than a variable of the INTEGER
type. Only use the BIGINT
type if there is a possibility that the size of a variable may exceed the precision of the INTEGER
type.
Data of the NUMERIC
type does have a fractional part. This means the number contains a decimal point and zero or more digits to the right of the decimal point. For NUMERIC
data, you can specify both precision and scale. The scale of a number is the number of digits to the right of the decimal point. For example, a variable declared as of type NUMERIC
(10, 2)
would have a maximum of ten digits, with two of those digits to the right of the decimal point. The largest number you can represent with this type is 99,999,999.99. If the system you are running on happens to be able to handle numbers with precision greater than ten, only the precision you specify will be used.
Data of the DECIMAL
type is similar to data of the NUMERIC
type with one difference. For data of the DECIMAL
type, if the system you are running on happens to be able to handle numbers with larger precision than what you have specified, the extra precision will be used.
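For example, the following invented table shows both types side by side, with the contrast between them noted in comments:

```sql
CREATE TABLE INVOICE (
    InvoiceID INTEGER,
    Subtotal  NUMERIC (10, 2),   -- never exceeds 99,999,999.99
    Tax       DECIMAL (10, 2)    -- may use extra precision if the
                                 -- hardware provides it
) ;
```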
DECFLOAT is a new exact numeric data type in SQL:2016, added to the ISO/IEC standard specifically for business applications that deal with exact decimal values. Floating-point data types, such as REAL and DOUBLE PRECISION, can handle larger numbers than exact numerics such as NUMERIC and DECIMAL, but they cannot be counted upon to produce exact decimal values. DECFLOAT can handle larger numbers than the other exact numeric data types while retaining the exactness of an exact numeric type.
The approximate numeric types (all three of them) exist so that you can represent numbers either too large or too small to be represented by an exact numeric type. If, for example, a system has 32-bit registers, then the largest number that can be represented with an exact numeric type is the largest number that can be represented with 32 binary digits — which happens to be 4,294,967,295 in decimal. If you have to deal with numbers larger than that, you must move to approximate numerics or buy a computer with 64-bit registers. Using approximate numerics may not be much of a hardship: For most applications, after you get above four billion, approximations are good enough.
Similarly, values very close to zero cannot be represented with exact numerics either. The smallest number that can be represented exactly on a 32-bit machine has a one in the least significant bit position and zeros everywhere else. This is a very small number, but there are a lot of numbers of interest, particularly in science, that are smaller. For such numbers, you must also rely on approximate numerics.
With that intro out of the way, it’s time to meet the three approximate numeric types: REAL
, DOUBLE PRECISION
, and FLOAT
.
The REAL
data type is what you would normally use for single-precision floating-point numbers. The exact meaning of the term single precision is hardware-dependent: a machine with 64-bit registers will, in general, have greater precision than a machine with 32-bit registers. How much greater may vary from one implementation to another.
A double-precision number, which is the basis for the double precision (DOUBLE
) data type, on any given system has greater precision than a real number on the same system. However, despite the name, a double-precision number does not necessarily have twice the precision of a real number. The most that can be said in general is that a double-precision number on any given system has greater precision than does a real number on the same system. On some systems, a double-precision number may have a larger mantissa than does a real number. On other systems, a double-precision number may support a larger exponent (absolute value). On yet other systems, both mantissa and exponent of a double-precision number may be larger than for a real number. You will have to look at the specifications for whatever system you are using to find out what is true for you.
The FLOAT
data type is very similar to the REAL
data type. The difference is that with the FLOAT
data type you can specify a precision. With the REAL
and DOUBLE PRECISION
data types, the default precision is your only option. Because the default precision of these data types can vary from one system to another, porting your application from one system to another could be a problem. With the FLOAT
data type, specifying the precision of an attribute on one machine guarantees that the precision will be maintained after porting the application to another machine. If a system’s hardware supports double-precision operations and the application requires double-precision operations, the FLOAT
data type automatically uses the double-precision circuitry. If single-precision is sufficient, it uses that.
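Declarations of the three approximate numeric types might look like the following sketch (the table and column names are invented, and the FLOAT precision is given in binary digits):

```sql
CREATE TABLE READING (
    SensorID    INTEGER,
    Temperature FLOAT (20),        -- at least 20 bits of precision,
                                   -- preserved across platforms
    Voltage     REAL,              -- the system's single precision
    Distance    DOUBLE PRECISION   -- the system's double precision
) ;
```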
After numbers, the next most common thing to be stored is strings of alphanumeric characters. SQL provides several character string types, each with somewhat different characteristics from the others. The three main types are CHARACTER
, CHARACTER VARYING
, and CHARACTER LARGE OBJECT
. These three types are mirrored by NATIONAL CHARACTER
, NATIONAL CHARACTER VARYING
, and NATIONAL CHARACTER LARGE OBJECT
, which deal with character sets other than the default one (usually the character set of the English language).
A column defined as being of type CHARACTER
or CHAR
can contain any of the normal alphanumeric characters of the language being used. A column definition also includes the maximum length allowed for an item of the CHAR
type. Consider this example:
Name CHAR (15)
This field can hold a name up to 15 characters long. If the name is less than 15 characters long, the remaining spaces are filled with blank characters to bring the total length up to 15. Thus a CHARACTER
field always takes up the same amount of space in memory, regardless of how long the actual data item in the field is.
The CHARACTER VARYING
or VARCHAR
data type is like the CHARACTER
type in all respects except that short entries are not padded out with blanks to fill the field to the stated maximum.
Name VARCHAR (15)
The VARCHAR
data type doesn’t add blanks on the end of a name. Thus if the Name field contains Joe, the length of the field that is stored will be only three characters rather than fifteen.
Any implementation of SQL has a limit to the number of characters that are allowed in a CHARACTER
or CHARACTER VARYING
field. For example, the maximum length of a VARCHAR2 value in Oracle 11g is 4,000 bytes. If you want to store text that goes beyond your implementation’s limit, you can use the CHARACTER LARGE OBJECT
data type. The CLOB
type, as it is affectionately known, is much less flexible than either the CHAR
or VARCHAR
types in that it does not allow you to do many of the fine-grained manipulations that you can do in those other types. You can compare two CLOB
items for equality, but that’s about all you can do. With CHARACTER
type data you can, for example, scan a string for the first occurrence of the letter W, and display where in the string it occurs. This type of operation is not possible with CHARACTER LARGE OBJECT
data.
Here’s an example of the declaration of a CHARACTER LARGE OBJECT
:
Dream CLOB (8721)
Another restriction on CLOB
data is that a CLOB
data item may not be used as a primary key or a foreign key. Furthermore, you cannot apply the UNIQUE
constraint to an item of the CLOB
type. The bottom line is that the CLOB
data type enables you to store and retrieve large blocks of text, but it turns out you can’t do much with them beyond that.
Different languages use different character sets. For example, Spanish and German have letters with diacritical marks that change the way the letter is pronounced. Other languages, such as Russian, have an entirely different character set. To store character strings that contain these different character sets, the various national character types have been added to SQL. If the English character type is the default on your system, as it is for most people, you can designate a different character set as your national character set. From that point on, when you specify a data type as NATIONAL CHARACTER
, NATIONAL CHARACTER VARYING
, or NATIONAL CHARACTER LARGE OBJECT
, items in columns so specified use the chosen national character set rather than the default character set.
In addition to whatever national character set you specify, you can use multiple other character sets in a table definition, by specifying them explicitly. Here’s an example where the national character set is Russian, but you explicitly add Greek and Kanji (Japanese) to the mix:
CREATE TABLE BOOK_TITLE_TRANSLATIONS (
English CHARACTER (40),
Greek VARCHAR (40) CHARACTER SET GREEK,
Russian NATIONAL CHARACTER (40),
Japanese CHARACTER (40) CHARACTER SET KANJI
) ;
The various binary string data types were added to SQL:2008. Binary strings are like character strings except that the only characters allowed are 1 and 0. There are three different types of binary strings, BINARY
, BINARY VARYING
, and BINARY LARGE OBJECT
.
A string of binary characters of the BINARY
type must be some multiple of eight bits long. You can specify such a string with BINARY (x)
, where x is the number of bytes of binary data contained in the string. For example, if you specify a binary string with BINARY (2)
, then the string will be two bytes, or 16 bits long. Byte one is defined as the first byte of the string.
The BINARY VARYING
or VARBINARY
type is like the BINARY
type except the string length need not be x bytes long. A string specified as VARBINARY (x)
can be a minimum of zero bytes long and a maximum of x bytes long.
The BINARY LARGE OBJECT
(BLOB
) type is used for a really large binary number. That large binary number may represent the pixels in a graphical image, or something else that doesn’t seem to be a number. However, at the most fundamental level, it is a number.
The BLOB
type, like the CLOB
type, was added to the SQL standard to reflect the reality that more and more of the things that people want to store in databases do not fall into the classical categories of being either numbers or text. You cannot perform arithmetic operations on BLOB
data, but at least you can store it in a relational database and perform some elementary operations on it.
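Declarations of the three binary string types might look like this sketch (the table, column names, and sizes are invented for illustration):

```sql
CREATE TABLE BADGE (
    EmployeeID INTEGER,
    Flags      BINARY (2),       -- exactly 2 bytes (16 bits)
    Thumbprint VARBINARY (512),  -- from 0 to 512 bytes
    Photo      BLOB (1000000)    -- a large binary object, such as
                                 -- the pixels of an image
) ;
```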
A column of the BOOLEAN
data type, named after nineteenth-century English mathematician George Boole, will accept any one of three values: TRUE
, FALSE
, and UNKNOWN
. The fact that SQL entertains the possibility of NULL
values expands the traditional restriction of Boolean values from just TRUE
and FALSE
to TRUE
, FALSE
, and UNKNOWN
. If a Boolean TRUE
or FALSE
value is compared to a NULL
value, the result is UNKNOWN
. Of course, comparing a Boolean UNKNOWN
value to any value also gives an UNKNOWN
result.
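The following sketch (with an invented table) shows the practical consequence: a row whose Boolean column is NULL compares as UNKNOWN, so it matches neither the TRUE test nor the FALSE test:

```sql
CREATE TABLE TASK (
    TaskName VARCHAR (30),
    Done     BOOLEAN          -- TRUE, FALSE, or NULL
) ;

-- A row where Done is NULL is returned by neither of these queries:
SELECT TaskName FROM TASK WHERE Done = TRUE ;
SELECT TaskName FROM TASK WHERE Done = FALSE ;
```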
You often need to store either dates, times, or both, in addition to numeric and character data. ISO/IEC standard SQL defines five datetime types. Because considerable overlap exists among the five types, not all implementations of SQL include all five types. This could cause problems if you try to migrate a database from a platform that uses one subset of the five types to a platform that uses a different subset. There is not much you can do about this except deal with it when the issue arises.
The DATE
data type is the one to use if you care about the date of something but could not care less about the time of day within a date. The DATE
data type stores a year, month, and day in that order, using ten character positions in the form yyyy-mm-dd. If you were recording the dates that humans first landed on the Moon, the entry for Apollo 11 would be 1969-07-20.
Suppose you want to store the time of day, but don’t care which day, and furthermore, don’t even care which time zone the time refers to? In that case, the TIME WITHOUT TIME ZONE
data type is just the ticket. It stores hours, minutes, and seconds. The hours and minutes data occupies two digits apiece. The seconds data also occupies two digits, but in addition may include a fractional part for fractions of a second. If you specify a column as being of TIME WITHOUT TIME ZONE
type, with no parameter, it will hold a time that has no fractional seconds. An example is 02:56:31, which is fifty-six minutes and thirty-one seconds after two in the morning.
For greater precision in storing a time value, you can use a parameter to specify the number of digits beyond the decimal point that will be stored for seconds. Here’s an example of such a definition:
Smallstep TIME WITHOUT TIME ZONE (2),
In this example, there are two digits past the decimal point, so time is measured down to a hundredth of a second. It would take the form of 02:56:31.17.
The TIME WITH TIME ZONE
data type gives you all the information that you get in the TIME WITHOUT TIME ZONE
data type, and adds the additional fact of what time zone the time refers to. All time zones around the Earth are referenced to Coordinated Universal Time (UTC), formerly known as Greenwich Mean Time (GMT). Coordinated Universal Time is the time in Greenwich, U.K., which was the place where people first started being concerned with highly accurate timekeeping. Of course, the United Kingdom is a fairly small country, so UTC is in effect throughout the entire U.K. In fact, a huge “watermelon slice” of the Earth, running from the North Pole to the South Pole, is also in the same time zone as Greenwich. There are 24 such slices that girdle the Earth. Times around the earth range from eleven hours and fifty-nine minutes behind UTC to twelve hours ahead of UTC (not counting Daylight Saving Time). If Daylight Saving Time is in effect, the offset from UTC could be as much as –12:59 or +13:00. The International Date Line is theoretically exactly opposite Greenwich on the other side of the world, but is offset in spots so as to keep some countries in one time zone.
Just as sometimes you will need to record dates, and other times you will need to record times, it’s certain that there will also be times when you need to store both times and dates. That is what the TIMESTAMP WITHOUT TIME ZONE
data type is for. It is a combination of the DATE
type and the TIME WITHOUT TIME ZONE
type. The one difference between this data type and the TIME WITHOUT TIME ZONE
type is that the default value for fractions of a second is six digits rather than zero. You can, of course, specify zero fractional digits, if that is what you want. Suppose you specified a database table column as follows:
Smallstep TIMESTAMP WITHOUT TIME ZONE (0),
A valid value for Smallstep would be 1969-07-21 02:56:31. That was the date and time in Greenwich when Neil Armstrong’s foot first touched the lunar soil. It consists of ten date characters, a blank space separator, and eight time characters.
If you have to record the time zone that a date and time refers to, use the TIMESTAMP WITH TIME ZONE
data type. It’s the same as the TIMESTAMP WITHOUT TIME ZONE
data type, with the addition of an offset that shows the time’s relationship to Coordinated Universal Time. Here’s an example:
Smallstep TIMESTAMP WITH TIME ZONE (0),
In this case, Smallstep might be recorded as 1969-07-20 21:56:31-05:00. That is the date and time in Houston when Neil Armstrong’s foot first touched the lunar soil. Houston time is normally six hours behind Greenwich time, but in July, it is only five hours behind due to Daylight Saving Time.
An interval is the difference between two dates, two times, or two datetimes. There are two different kinds of intervals, the year-month interval and the day-hour-minute-second interval. A day always has 24 hours. An hour always has 60 minutes. A minute always has 60 seconds. However, a month may have 28, 29, 30, or 31 days. Because of that variability, you cannot mix the two kinds of intervals. A field of the INTERVAL
type can hold either a year-month interval or a day-time interval, but not a mixed interval such as 2 years, 7 months, 13 days, 5 hours, 6 minutes, and 45 seconds.
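In a table definition, you declare which kind of interval a column holds. Here is a sketch with invented names:

```sql
CREATE TABLE PROJECT (
    ProjectName VARCHAR (30),
    Duration    INTERVAL YEAR TO MONTH,   -- e.g., 2 years 7 months
    TimeWorked  INTERVAL DAY TO SECOND    -- e.g., 13 days, 5 hours,
                                          -- 6 minutes, 45 seconds
) ;
```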
The SQL/XML:2003 update to the ISO/IEC SQL standard introduced the XML data type. Values in the XML type are XML values, meaning you can now manage and query XML data in an SQL database.
With SQL/XML:2006, folks moved to the XQuery Data Model, which means that any XML value is also an XQuery sequence. The details of the XQuery Data Model are beyond the scope of this book. Refer to Querying XML, by Jim Melton and Stephen Buxton (published by Morgan Kaufmann), for detailed coverage of this topic.
With the introduction of SQL/XML:2006, three specific subtypes of the XML type were defined. They are XML(SEQUENCE)
, XML(CONTENT)
, and XML(DOCUMENT)
. The three subtypes are related to each other hierarchically. An XML(SEQUENCE)
is any sequence of XML nodes, XML values, or both. An XML(CONTENT)
is an XML(SEQUENCE)
that is an XML fragment wrapped in a document node. An XML(DOCUMENT)
is an XML(CONTENT)
that is a well-formed XML document.
Every XML value is at least an XML(SEQUENCE)
. An XML(SEQUENCE)
that is a document node is an XML(CONTENT)
. An XML(CONTENT)
that has legal document children is an XML(DOCUMENT)
.
XML types may be associated with an XML schema. There are three possibilities:
UNTYPED: There is no associated XML schema.
XMLSCHEMA: There is an associated XML schema.
ANY: There may or may not be an associated XML schema.
So a document of type XML(DOCUMENT(ANY))
may or may not have an associated XML schema. If you specify a column as being of type XML
with no modifiers, it must be either XML(SEQUENCE)
, XML(CONTENT(ANY))
, or XML(CONTENT(UNTYPED))
. Which of those it is depends on the implementation.
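Assuming your DBMS supports the SQL/XML type modifiers, a column declaration could pin down the subtype explicitly. The table here is invented for illustration:

```sql
CREATE TABLE REPORT (
    ReportID INTEGER,
    Body     XML(DOCUMENT(UNTYPED))   -- a well-formed document with
                                      -- no associated XML schema
) ;
```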
The ROW
type, introduced in the 1999 version of the ISO/IEC SQL standard (SQL:1999), represents the first break of SQL away from the relational model, as defined by its creator, Dr. E.F. Codd. With the introduction of this type, SQL databases can no longer be considered pure relational databases. One of the defining characteristics of Codd’s First Normal Form (1NF) is the fact that no field in a table row may be multivalued. Multivalued fields are exactly what the ROW
type introduces. The ROW
type enables you to place a whole row’s worth of data into a single field, effectively nesting a row within a row. To see how this works, create a ROW
type.
Note: The normal forms constrain the structure of database tables as a defense against anomalies, which are inconsistencies in table data or even outright wrong values. 1NF is the least restrictive of the normal forms, and thus the easiest to satisfy. Notwithstanding that, a table that includes a ROW
type fails the test of First Normal Form. According to Dr. Codd, such a table is not a relation, and thus cannot be present in a relational database. I give extensive coverage to normalization and the normal forms in Book 2, Chapter 2.
CREATE ROW TYPE address_type (
Street VARCHAR (25),
City VARCHAR (20),
State CHAR (2),
PostalCode VARCHAR (9)
) ;
This code effectively compresses four attributes into a single type. After you have created a ROW
type — such as address_type
in the preceding example — you can then use it in a table definition.
CREATE TABLE VENDOR (
VendorID INTEGER PRIMARY KEY,
VendorName VARCHAR (25),
Address address_type,
Phone VARCHAR (15)
) ;
If several tables, such as those for vendors, employees, customers, stockholders, or prospects, all need address information, you declare a single attribute of type address_type in each one rather than four separate columns. That may not seem like much of a savings, but you're not limited to putting just four attributes into a ROW type. What if you had to type the same forty attributes into a hundred tables?
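Once a ROW-typed column is in place, many implementations let you reach inside it with dot notation. Here is a sketch of what such a query might look like; the exact syntax for accessing a field of a ROW value varies by DBMS:

```sql
/* Retrieve each vendor's name along with the City field
   nested inside the Address ROW value */
SELECT VendorName, Address.City
FROM VENDOR
WHERE Address.State = 'OR' ;
```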
The introduction of ROW
types in SQL:1999 was not the only break from the ironclad rules of relational database theory. In that same version of the standard, the ARRAY
type was introduced, and in SQL:2003, the MULTISET
type was added. Both of these collection types violate the ol’ First Normal Form (1NF) and thus take SQL databases a couple of steps further away from relational purity.
The ARRAY
type violates 1NF, but not in the same way that the ROW
type does. The ARRAY
type enables you to enhance a field of an existing type by putting more than one entry into it. This creates a repeating group, which was demonized in Codd’s original formulation of the relational model, but now reappears as a desirable feature. Arrays are ordered in the sense that each element in the array corresponds to exactly one ordinal position in the array.
You might ask how a repeating group of the ARRAY
type differs from the ROW
type’s ability to put “a whole row’s worth of data into a single field.” The distinction is subtle. The ROW
type enables you to compress multiple different attributes into a single field, such as a street, city, state, and postal code. The repeating group of the ARRAY
type enables you to put multiple instances of the same attribute into a single field, such as a phone number and three alternate phone numbers.
As an example, suppose you want to have alternate ways of contacting your vendors in case the main telephone number does not work for you. Perhaps you would like the option of storing as many as four telephone numbers, just to be safe. A slight modification to the code shown previously will do the trick.
CREATE TABLE VENDOR (
VendorID INTEGER PRIMARY KEY,
VendorName VARCHAR (25),
Address address_type,
Phone VARCHAR (15) ARRAY [4]
) ;
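Because an array is ordered, each element can be addressed by its ordinal position. The following sketch retrieves a vendor's first phone number (the SQL standard uses one-based indexing, but element-access syntax and support vary by implementation; the VendorID value 101 is made up for illustration):

```sql
/* Pull the first of up to four phone numbers for one vendor */
SELECT VendorName, Phone[1]
FROM VENDOR
WHERE VendorID = 101 ;
```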
When he created the relational model, Dr. Codd made a conscious decision to sacrifice some functional flexibility in exchange for enhanced data integrity. The addition of the ARRAY
type, along with the ROW
type and later the MULTISET
type, takes back some of that flexibility in exchange for added complexity. That added complexity could lead to data integrity problems if it is not handled correctly. The more complex a system is, the more things that can go wrong, and the more opportunities there are for people to make mistakes.
Whereas an array is an ordered collection of elements, a multiset is an unordered collection. You cannot reference individual elements in a multiset because you don’t know where they are located in the collection. If you want to have multiples of an attribute, such as phone numbers, but don’t care what order they are listed in, you can use a multiset rather than an array.
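To store those same phone numbers when their order doesn't matter, you could declare the column as a multiset rather than an array. This is a sketch following the SQL:2003 syntax; vendor support for MULTISET is limited:

```sql
CREATE TABLE VENDOR (
    VendorID   INTEGER PRIMARY KEY,
    VendorName VARCHAR (25),
    Address    address_type,
    /* An unordered collection of phone numbers; individual
       elements cannot be addressed by position */
    Phone      VARCHAR (15) MULTISET
) ;
```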
REF
types are different from distinct data types such as INTEGER
or CHAR
. They are used in obscure circumstances by highly skilled SQL wizards, and just about nobody else. Instead of holding values, a REF
type references a user-defined structured type associated with a typed table. Typed tables are beyond the scope of this book, but I mention REF
type here for the sake of completeness.
REF
types are not a part of core SQL. This means that database vendors can claim compliance with the SQL standard without implementing REF
types.
The REF
type is an aspect of the object-oriented nature of SQL since the SQL:1999 standard. If object-oriented programming seems obscure to you, as it does to many programmers of a more traditional bent, you can probably survive quite well without ever needing the REF
type.
User-defined types (UDTs) are another addition to SQL imported from the world of object-oriented programming. If the data types that I have enumerated here are not enough for you, you can define your own data types. To do so, use the principles of abstract data types (ADTs) that are major features of such object-oriented languages as C++.
The object-oriented nature of UDTs becomes evident when you see that a UDT has attributes and methods encapsulated within it. The attribute definitions and the results of the methods are visible to the outside world, but the ways the methods are actually implemented are hidden from view. In this object-oriented world, you can declare attributes and methods to be public, private, or protected. A public attribute or method is available to anyone who uses the UDT. A private attribute or method may be used only by the UDT itself. A protected attribute or method may be used only by the UDT itself and its subtypes. (If this sounds familiar to you, don’t be surprised — an SQL UDT is much like a class in object-oriented programming.)
There are two kinds of UDTs: distinct types and structured types. The next sections take a look at each one in turn.
A distinct type is very similar to a regular predefined SQL type. In fact, a distinct type is derived directly from a predefined type, called the source type. You can create multiple distinct types from a single source type, each one distinct from all the others and from the source type. Here’s how to create a distinct type from a predefined type:
CREATE DISTINCT TYPE USdollar AS DECIMAL (10,2) ;
This definition (USdollar
) creates a new data type for (wait for it) U.S. dollars, based on the predefined DECIMAL
type. You can create additional distinct types in the same way:
CREATE DISTINCT TYPE Euro AS DECIMAL (10,2) ;
Now you can create tables that use the new types:
CREATE TABLE USinvoice (
InvoiceNo INTEGER PRIMARY KEY,
CustomerID INTEGER,
SalesID INTEGER,
SaleTotal USdollar,
Tax USdollar,
Shipping USdollar,
GrandTotal USdollar
) ;
CREATE TABLE Europeaninvoice (
InvoiceNo INTEGER PRIMARY KEY,
CustomerID INTEGER,
SalesID INTEGER,
SaleTotal Euro,
Tax Euro,
Shipping Euro,
GrandTotal Euro
) ;
The USdollar
type and the Euro
type are both based on the DECIMAL
type, but you cannot directly compare a USdollar
value to a Euro
value, nor can you directly compare either of those to a DECIMAL
value. This is consistent with reality because one U.S. dollar is not equal to one euro. However, it is possible to exchange dollars for euros and vice versa when traveling. You can make that exchange with SQL too, but not directly. You must use a CAST
operation, which I describe in Book 3, Chapter 1.
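For example, converting a dollar amount into euros takes two casts with the arithmetic in between: cast away the distinct type, apply the exchange rate, and cast the result into the target type. In this sketch, the 0.93 exchange rate and the InvoiceNo value 1001 are made-up placeholders:

```sql
/* Convert a USdollar value to a Euro value by way of DECIMAL.
   The 0.93 rate is a placeholder, not a real quote. */
SELECT CAST (CAST (SaleTotal AS DECIMAL (10,2)) * 0.93 AS Euro)
FROM USinvoice
WHERE InvoiceNo = 1001 ;
```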
Structured types are not based on a single source type as are the distinct types. Instead, they are expressed as a list of attributes and methods. When you create a structured UDT, the DBMS automatically creates a constructor function, a mutator function, and an observer function. The constructor for a UDT is given the same name as the UDT. Its job is to initialize the UDT’s attributes to their default values. When you invoke a mutator function, it changes the value of an attribute of a structured type. You can then use an observer function to retrieve the value of an attribute of a structured type. If you include an observer function in a SELECT
statement, it will retrieve values from the database.
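For instance, given the DATABASEBOOKS table that appears later in this section, an observer function could pull a single attribute out of the UDT column. This is a sketch; the method-invocation syntax varies by implementation:

```sql
/* Use the Title observer function to read one attribute
   of the structured UDT stored in StockItem */
SELECT StockItem.Title()
FROM DATABASEBOOKS
WHERE StockNumber = 271828 ;
```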
A hierarchical relationship can exist between two structured types. One structured type can be a “child” or subtype of a “parent” or supertype. Consider an example involving books. Suppose you have a UDT named BookUDT
, which has a subtype named NovelUDT
and another subtype named TechBookUDT
. BookUDT
is a supertype of both subtypes. Suppose further that TechBookUDT
has a subtype named DatabaseBookUDT
. DatabaseBookUDT
is not only a subtype of TechBookUDT
, but also a subtype of BookUDT
. Because DatabaseBookUDT
is a direct child of TechBookUDT,
it is considered a proper subtype of TechBookUDT
. Since DatabaseBookUDT
is not a direct child of BookUDT
, but rather a grandchild, it is not considered a proper subtype of BookUDT
.
A structured type that has no supertype is considered a maximal supertype, and a structured type that has no subtypes is considered a leaf subtype.
Here’s how you can create structured UDTs:
/* Create a UDT named BookUDT */
CREATE TYPE BookUDT AS (
/* Specify attributes */
Title CHAR (40),
Author CHAR (40),
MyCost DECIMAL (9,2),
ListPrice DECIMAL (9,2) )
/* Allow for subtypes */
NOT FINAL ;
/* Create a subtype named TechBookUDT */
CREATE TYPE TechBookUDT UNDER BookUDT NOT FINAL ;
/* Create a subtype named DatabaseBookUDT */
CREATE TYPE DatabaseBookUDT UNDER TechBookUDT FINAL ;
Note: In this code, comments are enclosed within /* comment */
pairs. The NOT FINAL
keywords indicate that even though a semicolon is closing out the statement, there is more to come. Subtypes are about to be defined under the supertype. The lowest level subtype closes out with the keyword FINAL
.
Now that the types are defined, you can create tables that use them.
CREATE TABLE DATABASEBOOKS (
StockItem DatabaseBookUDT,
StockNumber INTEGER
) ;
Now that the table exists, you can add data to it.
BEGIN
/* Declare a temporary variable x */
DECLARE x DatabaseBookUDT;
/* Execute the constructor function */
SET x = DatabaseBookUDT() ;
/* Execute the first mutator function */
SET x = x.Title('SQL for Dummies') ;
/* Execute the second mutator function */
SET x = x.Author('Allen G. Taylor') ;
/* Execute the third mutator function */
SET x = x.MyCost(23.56) ;
/* Execute the fourth mutator function */
SET x = x.ListPrice(29.99) ;
INSERT INTO DATABASEBOOKS VALUES (x, 271828) ;
END
Table 6-1 summarizes the SQL data types and gives an example of each.
TABLE 6-1 Data Types
Data Type | Example Value
[Most of the table's entries are unrecoverable. Of the notes that survive: REF is not an ordinary type, but a pointer to a referenced type; USdollar is a currency type based on DECIMAL.]
1 Argument specifies number of fractional digits.
SQL is different from practically any computer language that you may have encountered up to this point in that it allows null values. Other languages don’t. Allowing null values gives SQL a flexibility that other languages lack, but also contributes to the impedance mismatch between SQL and host languages that it must work with in an application. If an SQL database contains null values that the host language does not recognize, you have to come up with a plan that handles that difference in a consistent way.
I’m borrowing the term impedance mismatch from the world of electrical engineering. If, for example, you’ve set up your stereo system using speaker cable with a characteristic impedance of 50 ohms feeding speakers with an impedance of 8 ohms, you’ve got yourself a case of impedance mismatch and you’ll surely get fuzzy, noisy sound — definitely low fidelity. If a data type of a host language does not exactly match the corresponding data type of SQL, you have a similar situation, bad communication across the interface between the two.
A null value is a nonvalue. If you are talking about numeric data, a null value is not the same as zero; zero is a definite value (it is one less than one). If you are talking about character data, a null value is not the same as a blank space; a blank space is also a definite value. If you are talking about Boolean data, a null value is not the same as FALSE
. A false Boolean value is a definite value too.
A null value is the absence of a value. It reminds me of the Buddhist concept of emptiness. I almost feel that if I ever come to understand null values completely, I will have transcended the illusions of this world and achieved a state of enlightenment.
A field may contain a null value for several reasons: the value may exist but be unknown, the value may not be known yet, or the field may simply not apply to that particular row.
In any situation where knowledge is incomplete, null values are possible. Because in most application areas, knowledge is never complete, null values are very likely to appear in most databases.
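Null values also change how you write queries. Because a null is the absence of a value, an ordinary comparison such as Phone = NULL never evaluates to true; SQL provides the IS NULL predicate for this job. A sketch, using the CLIENT table defined in the constraints discussion that follows:

```sql
/* Wrong: this comparison is never true, because NULL is not
   equal to anything, not even another NULL */
SELECT ClientName FROM CLIENT WHERE Phone = NULL ;

/* Right: IS NULL tests for the absence of a value */
SELECT ClientName FROM CLIENT WHERE Phone IS NULL ;
```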
Constraints are one of the primary mechanisms for keeping the contents of a database from turning into a misleading or confusing mess. By applying constraints to tables, columns, or entire databases, you prevent the addition of invalid data or the deletion of data that is required to maintain overall consistency. A constraint can also identify invalid data that already exists in a database. If an operation that you perform in a transaction causes a constraint to be violated, the DBMS will prevent the transaction from taking effect (being committed). This protects the database from being put into an inconsistent state.
You can constrain the contents of a table column. In some cases, that means constraining what the column must contain, and in other cases, what it may not contain. There are three kinds of column constraints: the NOT NULL
, UNIQUE
, and CHECK
constraints.
Although SQL allows a column to contain null values, there are times when you want to be sure that a column always has a distinct value. In order for one row in a table to be distinguished from another, there must be some way of telling them apart. This is usually done with a primary key, which must have a unique value in every row. Because a null value in a column could be anything, it might match the value for that column in any of the other rows. Thus it makes sense to disallow a null value in the column that is used to distinguish one row from the rest. You can do this with a NOT NULL
constraint, as shown in the following example:
CREATE TABLE CLIENT (
ClientName CHAR (30) NOT NULL,
Address1 CHAR (30),
Address2 CHAR (30),
City CHAR (25),
State CHAR (2),
PostalCode CHAR (10),
Phone CHAR (13),
Fax CHAR (13),
ContactPerson CHAR (30)
) ;
When entering a new client into the CLIENT table, you must make an entry in the ClientName column.
The NOT NULL
constraint is a fairly weak constraint. You can satisfy the constraint as long as you put anything at all into the field, even if what you put into it would allow inconsistencies into your table. For example, suppose you already had a client named David Taylor in your database, and someone tried to enter another record with the same client name. If the table was protected only by a NOT NULL
constraint, the entry of the second David Taylor would be allowed. Now when you go to retrieve David Taylor’s information, which one will you get? How will you tell whether you have the one you want? A way around this problem is to use the stronger UNIQUE
constraint. The UNIQUE
constraint will not only disallow the entry of a null value in a column, but it will also disallow the entry of a value that matches a value already in the column.
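Adding the UNIQUE constraint to the CLIENT table is a small change. An abbreviated sketch:

```sql
CREATE TABLE CLIENT (
    /* UNIQUE disallows duplicate names; combined with NOT NULL,
       every row must carry its own distinct ClientName */
    ClientName CHAR (30) NOT NULL UNIQUE,
    Address1   CHAR (30),
    City       CHAR (25),
    State      CHAR (2),
    Phone      CHAR (13)
) ;
```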
Use the CHECK
constraint for preventing the entry of invalid data that goes beyond maintaining uniqueness. For example, you can check to make sure that a numeric value falls within an allowed range. You can also check to see that a particular character string is not entered into a column.
Here’s an example that ensures that the charge for a service falls within the acceptable range. It ensures that a customer is not mistakenly given a credit rather than a debit, and that she is not charged a ridiculously high amount either.
CREATE TABLE TESTS (
TestName CHARACTER (30) NOT NULL,
StandardCharge NUMERIC (6,2)
CHECK (StandardCharge >= 0.00
AND StandardCharge <= 200.00)
) ;
The constraint is satisfied only if the charge is positive and less than or equal to $200.
Sometimes a constraint applies not just to a column, but to an entire table. The PRIMARY KEY
constraint is the principal example of a table constraint; it applies to an entire table.
Although a primary key may consist of a single column, it could also be made up of a combination of two or more columns. Because a primary key must be guaranteed to be unique, multiple columns may be needed if one column is not enough to guarantee uniqueness.
To see what I mean, check out the following, which shows a table with a single-column primary key:
CREATE TABLE PROSPECT (
ProspectName CHAR (30) PRIMARY KEY,
Address1 CHAR (30),
Address2 CHAR (30),
City CHAR (25),
State CHAR (2),
PostalCode CHAR (10),
Phone CHAR (13),
Fax CHAR (13)
) ;
The primary key constraint in this case is listed with the ProspectName column, but it is nonetheless a table constraint because it guarantees that the table contains no duplicate rows. By applying the primary key constraint to ProspectName, you are guaranteeing that ProspectName cannot have a null value, and no entry in the ProspectName column may duplicate another entry in the ProspectName column. Because ProspectName is guaranteed to be unique, every row in the table must be distinguishable from every other row.
ProspectName may not be a particularly good choice for a proposed primary key. Some people have rather common names, such as Joe Wilson or Jane Adams. It is quite possible that two people with the same name might both be prospects of your business. You could overcome that problem by using more than one column for the primary key. Here's one way to do that:
CREATE TABLE PROSPECT (
ProspectName CHAR (30) NOT NULL,
Address1 CHAR (30) NOT NULL,
Address2 CHAR (30),
City CHAR (25),
State CHAR (2),
PostalCode CHAR (10),
Phone CHAR (13),
CONSTRAINT prospect_pk PRIMARY KEY
(ProspectName, Address1)
) ;
A composite primary key is made up of both ProspectName and Address1.
You might ask, “What if a father and son have the same name and live at the same address?” The more such scenarios you think up, the more complex things tend to get. In many cases, it’s best to make up a unique ID number for every row in a table and let that be the primary key. If you use an autoincrementer to generate the keys, you can be sure they are unique, which keeps things relatively simple. You can also generate your own unique ID numbers by keeping a counter, using its current value as the primary key of each new record, and incrementing it by one after every addition.
CREATE TABLE PROSPECT (
ProspectID INTEGER PRIMARY KEY,
ProspectName CHAR (30),
Address1 CHAR (30),
Address2 CHAR (30),
City CHAR (25),
State CHAR (2),
PostalCode CHAR (10),
Phone CHAR (13)
) ;
Many database management systems automatically create autoincrementing primary keys for you as you enter new rows into a table.
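The syntax for such autoincrementing keys is vendor-specific. Two common forms, sketched here (check your DBMS's documentation for the exact spelling):

```sql
/* SQL standard identity column, supported by PostgreSQL,
   DB2, and Oracle 12c and later */
CREATE TABLE PROSPECT (
    ProspectID   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ProspectName CHAR (30)
) ;

/* MySQL's AUTO_INCREMENT attribute */
CREATE TABLE PROSPECT (
    ProspectID   INTEGER AUTO_INCREMENT PRIMARY KEY,
    ProspectName CHAR (30)
) ;
```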
Relational databases are categorized as they are because the data is stored in tables that are related to each other in some way. The relationship occurs because a row in one table may be directly related to one or more rows in another table.
For example, in a retail database, the record in the CUSTOMER table for customer Lisa Mazzone is directly related to the records in the INVOICE table for purchases that Ms. Mazzone has made. To establish this relationship, one or more columns in the CUSTOMER table must have corresponding columns in the INVOICE table.
The primary key of the CUSTOMER table uniquely identifies each customer. The primary key of the INVOICE table uniquely identifies each invoice. In addition, the primary key of the CUSTOMER table acts as a foreign key in INVOICE to link the two tables. In this setup, the foreign key in each row of the INVOICE table identifies the customer who made this particular purchase. Here’s an example:
CREATE TABLE CUSTOMER (
CustomerID INTEGER PRIMARY KEY,
CustomerName CHAR (30),
Address1 CHAR (30),
Address2 CHAR (30),
City CHAR (25),
State CHAR (2),
PostalCode CHAR (10),
Phone CHAR (13)
) ;
CREATE TABLE SALESPERSON (
SalespersonID INTEGER PRIMARY KEY,
SalespersonName CHAR (30),
Address1 CHAR (30),
Address2 CHAR (30),
City CHAR (25),
State CHAR (2),
PostalCode CHAR (10),
Phone CHAR (13)
) ;
CREATE TABLE INVOICE (
InvoiceNo INTEGER PRIMARY KEY,
CustomerID INTEGER,
SalespersonID INTEGER,
CONSTRAINT customer_fk FOREIGN KEY (CustomerID)
REFERENCES CUSTOMER (CustomerID),
CONSTRAINT salesperson_fk FOREIGN KEY (SalespersonID)
REFERENCES SALESPERSON (SalespersonID)
) ;
Each invoice is related to the customer who made the purchase and the salesperson who made the sale.
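With the foreign keys in place, a query can traverse the relationships; joining on the shared key columns reassembles each invoice with its customer and salesperson:

```sql
/* Follow both foreign keys from INVOICE back to the rows
   they reference */
SELECT I.InvoiceNo, C.CustomerName, S.SalespersonName
FROM INVOICE I
JOIN CUSTOMER C ON I.CustomerID = C.CustomerID
JOIN SALESPERSON S ON I.SalespersonID = S.SalespersonID ;
```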
Using constraints in this way is what makes relational databases relational. This is the core of the whole thing right here! How do the tables in a relational database relate to each other? They relate by the keys they hold in common. The relationship is established, but also constrained, by the fact that a column in one table has to match a corresponding column in another table. The only relationships present in a relational database are those where there is a key-to-key link mediated by a foreign key constraint.
Sometimes a constraint may apply not just to a column or a table, but to multiple tables or even an entire database. A constraint with such broad applicability is called an assertion.
Suppose a small bookstore wants to control its exposure to dead inventory by not allowing total inventory to grow beyond 20,000 items. Suppose further that stocks of books and DVDs are maintained in different tables — the BOOKS and DVD tables. An assertion can guarantee that the maximum is not exceeded.
CREATE TABLE BOOKS (
ISBN INTEGER,
Title CHAR (50),
Quantity INTEGER ) ;
CREATE TABLE DVD (
BarCode INTEGER,
Title CHAR (50),
Quantity INTEGER ) ;
CREATE ASSERTION inventory_limit  /* the standard requires a name */
CHECK ((SELECT SUM (Quantity)
FROM BOOKS)
+ (SELECT SUM (Quantity)
FROM DVD)
< 20000) ;
This assertion adds up all the books in stock, then adds up all the DVDs in stock, and finally adds those two sums together. It then checks to see that the sum of them all is less than 20,000. Whenever an attempt is made to add a book or DVD to inventory, and that addition would push total inventory to 20,000 or more, the assertion is violated and the addition is not allowed.
Most popular implementations do not support assertions. For example, SQL Server 2016, DB2, Oracle Database 18c, SAP SQL Anywhere, MySQL, and PostgreSQL do not. Assertions may become available in the future, since they are a part of SQL:2003, but it would not be wise to hold your breath until this functionality appears. Although assertions would be a nice feature to have, they are far down on the list of features to add for most DBMS vendors.
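Until assertions are widely supported, you can run the equivalent check yourself (or wrap it in a trigger) before committing new stock. This query performs the same test the assertion would:

```sql
/* Total items in stock across both tables; compare the
   result against the 20,000-item limit by hand */
SELECT (SELECT SUM (Quantity) FROM BOOKS)
     + (SELECT SUM (Quantity) FROM DVD)
  AS TotalInventory ;
```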