In Chapters 15, “Creating Tables,” and 16, “Specifying Integrity Constraints,” we have shown which statements we can use to create tables with their integrity constraints. This chapter looks more closely at the process of designing a database structure. Before you can create a database, you must have designed a structure for it. During this design process, you decide which tables should be defined and which columns should be included in each table. The process of designing databases, then, is comparable to the work of an architect, while creating tables resembles the construction job.
For any given application, there are generally several possible table structures. This choice can be subject to different factors:
Storage space available
Maximum acceptable processing time for updates
Maximum acceptable processing time for SELECT
statements
Security
Before starting the actual design, the designer must decide which factors are most relevant to the situation. Is it essential to save as much storage space as possible? Should SELECT
statements take, at most, 3 seconds of processing time? Or should we take into consideration several different factors simultaneously?
Having to consider a combination of factors nearly always leads to conflicts. A savings in storage space, for example, means that SELECT
statements will take longer to process. Looking at this another way, if every SELECT
statement must be processed quickly, much of the data must be stored repeatedly, and this, of course, requires more storage space. Additionally, data redundancy leads to slower processing of updates because each logical update requires an update to more than one table.
Many different techniques exist for designing a table structure. We do not describe these techniques here because they fall outside the context of this book. However, in the following sections, we present ten basic guidelines for designing a database structure. For each guideline, we try to indicate its influence on the first four factors listed previously.
Determining the tables and columns in a database design is the most important aspect of the design process. Much has already been written about the subject. We concentrate only on the most important guidelines. You can find more comprehensive coverage of the database design process in [DATE95] and [SIMS04].
If a table has several candidate keys, you must make a choice. Always choose the one that consists of the smallest number of columns as the primary key. This simplifies the process of joining tables for, among other things, SELECT
statements.
If this criterion does not lead to a good solution for some reason, choose the column that contains “internal” values. By “internal,” we mean values for which your own organization is the owner. For example, do not choose passport number as a key because the government is the owner of this number.
If this criterion does not work, choose the primary key that uses the least amount of storage space—for example, a CHAR(5)
column instead of a VARCHAR(30)
column.
This guideline is often referred to in the literature as the Boyce-Codd normal form; see, among others, [DATE95].
This is the first time we have used the term determinant, so it needs to be explained. Column A is a determinant of column B if, for each different value in A there is, at most, one associated value in B. The PLAYERNO
column in the PLAYERS
table, for example, is a determinant of all other columns in the table. A determinant can consist of more than one column.
Imagine that the column DETER
in table T
is a determinant of column C
. The following SELECT
statement will never return a result (work it out for yourself):
SELECT DETER FROM T GROUP BY DETER HAVING COUNT(DISTINCT C) > 1
Here is an example of a table design that does not follow the second guideline:
PLAYERNO NAME TEAMNO DIVISION -------- --------- ------ -------- 6 Parmenter 1 first 44 Baker 1 first 27 Collins 2 second 104 Moorman 2 second
The PLAYERNO
column is the primary key. Thus, the table follows the first guideline. The determinant of the NAME
column is PLAYERNO
. This is also true for TEAMNO
and DIVISION
; every PLAYERNO
belongs to, at most, one TEAMNO
and one DIVISION
. However, TEAMNO
is also a determinant of DIVISION
because every TEAMNO
has, at most, one associated DIVISION
. TEAMNO
, then, is a determinant but not a candidate key. The conclusion is that the table does not follow the second guideline.
The most significant disadvantage of a table that does not comply with the second guideline is that certain “facts” are recorded several times. In the previous example, the fact that team 1 plays in the first division is recorded more than once. This situation leads to more complex updating requirements and inefficient use of storage space and, finally, to inconsistent data.
Columns that contain the same type of data, have the same meaning, and are placed in the same table form what is known a repeating group. Imagine that each player must register the first name of his or her children (to participate in a club Christmas party). A possible table structure for the CHILDREN
table is:
PLAYERNO CNAME1 CNAME2 CNAME3 -------- ------- ------ ------ 6 Milly Diana Judy 44 ? ? ? 83 William Jimmy ?
The columns CNAME1
, CNAME2
, and CNAME3
form a repeating group. They contain the same type of data—that is, the name of a child. They also have the same significance; they are children belonging to one player. Up to three children may be registered for each player. What are the consequences of such a design, for example, for SELECT
and UPDATE
statements? An example of each follows.
Example 19.1. Find, for each player, the number of registered children.
SELECT PLAYERNO, 0 FROM CHILDREN WHERE CNAME1 IS NULL AND CNAME2 IS NULL AND CNAME3 IS NULL UNION SELECT PLAYERNO, 1 FROM CHILDREN WHERE CNAME1 IS NOT NULL AND CNAME2 IS NULL AND CNAME3 IS NULL UNION SELECT PLAYERNO, 2 FROM CHILDREN WHERE CNAME1 IS NOT NULL AND CNAME2 IS NOT NULL AND CNAME3 IS NULL UNION SELECT PLAYERNO, 3 FROM CHILDREN WHERE CNAME1 IS NOT NULL AND CNAME2 IS NOT NULL AND CNAME3 IS NOT NULL UNION SELECT PLAYERNO, 0 FROM PLAYERS WHERE PLAYERNO NOT IN (SELECT PLAYERNO FROM CHILDREN)
And the result is:
PLAYERNO ----------------- 6 3 44 0 83 2 2 0 27 0 : :
Example 19.2. The name Diana (one of the children belonging to player 6) must be changed to Diane.
First, we look for the column in which Diana is recorded:
SELECT * FROM CHILDREN WHERE PLAYERNO = 6
From the result, we see that the column CNAME2
must be updated. The UPDATE
statement looks like this:
UPDATE CHILDREN SET CNAME2 = 'Diane' WHERE PLAYERNO = 6
Allowing repeating groups makes many statements rather complex. The CHILDREN
table can also be designed without a repeating group:
The CHILDREN
table:
PLAYERNO CNAME -------- ------- 6 Milly 6 Diana 6 Judy 83 William 83 Jimmy
The primary key of this CHILDREN
table is formed by the columns PLAYERNO
and CNAME
. The formulation of the two previous statements looks as follows:
The first statement is:
SELECT PLAYERNO, COUNT(*) FROM PLAYERS LEFT OUTER JOIN CHILDREN ON PLAYERS.PLAYERNO = CHILDREN.PLAYERNO GROUP BY PLAYERNO
The second statement is:
UPDATE CHILDREN SET CNAME = 'Diane' WHERE PLAYERNO = 6 AND CNAME = 'Diana'
It seems that repeating groups often give rise to more complex statements and, therefore, should be avoided as much as possible. Moreover, the number of columns in a repeating group must be adapted to the maximum number of values possible. In the previous example, this might be as many as ten children! This applies to every row, of course, and puts excessive pressure on storage space.
The PLAYERS
table consists of 13 columns. Some of these could be joined to make a single column. For example, the columns STREET
, HOUSENO
, and TOWN
could be merged into one column called ADDRESS
. This can make some SELECT
statements easier to formulate.
Example 19.3. Get the address of player 44.
SELECT ADDRESS FROM PLAYERS WHERE PLAYERNO = 44
The result is:
ADDRESS -------------------------- 23 Lewis Street, Inglewood
On the other hand, other questions are very difficult to formulate:
To retrieve the TOWN
of a player, we have to use a complex expression in which some scalar functions are combined. We assume that the town name is preceded by a comma and a space:
SUBSTR(ADDRESS, LOCATE(',', ADDRESS) + 2, LENGTH(ADDRESS))
Note that not all SQL products support these functions. For those that do not, it is not possible to retrieve just a part of a column value with only one statement.
When you select rows on the basis of town name, the previous expression must be used. You can be sure that this statement will take a long time to process.
Selecting rows on the basis of the street name is not possible because the LIKE
operator must be used.
Selecting rows by house number will be impossible. Where does the house number begin? After a space? No, because some street names consist of several words. Does the house number begin at the first number? No, because some house numbers consist of letters. To answer this question, we have to include a special symbol between the street name and the house number.
Imagine that Edgecombe Way is renamed Park Way and that this change must be reflected throughout the table. SQL must look at each row separately and possibly perform the update.
These disadvantages are specific to this example but provide a clear basis for generalization.
A design that satisfies the guidelines given in the previous section simplifies the formulation of SELECT
and UPDATE
statements. Processing update statements is fast because each “fact” is registered only once. Processing SELECT
statements is another story. Precisely because each fact is recorded only once, many joins must be executed. Processing joins and other SELECT
statements can be very time-consuming. One way to tackle this problem is to include redundant data in a table. Here is an example of a join and another SELECT
statement, both of which can be executed faster when redundant data has been added.
Example 19.4. Get the name of each player who incurred at least one penalty.
SELECT NAME, AMOUNT FROM PENALTIES AS PEN, PLAYERS AS P WHERE PEN.PLAYERNO = P.PLAYERNO
SQL must perform a join to process this statement. The join can be avoided by storing the NAME
column as redundant data in the PENALTIES
table.
The new PENALTIES
table is:
PAYMENTNO PLAYERNO NAME PAYMENT_DATE AMOUNT --------- -------- --------- ------------ ------ 1 6 Parmenter 1980-12-08 100.00 2 44 Baker 1981-05-05 75.00 3 27 Collins 1983-09-10 100.00 4 104 Moorman 1984-12-08 50.00 5 44 Baker 1980-12-08 25.00 6 8 Newcastle 1980-12-08 25.00 7 44 Baker 1980-12-30 30.00 8 27 Collins 1984-11-12 75.00
The statement then becomes:
SELECT NAME, AMOUNT FROM PENALTIES
This SELECT
statement will definitely be executed faster than the previous one.
This method of adding redundant data is sometimes called denormalization. A disadvantage of denormalization is the need to store some facts more than once. The names of players, for example, are now recorded in the PLAYERS
table and in the PENALTIES
table. Updating the name of a player requires two separate update statements. Another disadvantage of denormalization is that recording the same fact in more than one place uses twice as much storage space. With denormalization, you must weigh the relative importance of faster execution time of SELECT
statements against slower execution time of updates and the storage space needed. In practice, this means that denormalization is used more often for data warehouses than for transaction databases.
Example 19.5. Get, for each player, the total amount of penalties incurred by him or her.
SELECT PLAYERNO, SUM(AMOUNT) AS TOTAL FROM PLAYERNO LEFT OUTER JOIN PENALTIES ON PLAYERS.PLAYERNO = PENALTIES.PLAYERNO GROUP BY PLAYERNO ORDER BY 1
The result is:
PLAYERNO TOTAL -------- ------ 2 0.00 6 100.00 7 0.00 8 25.00 27 175.00 28 0.00 39 0.00 44 130.00 57 0.00 83 0.00 95 0.00 100 0.00 104 50.00 112 0.00
This statement could be greatly simplified if the total amount of penalties were registered in the PLAYERS
table. The statement then would become:
SELECT PLAYERNO, TOT_AMOUNT FROM PLAYERS
In this way, the processing time would be greatly reduced, but here also the same disadvantages as for denormalization are found: more updates and duplication of data. We have to conclude, however, that the performance of OLAP and other business intelligence tools improves considerably as a result of denormalization.
In both examples, redundancy took the form of adding one extra column. Creating an entirely new redundant table is sometimes an attractive alternative. (Invent an example for yourself.) Guideline 5 can be derived from this story now.
The design of a database includes the task of choosing a data type for each column. This section presents a number of guidelines that can assist you in this choice.
In SELECT
and UPDATE
statements, columns are compared with one another. The columns WON
and LOST
are compared in the following statement:
SELECT MATCHNO FROM MATCHES WHERE WON - 2 > LOST
In contrast to this example, the following statement compares two columns that come from different tables:
SELECT NAME FROM PLAYERS INNER JOIN PENALTIES ON PLAYERS.PLAYERNO = PENALTIES.PLAYERNO
Two data types are the same if the data type (CHAR
, SMALLINT
, and so on) and the defined length are the same.
If you perform calculations on values in a column, the column must be defined with a numeric data type. Otherwise, the calculations might be impossible. Sometimes, you might be inclined to give a numeric data type to a column that records specific codes (with no intrinsic significance) that consist entirely of digits (for example, the league number). The advantage of a numeric column is that it requires little storage space. Coding systems, on the other hand, change frequently. During the design, all the codes might be numeric, but the question is whether that will always be the case. Conversions from numeric to alphanumeric values, should the need arise, are not simple. Therefore, define a column as numeric only if this is necessary for calculations.
Independent of the data type, a column must have a length defined that allows its longest value to be accommodated. Work out how long the longest value is for each column. Do not assume that the largest value is one of your existing values; think also about possible future values.
For columns with an alphanumeric data type, there is a choice between values with a variable length and values with a fixed length. Variable length has been designed to save storage space. In the first instance, it often appears to be the best choice, but be careful: It is not always as good as it seems. Columns with a variable length have two disadvantages. First, for each value in such a column, the length of the particular value is recorded (this is not visible for users). This, of course, uses extra storage space. Second, in SELECT
and UPDATE
statements, columns with a variable length perform more slowly than columns with a fixed length. The general guideline is to use alphanumeric data types with a variable length only if, on average, there would be at least 15 unused positions for most values.
When must you specify NOT NULL
behind a column in a CREATE TABLE
statement?
Never use the NULL
value in an artificial manner. Never use it to represent something other than an unknown value because working with NULL
values in calculations can be tricky, especially in conjunction with aggregation functions (see Chapter 9, “SELECT
Statement: SELECT
Clause and Aggregation Functions”).
Additionally, it must be stated that, for some products, such as DB2, for each value in a column that has not been defined NOT NULL
, an extra (invisible) byte is stored. This byte is used by the product to indicate whether the value is NULL
. In other words, a NOT NULL
column uses less storage space than an identical column with the same data type, but without NOT NULL
.
Closing remark: Many more factors and guidelines can influence the design of a database structure than we have mentioned here. Nevertheless, we have discussed the most important ones in this chapter.
18.223.33.157