Chapter 19. Designing Tables

Introduction

In Chapters 15, “Creating Tables,” and 16, “Specifying Integrity Constraints,” we have shown which statements we can use to create tables with their integrity constraints. This chapter looks more closely at the process of designing a database structure. Before you can create a database, you must have designed a structure for it. During this design process, you decide which tables should be defined and which columns should be included in each table. The process of designing databases, then, is comparable to the work of an architect, while creating tables resembles the construction job.

For any given application, there are generally several possible table structures. This choice can be subject to different factors:

  • Storage space available

  • Maximum acceptable processing time for updates

  • Maximum acceptable processing time for SELECT statements

  • Security

Before starting the actual design, the designer must decide which factors are most relevant to the situation. Is it essential to save as much storage space as possible? Should SELECT statements take, at most, 3 seconds of processing time? Or should we take into consideration several different factors simultaneously?

Having to consider a combination of factors nearly always leads to conflicts. A savings in storage space, for example, means that SELECT statements will take longer to process. Looking at this another way, if every SELECT statement must be processed quickly, much of the data must be stored repeatedly, and this, of course, requires more storage space. Additionally, data redundancy leads to slower processing of updates because each logical update requires an update to more than one table.

Many different techniques exist for designing a table structure. We do not describe these techniques here because they fall outside the context of this book. However, in the following sections, we present ten basic guidelines for designing a database structure. For each guideline, we try to indicate its influence on the first four factors listed previously.

Which Tables and Columns?

Determining the tables and columns in a database design is the most important aspect of the design process. Much has already been written about the subject. We concentrate only on the most important guidelines. You can find more comprehensive coverage of the database design process in [DATE95] and [SIMS04].

Guideline 1: Define a Primary Key for Each Table

If a table has several candidate keys, you must make a choice. Always choose the one that consists of the smallest number of columns as the primary key. This simplifies the process of joining tables for, among other things, SELECT statements.

If this criterion does not lead to a good solution for some reason, choose the column that contains “internal” values. By “internal,” we mean values for which your own organization is the owner. For example, do not choose passport number as a key because the government is the owner of this number.

If this criterion does not work, choose the primary key that uses the least amount of storage space—for example, a CHAR(5) column instead of a VARCHAR(30) column.

Guideline 2: Each Determinant in a Table Must Be a Candidate Key of That Table

This guideline is often referred to in the literature as the Boyce-Codd normal form; see, among others, [DATE95].

This is the first time we have used the term determinant, so it needs to be explained. Column A is a determinant of column B if, for each different value in A there is, at most, one associated value in B. The PLAYERNO column in the PLAYERS table, for example, is a determinant of all other columns in the table. A determinant can consist of more than one column.

Imagine that the column DETER in table T is a determinant of column C. The following SELECT statement will never return a result (work it out for yourself):

SELECT   DETER
FROM     T
GROUP BY DETER
HAVING   COUNT(DISTINCT C) > 1

Here is an example of a table design that does not follow the second guideline:

PLAYERNO  NAME       TEAMNO  DIVISION
--------  ---------  ------  --------
       6  Parmenter  1       first
      44  Baker      1       first
      27  Collins    2       second
     104  Moorman    2       second

The PLAYERNO column is the primary key. Thus, the table follows the first guideline. The determinant of the NAME column is PLAYERNO. This is also true for TEAMNO and DIVISION; every PLAYERNO belongs to, at most, one TEAMNO and one DIVISION. However, TEAMNO is also a determinant of DIVISION because every TEAMNO has, at most, one associated DIVISION. TEAMNO, then, is a determinant but not a candidate key. The conclusion is that the table does not follow the second guideline.

The most significant disadvantage of a table that does not comply with the second guideline is that certain “facts” are recorded several times. In the previous example, the fact that team 1 plays in the first division is recorded more than once. This situation leads to more complex updating requirements and inefficient use of storage space and, finally, to inconsistent data.

Guideline 3: Do Not Use Repeating Groups in a Table

Columns that contain the same type of data, have the same meaning, and are placed in the same table form what is known a repeating group. Imagine that each player must register the first name of his or her children (to participate in a club Christmas party). A possible table structure for the CHILDREN table is:

PLAYERNO  CNAME1   CNAME2  CNAME3
--------  -------  ------  ------
       6  Milly    Diana   Judy
      44  ?        ?       ?
      83  William  Jimmy   ?

The columns CNAME1, CNAME2, and CNAME3 form a repeating group. They contain the same type of data—that is, the name of a child. They also have the same significance; they are children belonging to one player. Up to three children may be registered for each player. What are the consequences of such a design, for example, for SELECT and UPDATE statements? An example of each follows.

Example 19.1. Find, for each player, the number of registered children.

SELECT   PLAYERNO, 0
FROM     CHILDREN
WHERE    CNAME1 IS NULL
AND      CNAME2 IS NULL
AND      CNAME3 IS NULL
UNION
SELECT   PLAYERNO, 1
FROM     CHILDREN
WHERE    CNAME1 IS NOT NULL
AND      CNAME2 IS NULL
AND      CNAME3 IS NULL
UNION
SELECT   PLAYERNO, 2
FROM     CHILDREN
WHERE    CNAME1 IS NOT NULL
AND      CNAME2 IS NOT NULL
AND      CNAME3 IS NULL
UNION
SELECT   PLAYERNO, 3
FROM     CHILDREN
WHERE    CNAME1 IS NOT NULL
AND      CNAME2 IS NOT NULL
AND      CNAME3 IS NOT NULL
UNION
SELECT   PLAYERNO, 0
FROM     PLAYERS
WHERE    PLAYERNO NOT IN
        (SELECT   PLAYERNO
         FROM     CHILDREN)

And the result is:

PLAYERNO
-----------------
       6        3
      44        0
      83        2
       2        0
      27        0
       :        :

Example 19.2. The name Diana (one of the children belonging to player 6) must be changed to Diane.

First, we look for the column in which Diana is recorded:

SELECT   *
FROM     CHILDREN
WHERE    PLAYERNO = 6

From the result, we see that the column CNAME2 must be updated. The UPDATE statement looks like this:

UPDATE    CHILDREN
SET       CNAME2 = 'Diane'
WHERE     PLAYERNO = 6

Allowing repeating groups makes many statements rather complex. The CHILDREN table can also be designed without a repeating group:

The CHILDREN table:

PLAYERNO  CNAME
--------  -------
       6  Milly
       6  Diana
       6  Judy
      83  William
      83  Jimmy

The primary key of this CHILDREN table is formed by the columns PLAYERNO and CNAME. The formulation of the two previous statements looks as follows:

The first statement is:

SELECT   PLAYERNO, COUNT(*)
FROM     PLAYERS LEFT OUTER JOIN CHILDREN
         ON PLAYERS.PLAYERNO = CHILDREN.PLAYERNO
GROUP BY PLAYERNO

The second statement is:

UPDATE   CHILDREN
SET      CNAME = 'Diane'
WHERE    PLAYERNO = 6
AND      CNAME = 'Diana'

It seems that repeating groups often give rise to more complex statements and, therefore, should be avoided as much as possible. Moreover, the number of columns in a repeating group must be adapted to the maximum number of values possible. In the previous example, this might be as many as ten children! This applies to every row, of course, and puts excessive pressure on storage space.

Guideline 4: Do Not Concatenate Columns

The PLAYERS table consists of 13 columns. Some of these could be joined to make a single column. For example, the columns STREET, HOUSENO, and TOWN could be merged into one column called ADDRESS. This can make some SELECT statements easier to formulate.

Example 19.3. Get the address of player 44.

SELECT   ADDRESS
FROM     PLAYERS
WHERE    PLAYERNO = 44

The result is:

ADDRESS
--------------------------
23 Lewis Street, Inglewood

On the other hand, other questions are very difficult to formulate:

  • To retrieve the TOWN of a player, we have to use a complex expression in which some scalar functions are combined. We assume that the town name is preceded by a comma and a space:

    SUBSTR(ADDRESS, LOCATE(',', ADDRESS) + 2, LENGTH(ADDRESS))
    

    Note that not all SQL products support these functions. For those that do not, it is not possible to retrieve just a part of a column value with only one statement.

  • When you select rows on the basis of town name, the previous expression must be used. You can be sure that this statement will take a long time to process.

  • Selecting rows on the basis of the street name is not possible because the LIKE operator must be used.

  • Selecting rows by house number will be impossible. Where does the house number begin? After a space? No, because some street names consist of several words. Does the house number begin at the first number? No, because some house numbers consist of letters. To answer this question, we have to include a special symbol between the street name and the house number.

  • Imagine that Edgecombe Way is renamed Park Way and that this change must be reflected throughout the table. SQL must look at each row separately and possibly perform the update.

These disadvantages are specific to this example but provide a clear basis for generalization.

Exercise 19.1:

Which column in the PLAYERS table does not obey the fourth guideline?

Exercise 19.2:

Create an alternative design for the PLAYERS table so that it follows the second and fourth guidelines. Here, we assume that for each town there exists only one area code.

Adding Redundant Data

A design that satisfies the guidelines given in the previous section simplifies the formulation of SELECT and UPDATE statements. Processing update statements is fast because each “fact” is registered only once. Processing SELECT statements is another story. Precisely because each fact is recorded only once, many joins must be executed. Processing joins and other SELECT statements can be very time-consuming. One way to tackle this problem is to include redundant data in a table. Here is an example of a join and another SELECT statement, both of which can be executed faster when redundant data has been added.

Example 19.4. Get the name of each player who incurred at least one penalty.

SELECT   NAME, AMOUNT
FROM     PENALTIES AS PEN, PLAYERS AS P
WHERE    PEN.PLAYERNO = P.PLAYERNO

SQL must perform a join to process this statement. The join can be avoided by storing the NAME column as redundant data in the PENALTIES table.

The new PENALTIES table is:

PAYMENTNO  PLAYERNO  NAME       PAYMENT_DATE  AMOUNT
---------  --------  ---------  ------------  ------
        1         6  Parmenter  1980-12-08    100.00
        2        44  Baker      1981-05-05     75.00
        3        27  Collins    1983-09-10    100.00
        4       104  Moorman    1984-12-08     50.00
        5        44  Baker      1980-12-08     25.00
        6         8  Newcastle  1980-12-08     25.00
        7        44  Baker      1980-12-30     30.00
        8        27  Collins    1984-11-12     75.00

The statement then becomes:

SELECT   NAME, AMOUNT
FROM     PENALTIES

This SELECT statement will definitely be executed faster than the previous one.

This method of adding redundant data is sometimes called denormalization. A disadvantage of denormalization is the need to store some facts more than once. The names of players, for example, are now recorded in the PLAYERS table and in the PENALTIES table. Updating the name of a player requires two separate update statements. Another disadvantage of denormalization is that recording the same fact in more than one place uses twice as much storage space. With denormalization, you must weigh the relative importance of faster execution time of SELECT statements against slower execution time of updates and the storage space needed. In practice, this means that denormalization is used more often for data warehouses than for transaction databases.

Example 19.5. Get, for each player, the total amount of penalties incurred by him or her.

SELECT   PLAYERNO, SUM(AMOUNT) AS TOTAL
FROM     PLAYERNO LEFT OUTER JOIN PENALTIES
         ON PLAYERS.PLAYERNO = PENALTIES.PLAYERNO
GROUP BY PLAYERNO
ORDER BY 1

The result is:

PLAYERNO   TOTAL
--------  ------
       2    0.00
       6  100.00
       7    0.00
       8   25.00
      27  175.00
      28    0.00
      39    0.00
      44  130.00
      57    0.00
      83    0.00
      95    0.00
     100    0.00
     104   50.00
     112    0.00

This statement could be greatly simplified if the total amount of penalties were registered in the PLAYERS table. The statement then would become:

SELECT   PLAYERNO, TOT_AMOUNT
FROM     PLAYERS

In this way, the processing time would be greatly reduced, but here also the same disadvantages as for denormalization are found: more updates and duplication of data. We have to conclude, however, that the performance of OLAP and other business intelligence tools improves considerably as a result of denormalization.

In both examples, redundancy took the form of adding one extra column. Creating an entirely new redundant table is sometimes an attractive alternative. (Invent an example for yourself.) Guideline 5 can be derived from this story now.

Guideline 5: Add Redundant Data When the Processing Time of SELECT Statements Is Not Acceptable

Exercise 19.3:

Create a design for the MATCHES table so that the following query would no longer need a join. Additionally, give the changed formulation of the SELECT statement.

SELECT   M.MATCHNO, M.TEAMNO, T.DIVISION
FROM     MATCHES AS M INNER JOIN TEAMS AS T
         ON M.TEAMNO = T.TEAMNO

Exercise 19.4:

Give an alternative design for the PLAYERS table so that the following query can be answered without joins and subqueries: Find the number and name of each player who has, for a given team, won more sets than the average number of sets won for any team.

Choosing a Data Type for a Column

The design of a database includes the task of choosing a data type for each column. This section presents a number of guidelines that can assist you in this choice.

Guideline 6: Use the Same Data Types for Columns That Will Be Compared with One Another

In SELECT and UPDATE statements, columns are compared with one another. The columns WON and LOST are compared in the following statement:

SELECT   MATCHNO
FROM     MATCHES
WHERE    WON - 2 > LOST

In contrast to this example, the following statement compares two columns that come from different tables:

SELECT   NAME
FROM     PLAYERS INNER JOIN PENALTIES
         ON PLAYERS.PLAYERNO = PENALTIES.PLAYERNO

Two data types are the same if the data type (CHAR, SMALLINT, and so on) and the defined length are the same.

Portability

Some SQL products process statements that compare columns of different data types extremely slowly.

Guideline 7: Assign a Column a Numeric Data Type Only if It Will Be Used in Calculations

If you perform calculations on values in a column, the column must be defined with a numeric data type. Otherwise, the calculations might be impossible. Sometimes, you might be inclined to give a numeric data type to a column that records specific codes (with no intrinsic significance) that consist entirely of digits (for example, the league number). The advantage of a numeric column is that it requires little storage space. Coding systems, on the other hand, change frequently. During the design, all the codes might be numeric, but the question is whether that will always be the case. Conversions from numeric to alphanumeric values, should the need arise, are not simple. Therefore, define a column as numeric only if this is necessary for calculations.

Guideline 8: Do Not Skimp on the Length of Columns

Independent of the data type, a column must have a length defined that allows its longest value to be accommodated. Work out how long the longest value is for each column. Do not assume that the largest value is one of your existing values; think also about possible future values.

Guideline 9: Do Not Use the Data Type VARCHAR for All Alphanumeric Columns

For columns with an alphanumeric data type, there is a choice between values with a variable length and values with a fixed length. Variable length has been designed to save storage space. In the first instance, it often appears to be the best choice, but be careful: It is not always as good as it seems. Columns with a variable length have two disadvantages. First, for each value in such a column, the length of the particular value is recorded (this is not visible for users). This, of course, uses extra storage space. Second, in SELECT and UPDATE statements, columns with a variable length perform more slowly than columns with a fixed length. The general guideline is to use alphanumeric data types with a variable length only if, on average, there would be at least 15 unused positions for most values.

When Should You Use NOT NULL?

When must you specify NOT NULL behind a column in a CREATE TABLE statement?

Guideline 10: Use NOT NULL When a Column Must Contain a Value for Every Row

Never use the NULL value in an artificial manner. Never use it to represent something other than an unknown value because working with NULL values in calculations can be tricky, especially in conjunction with aggregation functions (see Chapter 9, “SELECT Statement: SELECT Clause and Aggregation Functions”).

Portability

Additionally, it must be stated that, for some products, such as DB2, for each value in a column that has not been defined NOT NULL, an extra (invisible) byte is stored. This byte is used by the product to indicate whether the value is NULL. In other words, a NOT NULL column uses less storage space than an identical column with the same data type, but without NOT NULL.

Closing remark: Many more factors and guidelines can influence the design of a database structure than we have mentioned here. Nevertheless, we have discussed the most important ones in this chapter.

Exercise 19.5:

Design a database for recording data about one-man shows. For each show, the name of the show and the name of the artist should be recorded. The location and date of each performance should also be recorded, as well as the names of participating musicians and their instruments—at most, two instruments for each musician. The musical setting (that is, musicians and instruments) will be the same for each performance of a particular show. Of course, each musician may take part in several shows.

Write the necessary CREATE TABLE statements, including primary and foreign keys. (Determine suitable data types for the columns yourself.)

Answers

19.1

The PHONENO column contains the area code and the subscription number. It is better, therefore, to replace it with two columns.

19.2

A determinant of the AREACODE column is the TOWN column. (For each town, there is a maximum of one area code.) A separate table must be created with the columns TOWN (primary key) and AREACODE. The AREACODE column then disappears from the PLAYERS table. The columns that remain are PLAYERNO, NAME, INITIALS, BIRTH_DATE, SEX, JOINED, STREET, HOUSENO, TOWN, SUBSCRIPNO, and LEAGUENO.

19.3

The MATCHES table must be extended with a column called DIVISION, in which the division in which the match has been played is recorded. The SELECT statement would then look like this:

SELECT   MATCHNO, TEAMNO, DIVISION
FROM     MATCHES

19.4

The PLAYERS table must have two columns added: WON and AVERAGE. The first column contains the total number of matches won by the player; the second column presents the average number of matches won. The statement would take the following form:

SELECT   PLAYERNO, NAME
FROM     PLAYERS
WHERE    WON > AVERAGE

19.5

CREATE TABLE PERFORMANCE
       (NAME_SHOW   CHAR(20) NOT NULL,
       LOCATION     CHAR(20) NOT NULL,
       PERF_DATE    DATE NOT NULL,
       PRIMARY KEY (NAME_SHOW, LOCATION, PERF_DATE))

CREATE TABLE SHOWS
       (NAME_SHOW  CHAR(20) NOT NULL,
       ARTIST      CHAR(20) NOT NULL,
       PRIMARY KEY (NAME_SHOW))

CREATE TABLE SETTING
       (NAME_SHOW   CHAR(20) NOT NULL,
       MUSICIAN     CHAR(20) NOT NULL,
       INSTRUMENT   CHAR(20) NOT NULL,
       PRIMARY KEY  (NAME_SHOW, MUSICIAN, INSTRUMENT))
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.223.33.157