1.2. Database Design

Activities related to “good” database design require the identification of end-user requirements and involve defining the structure of data values on a physical level. Database design begins with a conceptual view of what is needed. The next step, called logical design, consists of developing a formal description of database entities and relationships to satisfy user requirements. Seldom does a database consist of a single table. Consequently, tables of interrelated information are created to enable more complex and powerful operations on data. In the final step, referred to as physical design, the goal is to achieve optimal performance and efficient storage of the logical database.

1.2.1. Conceptual View

The health and well-being of a database depends on its database design. A database must be in balance (optimized) with all of its components to avoid performance and operation bottlenecks. Database design doesn’t just happen. It involves planning, modeling, creating, monitoring, and adjusting to satisfy the endless assortment of user requirements without exhausting available resources. Of central importance to database design is the process of planning. Planning is a valuable component that, when absent, causes a database to fall prey to a host of problems including poor performance and difficulty in operation. Database design consists of three distinct phases, as illustrated below.

1.2.2. Table Definitions

PROC SQL uses a model of data stored as sets rather than as physical files. A physical file consists of one or more records ordered sequentially or some other way. Programming languages such as COBOL and FORTRAN evolved to process files of this type by performing operations one record at a time. These languages were generally designed and used to mimic the way people process paper forms.

PROC SQL was designed to work with sets of data. Sets have no order and members of a set are of the same type using a data structure known as a table. A table is either a base table consisting of zero or more rows with one or more columns or a virtual table called a view (see Chapter 8, “Working with Views”).

1.2.3. Redundant Information

One of the rules of good database design is that data not be redundant or not be duplicated in the same database. The rationale for this is that if data appears more than once, then there is reason to believe that one of the pieces of data is likely to be in error. Another thing to watch for is the appearance of too many columns containing null values. When this occurs, the database is probably not designed properly. To alleviate potential table design issues, a process referred to as normalizing is performed. When properly done, this ensures the complete absence of redundant information in a table.

1.2.4. Normalization

Designing an optimal database design is an important element of database operations. It is also critical in achieving maximum performance and flexibility while working with tables and data. To minimize errors and duplication of data, database developers apply a concept called normalization to a logical database design.

The normalization process generally involves splitting larger multicolumn tables into two or more smaller tables containing fewer columns. The rationale for doing this is found in a set of data design guidelines called normal forms. The guidelines provide designers with a set of rules for converting one or two large database tables containing numerous columns into a normalized database consisting of multiple tables and only those columns that should be included in each table. The normalization process typically consists of no more than five steps with each succeeding step subscribing to the rules of the previous steps.

Normalizing a database helps to ensure that the database does not contain redundant information in two or more of its tables. As database designers and analysts proceed through the normalization process, many are not satisfied unless a database design is carried out to at least third normal form (3NF). Joe Celko in his popular book, SQL for Smarties: Advanced SQL Programming (Morgan Kaufmann, 1999), describes 3NF this way: “Informally, all the non-key columns are determined by the key, the whole key, and nothing but the key.”

While the normalization guidelines are extremely useful, some database purists actually go to great lengths to remove any and all table redundancies even at the expense of performance. This is in direct contrast to other database experts who follow the guidelines less rigidly in an attempt to improve the performance of a database by only going as far as the third step (or third normal form). Whatever your preference, you should keep this in mind as you normalize database tables. A fully normalized database often requires a greater number of joins and adversely affects the speed of queries. Celko mentions that the process of joining multiple tables is costly, specifically affecting processing time and computer resources.

1.2.5. Normalization Strategies

After transforming entities and attributes from the conceptual design into a logical design, the tables and columns are created. This is when a process known as normalization occurs. Normalization refers to the process of making your database tables subscribe to certain rules. Many, if not most, database designers are satisfied when third normal form (3NF) is achieved and, for the objectives of this book, I will stop at 3NF too. To help explain the various normalization steps, an example scenario will be given.

1.2.5.1. First Normal Form (1NF)

A table is considered to be in first normal form (1NF) when all of its columns describe the table completely and when each column in a row has only one value. A table satisfies 1NF when each column in a row has a single value and no repeating group information. Essentially every table meets 1NF as long as an array, list, or other structure has not been defined. The following example illustrates a table satisfying the 1NF rule because it has only one value at each row-and-column intersection. The table is in ascending order by CUSTNUM and consists of customers and the purchases they made at an office supply store.

  CUSTNUM  CUSTNAME  CUSTCITY       ITEM     UNITS
  UNITCOST  MANUCITY
        1  Smith     San Diego      Chair        1
   $179.00  San Diego
        1  Smith     San Diego      Pens        12
     $0.89  Los Angeles
        1  Smith     San Diego      Paper        4
    $76.95  Washington
        1  Smithe    San Diego      Stapler      1
     $8.95  Los Angeles
        7  Lafler    Spring Valley  Mouse Pad    1
    $11.79  San Diego
        7  Loffler   Spring Valley  Pens        24
     $1.59  Los Angeles
       13  Thompson  Miami          Markers      .
     $0.99  Los Angeles


1.2.5.2. Second Normal Form (2NF)

The very nature of leaving a table in first normal form (1NF) may present problems because of the repetition of some information in the table as shown in the example above. Another problem is that there are misspellings in the customer names. Although repeating information may be permissible with hierarchical file structures and other legacy type file structures, it does pose a potential data consistency problem as it relates to relational data.

To describe how data consistency problems can occur, let’s say that a customer takes a new job and moves to a new city. In changing the customer’s city to the new location, you might find it very easy to miss one or more occurrences resulting in a customer residing incorrectly in two different cities. Assuming that our table is only meant to track one unique customer per city, this would definitely be a data consistency problem.

Essentially, second normal form (2NF) is important because it says that every nonkey column must depend on the entire primary key.

Tables that subscribe to 2NF prevent the need to make changes in more than one place. What this means in normalization terms is that tables in 2NF have no partial key dependencies. As a result, our database consisting of a single table that satisfies 1NF will need to be split into two separate tables in order to subscribe to the 2NF rule. Each table would contain the CUSTNUM column to connect the two tables. Unlike the single table in 1NF, the tables in 2NF allow a customer’s city to be easily changed whenever they move to another city because the CUSTCITY column only appears once. The tables in 2NF would be constructed as follows.

CUSTOMERS Table
CUSTNUM   CUSTNAME    CUSTCITY
      1   Smith       San Diego
      1   Smithe      San Diego
      7   Lafler      Spring Valley
     13   Thompson    Miami

PURCHASES Table
CUSTNUM   ITEM     UNITS     UNITCOST   MANUCITY
      1   Chair        1      $179.00   San Diego
      1   Pens        12        $0.89   Los Angeles
      1   Paper        4        $6.95   Washington
      1   Stapler      1        $8.95   Los Angeles
      7   Mouse Pad    1       $11.79   San Diego
      7   Pens        24        $1.59   Los Angeles
     13   Markers      .        $0.99   Los Angeles

1.2.6. Third Normal Form (3NF)

Referring to the two tables constructed according to the rules of 2NF, you may have noticed that the PURCHASES table contains a column called MANUCITY. The MANUCITY column stores the city where the product manufacturer is headquartered. Keeping this column in the PURCHASES table violates the third normal form (3NF) because MANUCITY does not provide factual information about the primary key column in the PURCHASES table. Consequently, tables are considered to be in third normal form (3NF) when each column is “dependent on the key, the whole key, and nothing but the key.” The tables in 3NF are constructed so the MANUCITY column would be in a table of its own as follows.

CUSTOMERS Table
CUSTNUM   CUSTNAME    CUSTCITY
      1   Smith       San Diego
      1   Smithe      San Diego
      7   Lafler      Spring Valley
     13   Thompson    Miami

PURCHASES Table
CUSTNUM   ITEM     UNITS     UNITCOST
      1   Chair        1      $179.00
      1   Pens        12        $0.89
      1   Paper        4        $6.95
      1   Stapler      1        $8.95
      7   Mouse Pad    1       $11.79
      7   Pens        24        $1.59
     13   Markers      .        $0.99

MANUFACTURERS Table
MANUNUM   MANUCITY
    101   San Diego
    112   San Diego
    210   Los Angeles
    212   Los Angeles
    213   Los Angeles
    214   Los Angeles
    401   Washington

1.2.7. Beyond Third Normal Form

In general, database designers are satisfied when their database tables subscribe to the rules in 3NF. But it is not uncommon for others to normalize their database tables to fourth normal form (4NF) where independent one-to-many relationships between primary key and nonkey columns are forbidden. Some database purists will even normalize to fifth normal form (5NF) where tables are split into the smallest pieces of information in an attempt to eliminate any and all table redundancies. Although constructing tables in 5NF may provide the greatest level of database integrity, it is neither practical nor desired by most database practitioners.

There is no absolute right or wrong reason for database designers to normalize beyond 3NF as long as they have considered all the performance issues that may arise by doing so. A common problem that occurs when database tables are normalized beyond 3NF is that a large number of small tables are generated. In these cases, an increase in time and computer resources frequently occurs because small tables must first be joined before a query, report, or statistic can be produced.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.142.135.249