© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
W. D. Ashley, Foundation Db2 and Python, https://doi.org/10.1007/978-1-4842-6942-8_4

4. Database Physical Design

W. David Ashley
Austin, TX, USA

The physical design of a relational database is a two-phase project. The first phase involves the design of the tables and their relationships. The second phase lays out that table design onto the physical drives provided by the system. Both phases are necessary to get good performance from the database, but there are inevitable conflicts between the two phases that will need to be resolved during performance testing.

This chapter will cover each of these phases and highlight where possible conflicts can reduce performance of the database. Do not be put off by the concepts covered in this chapter. These topics will only be covered in the depth necessary for a good understanding of where things can go wrong. Programmers may think that this chapter does not apply to them, but they would be wrong in that assumption. Programmers need to understand the choices made by the database designers so that they can properly compose their SQL statements to make the best use of the design.

Phase 1 of the physical design of a relational database involves gathering the data and a process known as normalization of the data. Normalization is the process of organizing table data in such a way as to reduce dependencies on other data, either within the same table or in other tables. The normalization process has three steps. Each step puts a table into first normal form, second normal form, or third normal form. All of these steps involve creating tables that contain just the data needed to make them useful to the users. The depth to which that is taken determines the level of normalization.

Phase 2 of the physical design places the tables created in Phase 1, and their corresponding tablespaces, onto physical drives. A drive might be one of the newer SSD memory devices, a spinning disk, or a RAID system. Less frequently used data may even be placed on slower media or a remote system. The type of media used for storage is determined strictly by the performance requirements of the data being stored.

The basic problem for the database designer(s) is to balance the needs of the users with the system and media available for the database. This really means that both phases should be done together. Performing each phase individually risks one of the phases impacting the other phase in ways that may cause the previous phase to be partially redone.

Phase 1: Data Gathering and Normalization

This phase gathers the data that will make up the database and then normalizes it. This phase is also known as the logical model of the data.

One thing that needs to be made clear for this phase is that the users of the data need to be heavily involved in gathering and organizing the data for the database. If nothing else, this will help you discover the kind of queries that will be made on the data. The queries will help you organize the data in a way that improves query performance and creates a design that is both logical and helpful to the users.

Data Gathering

This task may involve a number of groups, especially if the data is coming from an existing source. The following is a list of potential groups that may need to be involved in gathering the data:
  • The end users of the database to be designed

    This group of people will be using the data contained by your database. They may also have insights from any existing source data for the new database. They may or may not design queries themselves, but they have expectations about the information they can gain from the data.

  • Owners of the source data for the database

    This group owns any source data to be used for the new database. If any changes or additions to that data are necessary, you will need to notify this group and hand ownership of the requested changes to it.

  • The group that maintains the physical hardware for the database

    This group will be the maintainers of the physical hardware housing the database. As the size of the new database becomes known, this group will be responsible for maintaining that hardware. Also, if hardware needs to be purchased, this group will most likely be involved in the purchasing and installation of the hardware.

  • The group responsible for data backup

    This group maintains any backup and restore images for the new database. Hardware may need to be purchased and installed for this task, so the group will need to be kept in the loop.

  • IT management for the database

    Management will need to be involved to approve any new hardware purchase and/or new personnel for database maintenance.

As you can see, there are any number of people involved in designing a database. In your organization, there may even be other groups that need to be brought in at different points in the process.

Data gathering involves an array of potential sources depending on the type of data. The sources could include other databases, data on tapes, online sources, automated inputs, manual data entry, and even sources from other companies or governmental agencies. How your sources deliver the data affects its timeliness and can influence the queries made on the database.

Once you have an idea of the data coming in, it will be time to organize it. You and the end users will need to define the entities easily recognized in the data. In addition, you may be able to reorganize the data to form new entities that better reflect its organization. Of course, when it comes time to create the table organization, you may need additional tables and/or indexes, which the users will not use directly, to maintain the data and table relationships.

Any table organization you create at this time should be considered temporary. This should be made very clear to your users. There is still a long way to go before the design is stable, and everyone should be aware of this.

Data Normalization

Normalization involves removing redundancies from your tables and removing data that is not actually an attribute of the table's entity. This is very much like designing an OOP class definition: things that are not attributes of the main table entity should be removed from the table. Usually one or more new tables will need to be created to hold the removed data.

Normalization involves three forms, or steps:
  • First normal form: The information is stored in a table with each column containing atomic values. There are no repeating groups of columns.

  • Second normal form: The table is in first normal form, and every non-key column depends on the entire primary key.

  • Third normal form: The table is in second normal form, and none of its columns is transitively dependent on the primary key.

First Normal Form

First normal form involves checking if any column is a combination of any two or more of the other columns or if there are repeated columns with the same or similar information.

For instance, a column that contains two other columns added together falls into this category. It turns out that it is very easy for the database to perform this calculation for a query, so a summary column is considered redundant. The same would apply to string data columns concatenated together to form a third column. Always let the database perform this data transformation for you. One thing this will prevent is potential data corruption when one of the columns involved in the transformation is updated with a new value.

Let’s take a look at a table that contains summary data:
CREATE TABLE employee (
      name VARCHAR(50) NOT NULL,
      . . .
      salary DECIMAL(10,2),
      bonus DECIMAL(10,2),
      total_salary DECIMAL(10,2),
      . . .
      );

When the employee row is created, the salary and bonus are known, and the input for total_salary is calculated. This is not first normal form because the table contains a field that is calculated from two other fields in the table. It is very prone to errors later on when either the salary or the bonus is updated and the total_salary is forgotten, which leaves the data inconsistent. In this case, the total_salary field should be removed, and if that value is needed later on, have the database add the salary and bonus fields together during the query. This has very little impact on the query and keeps the data consistent.
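The following is a minimal sketch of computing the total at query time against the employee table above; it assumes only the columns shown there:
-- Compute the total at query time instead of storing a redundant column
SELECT name, salary + bonus AS total_salary
      FROM employee;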

Our second example involves keeping repeated information in a table that is not directly related to the main table entity. The following is an example of such a table:
CREATE TABLE customer (
      . . .
      name VARCHAR(50) NOT NULL,
      . . .
      ordernum1 INT,
      ordernum2 INT,
      ordernum3 INT,
      . . .
      );

The idea here is that the last three orders for a customer are kept with the customer data for convenience. This is bad because the ordering of the three order numbers must be kept intact, which turns out to be really hard to do. The database can easily pull the most recent orders from the order table, so let it do its job and remove these fields from the customer table, as the query sketched below shows.
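The following is a minimal sketch of that query; it assumes a hypothetical orders table with custnum and order_date columns, which this chapter has not actually defined:
-- Fetch the three most recent orders for one customer at query time
SELECT ordernum
      FROM orders
      WHERE custnum = 12345
      ORDER BY order_date DESC
      FETCH FIRST 3 ROWS ONLY;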

Second Normal Form

Second normal form removes any column that does not depend on the table’s entire primary key. This prevents partial dependencies, where a column depends on only part of a composite key. For instance, you would not store a customer order in the table that stores the customer information; a separate order table should be created to hold that information. After all, a customer may have multiple active orders at any time.

The following is an example of a violation of second normal form:
CREATE TABLE part (
      partnum VARCHAR(50) NOT NULL,
      warehouse VARCHAR(50) NOT NULL,
      quantity INT NOT NULL,
      warehouse_addr VARCHAR(50) NOT NULL,
      PRIMARY KEY (partnum, warehouse)
      );

In this example, warehouse_addr is not an attribute of part. It depends only on warehouse, which is just part of the composite primary key, so it belongs in a separate warehouse table, not in the part table. One way to resolve this is sketched below.
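The following is a minimal sketch of one way to remove the partial dependency; the warehouse table shown here is an assumption for illustration, not a design from this chapter:
-- The warehouse address lives with the warehouse entity
CREATE TABLE warehouse (
      warehouse VARCHAR(50) NOT NULL PRIMARY KEY,
      warehouse_addr VARCHAR(50) NOT NULL
      );
-- The part table keeps only attributes of the (partnum, warehouse) pair
CREATE TABLE part (
      partnum VARCHAR(50) NOT NULL,
      warehouse VARCHAR(50) NOT NULL,
      quantity INT NOT NULL,
      PRIMARY KEY (partnum, warehouse),
      FOREIGN KEY (warehouse) REFERENCES warehouse (warehouse)
      );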

Third Normal Form

Third normal form is a little harder to recognize. A good example would be a hospital database where patient data is stored. If the patient table contains the doctor’s phone number, that data does not depend on the patient; it belongs to the doctor table. The patient table should hold only a key reference to the doctor, not attributes of the doctor entity. Removing such a column is known as removing a transitive dependency.

The following example shows how a violation of third normal form can cause data integrity problems:
CREATE TABLE employee (
      emp_num INT NOT NULL PRIMARY KEY,
      first_name VARCHAR(20) NOT NULL,
      last_name VARCHAR(20) NOT NULL,
      work_dept CHAR(10) NOT NULL,
      dept_name VARCHAR(20) NOT NULL
      );

If the dept_name needs to be changed (but not the work_dept) and you miss a single employee in that department, you now have inconsistent data. The dept_name belongs in the department table, not the employee table, so that column should be removed, as the sketch below shows.
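The following is a minimal sketch of the normalized layout; the department table and its columns are assumptions used only for illustration:
-- The department name lives with the department entity
CREATE TABLE department (
      dept_num CHAR(10) NOT NULL PRIMARY KEY,
      dept_name VARCHAR(20) NOT NULL
      );
-- The employee table keeps only a key reference to the department
CREATE TABLE employee (
      emp_num INT NOT NULL PRIMARY KEY,
      first_name VARCHAR(20) NOT NULL,
      last_name VARCHAR(20) NOT NULL,
      work_dept CHAR(10) NOT NULL,
      FOREIGN KEY (work_dept) REFERENCES department (dept_num)
      );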

One last thing to point out is that third normal form, when fully implemented, can cause performance problems. Many database designers have taken their designs to third normal form only to find that they had to fall back and denormalize, keeping some data directly in a table that is not in third normal form, to get acceptable performance for the users.

Fourth Normal Form

There is a fourth normal form, but we will not discuss it here, as it requires yet more tables to be created to resolve the problem. This can sometimes lead to confusion and frustration for the users creating queries. The hassle may or may not be worth it for your database. The exception might be really large databases where space is at a premium, because fourth normal form can reduce the space needed by the database.

Business Rules

Once you have the major tables designed, it is time to consider the business rules that will be needed to automatically keep tables in sync. These rules ease the process of adding to and updating your tables. Several tables may have direct relationships that need to be maintained; if a new row is added to one table, dependent rows in other tables may also need to be maintained. Business rules can ease this task by making automatic updates, additions, or deletions when a dependent table is updated.

Db2 has many facilities to help in these cases. There are so many of them that we have devoted a whole chapter to cover them. Refer to Chapter 6.
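As a small preview, one common business rule is a referential constraint that automatically removes dependent rows. The following is a minimal sketch; the orders table, its columns, and a customer table keyed on custnum are assumptions for illustration only:
CREATE TABLE orders (
      ordernum INT NOT NULL PRIMARY KEY,
      custnum INT NOT NULL,
      order_date DATE NOT NULL,
      -- deleting a customer automatically removes that customer's orders
      FOREIGN KEY (custnum) REFERENCES customer (custnum) ON DELETE CASCADE
      );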

Phase 2: Physical Design of the Database

Once you have the tables, indexes, and business rules designed, you need to lay out the tables on a set of disk drives. There are a number of ways to do this depending on the hardware you have to work with and the size of your database.
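In Db2, that layout is typically expressed through storage groups and tablespaces. The following is a minimal sketch; the paths, the names, and the sales table are purely hypothetical:
-- Define a storage group over two physical disk paths
CREATE STOGROUP fast_storage ON '/disk1/db2data', '/disk2/db2data';
-- Create a tablespace on that storage group and place a table in it
CREATE TABLESPACE sales_ts MANAGED BY AUTOMATIC STORAGE USING STOGROUP fast_storage;
CREATE TABLE sales (
      order_id INT NOT NULL PRIMARY KEY,
      amount DECIMAL(10,2)
      ) IN sales_ts;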

A small database, one up to a few gigabytes, can easily be placed on a single disk drive without a major performance penalty. However, the task gets harder the larger the database becomes. If you start with a small database and it begins to grow, then performance becomes a major concern. So not only should you consider the size of the database today, you also need to be aware of its potential growth rate and design an appropriate solution.

In the old days, when disk drives were rather small and limited in performance, even a small database needed to be designed around the performance of the disk drives. A single table might need to span several drives so that they could work together asynchronously on a query. Today, drives are enormous, some 8 and 10 TB in size and getting larger all the time. Beware of these drives: trying to use all the space available on them can cause huge performance impacts simply because of the seek time on the disk.

A lot of the problems with big disk drives can be resolved to some extent by creating a RAID array. RAID 5 and RAID 6 are fairly good alternatives, where the data in a table can be spread across all the drives in the array. Thus, if eight drives make up the array, then all eight might be tasked to work on retrieving data from a table. But there are potential problems with this solution. In the early days, there were big performance penalties when writing data to an array. These penalties have been partly eliminated over the years, but in some respects they still exist.

RAID 6 extends RAID 5 with a second, independent parity stripe so that the array can survive two simultaneous drive failures. The main penalty of RAID 6 is the extra storage consumed by that additional parity, along with somewhat higher write overhead. On today's larger disks, the storage overhead can largely be ignored; the drives can easily accommodate it.

The solution that mainframes use is to dedicate a single drive, or a set of drives, to storing one large table. But this is not a good solution for small and midsize systems, because with the drives of today there will be a lot of wasted space on the drive(s).

One relatively new kind of drive that is now available can resolve some of these problems: SSDs and other memory-based drives. These drives have outstanding performance for both reading and writing. However, they are still relatively small, and the cost of purchasing a large number of them can be prohibitive. But in the future, these may become the drive of choice for many databases because the performance is so good.

Backups

This is the time to start thinking about backups of your database. Don't put this task off, as that will cause you headaches and problems later on. If you are doing backups to disk, you will require additional disk space to accommodate them. For tape, you may need a dedicated tape drive for the database. The size of your database may determine your backup options. All of this needs to be considered up front during the design process, not later.

Db2 has a dedicated BACKUP DATABASE command run from the db2 prompt. You should use this command if at all possible, as it has encryption and compression options and can perform incremental backups. It has options for full, incremental, and delta backups at both the database and tablespace levels. It can perform backups to any device as long as it is mounted to the file system. The one exception is performing a backup to another remote Db2 database, which can be done over a number of connections to the local system.
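The following is a minimal sketch of the command run from the db2 prompt; the database name sample and the path /db2backups are assumptions, and the options available (such as online backups) depend on your Db2 version and configuration:
db2 backup database sample to /db2backups compress
db2 backup database sample online incremental to /db2backups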

Do not try to finalize your backup solution during the planning process. The idea is to size the backup requirement and select an option that can be finalized later.

Summary

This chapter has given a high-level overview of the tasks necessary to collect the data and then perform a first-level logical and physical design of your database. We covered the major topics for accomplishing this goal, with the appropriate people making contributions to the effort.

It should be remembered that this has been only a short review. You should consult additional material should you actually be tasked with this job. What we have presented here is only a guide to the high-level tasks.
