Chapter 5

Normalization

Abstract

Normal forms are formal criteria for designing the tables of a relational database so that redundancies and data anomalies are avoided.

Keywords

Normal forms

Dr. E.F. Codd

Codd's 12 rules

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Elementary Key Normal Form (EKNF)

Boyce-Codd Normal Form (BCNF)

Fifth Normal Form (5NF)

Domain-Key Normal Form (DKNF)

Non-Normal Form Redundancy

Attribute Splitting

The Relational Model and the Normal Forms of the Relational Model were first defined by Dr. E.F. Codd (1970) and then extended by other writers after him. He invented the term “normalized relations” by borrowing from the political jargon of the day. A branch of mathematics called the theory of relations deals with mappings among sets defined by predicate calculus from formal logic. Just as in an algebraic equation, there are many forms of the same relational statement, but the “normal forms” of relations are certain formally defined desirable constructions. The goal of normal forms is to avoid certain data anomalies that can occur in unnormalized tables. Data anomalies are easier to explain with an example, but first please be patient while I define some terms. A predicate is a statement of the form A(X), which means that X has the property A. For example, “John is from Indiana” is a predicate statement; here “John” is the subject and “is from Indiana” is the predicate. A relation is a predicate with two or more subjects. “John and Bob are brothers” is an example of a relation. The common way of visualizing a set of relational statements is as a table where the columns are attributes of the relation and each row is a specific relational statement.

When Dr. Codd defined the relational model, he gave 12 rules, numbered 1 through 12 plus a rule zero, for the visualization of the relation as a table:

0. (Yes, there is a rule zero.) For a system to qualify as a relational database management system, that system must use its relational facilities (exclusively) to manage the database. SQL is not so pure on this rule, since you can often do procedural things to the data.

1. The information rule: this simply requires all information in the database to be represented in one and only one way, namely by values in columns within rows of tables. SQL is good here, but columns are found by their names and not by their positions in a row in a strict RDBMS model. SQL allows the use of a * as shorthand for a list of column names.

2. The guaranteed access rule: This rule is essentially a restatement of the fundamental requirement for primary keys. It states that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column, and a key value of the containing row. SQL follows this rule for tables that have a key, but SQL does not require a table to have a key at all.

3. Systematic treatment of NULL values: The DBMS is required to support a representation of missing information and inapplicable information that is systematic, distinct from all regular values, and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way. SQL has a NULL that is used for both missing information and inapplicable information, rather than having two separate tokens as Dr. Codd wished in his second version of the Relational Model (Codd, E.F., The Relational Model for Database Management: Version 2, Addison-Wesley, 1990; ISBN: 978-0201141924).

4. Active online catalog based on the relational model: The system is required to support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. SQL does this.

5. The comprehensive data sublanguage rule: The system must support at least one relational language that (a) has a linear syntax, (b) can be used both interactively and within application programs, and (c) supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).
SQL is pretty good on this point, since all of the operations Codd defined can be written in the DML (Data Manipulation Language).

6. The VIEW updating rule: All views that are theoretically updatable must be updatable by the system. SQL is weak here and has elected to standardize on the safest cases. Deciding view updatability in general is now known to be an NP-complete problem, so it cannot be fully enforced in practice. INSTEAD OF triggers in SQL allow solutions for particular schemas.

7. High-level insert, update, and delete: The system must support set-at-a-time INSERT, UPDATE, and DELETE operators. SQL does this.

8. Physical data independence: This is self-explanatory; users are never aware of the physical implementation and deal only with a logical model. Any real product is going to have some physical dependence, but SQL is better than most programming languages on this point. In particular, auto-incrementing row identifiers based on physical insertions into a table like the IDENTITY table property in MS SQL Server are in total violation of this rule.

9. Logical data independence: This is self-explanatory. SQL is quite good about this point until you start using vendor extensions.

10. Integrity independence: Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications. Standard SQL has this.

11. Distribution independence: Existing applications should continue to operate successfully (a) when a distributed version of the DBMS is first introduced and (b) when existing distributed data are redistributed around the system. We are just starting to get distributed versions of SQL, so it is a little early to say whether SQL will meet this criterion or not.

12. The nonsubversion rule: If the system provides a low-level (record-at-a-time, bit level) interface, that interface cannot be used to subvert the system (e.g., bypassing a relational security or integrity constraint). SQL is good about this one.

Codd also specified 9 structural features, 3 integrity features, and 18 manipulative features, all of which are required as well. He later extended the list from 12 rules to 333 in the second version of the relational model. This section is getting too long and you can look them up for yourself.

Normal forms are an attempt to make sure that you do not destroy true data or create false data in your database. One of the ways of avoiding errors is to represent a fact only once in the database, since if a fact appears more than once, one of the instances of it is likely to be in error—a man with two watches can never be sure what time it is.

This process of table design is called normalization. It is not mysterious, but it can get complex. You can buy CASE tools to help you do it, but you should know a bit about the theory before you use such a tool.

5.1 Functional and Multivalued Dependencies

A normal form is a way of classifying a table based on the functional dependencies (FDs) in it. A functional dependency means that if I know the value of one attribute, I can always determine the value of another. The notation used in relational theory is an arrow between the two attributes, for example A arrow B, which can be read in English as “A determines B.” If I know your employee number, I can determine your name; if I know a part number, I can determine the weight and color of the part; and so forth.

A multivalued dependency (MVD) means that if I know the value of one attribute, I can always determine the values of a set of another attribute. The notation used in relational theory is a double-headed arrow between the two attributes, for instance A→B, which can be read in English as “A determines many Bs.” If I know a teacher’s name, I can determine a list of her students; if I know a part number, I can determine the part numbers of its components; and so forth.

5.2 First Normal Form (1NF)

Consider a requirement to maintain data about class schedules at a school. We are required to keep the course_name, class_section, dept_name, time_period, room_nbr, room_size, professor, student, student_major, and student_grade. Suppose that we initially set up a Pascal file with records that look like this:

Classes = RECORD
course_name: ARRAY [1..7] OF CHAR;
class_section: CHAR;
time_period: INTEGER;
room_nbr: INTEGER;
room_size: INTEGER;
professor: ARRAY [1..25] OF CHAR;
dept_name: ARRAY [1..10] OF CHAR;
students: ARRAY [1..class_size]
OF RECORD
student_name: ARRAY [1..25] OF CHAR;
student_major: ARRAY [1..10] OF CHAR;
student_grade: CHAR;
END;
END;

This table is not in the most basic normal form of relational databases. First Normal Form (1NF) means that the table has no repeating groups. That is, every column is a scalar value, not an array or a list or anything with its own structure.

In SQL, it is impossible not to be in 1NF unless the vendor has added array or other extensions to the language. The Pascal record could be “flattened out” in SQL and the field names changed to data element names to look like this:

CREATE TABLE Classes
(course_name CHAR(7) NOT NULL,
class_section CHAR(1) NOT NULL,
time_period INTEGER NOT NULL,
room_nbr INTEGER NOT NULL,
room_size INTEGER NOT NULL,
professor_name CHAR(25) NOT NULL,
dept_name CHAR(10) NOT NULL,
student_name CHAR(25) NOT NULL,
student_major CHAR(10) NOT NULL,
student_grade CHAR(1) NOT NULL);

This table is acceptable to SQL. In fact, we can locate a row in the table with a combination of (course_name, class_section, student_name), so we have a key. But what we are doing is hiding the Students record array, which has not changed its nature by being flattened.

There are problems.

If Professor ‘Jones’ of the math department dies, we delete all his rows from the Classes table. This also deletes the information that all his students were taking a math class and maybe not all of them wanted to drop out of school just yet. I am deleting more than one fact from the database. This is called a deletion anomaly.

If student ‘Wilson’ decides to change one of his math classes, formerly taught by Professor ‘Jones,’ to English, we will show Professor ‘Jones’ as an instructor in both the math and the English departments. I could not change a simple fact by itself. This creates false information and is called an update anomaly.

If the school decides to start a new department, which has no students yet, we cannot put in the data about the professor we just hired until we have classroom and student data to fill out a row. I cannot insert a simple fact by itself. This is called an insertion anomaly.

There are more problems in this table, but you see the point. Yes, there are some ways to get around these problems without changing the tables. We could permit NULLs in the table. We could write triggers to check the table for false data. These are tricks that will only get worse as the data and the relationships become more complex. The solution is to break the table up into other tables, each of which represents one relationship or simple fact.

5.2.1 Note on Repeating Groups

The definition of 1NF is that the table has no repeating groups and that all columns are scalar values. This means a column cannot hold arrays, linked lists, tables within tables, or record structures, like those you find in other programming languages. This was very easy to avoid in Standard SQL, since the language had no support for them. It is no longer true after SQL-99, which introduced several very nonrelational “features,” and several vendors have added their own support for arrays, nested tables, and variant data types.

Aside from relational purity, there are good reasons to avoid these features. They are not widely implemented, and the vendor specific extensions will not port. Furthermore, the optimizers cannot easily use them, so they degrade performance.

Old habits are hard to change, so new SQL programmers often try to force their old model of the world into Standard SQL in several ways.

5.2.1.1 Repeating Columns

One way you “fake it” in SQL is to use a group of columns in which all the members of the group have the same semantic value; that is, they represent the same attribute in the table. Consider this table of an employee and his children:

CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY,
emp_name VARCHAR(30) NOT NULL,
. . .
child1 CHAR(30), birthday1 DATE, sex1 CHAR(1),
child2 CHAR(30), birthday2 DATE, sex2 CHAR(1),
child3 CHAR(30), birthday3 DATE, sex3 CHAR(1),
child4 CHAR(30), birthday4 DATE, sex4 CHAR(1));

This looks like the layouts of many existing file system records in COBOL and other 3GL languages. The birthday and sex information for each child is part of a repeated group and therefore violates 1NF. This is faking a four-element array in SQL; the index just happens to be part of the column name!
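The repair is to move the repeating group into a table of its own, one row per child. Here is a minimal sketch, run in SQLite from Python; the Dependents table, its column names, and the sample rows are invented for illustration:

```python
import sqlite3

# Sketch of the repaired design: the repeating (childN, birthdayN, sexN)
# group becomes one row per child in a separate table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY,
 emp_name VARCHAR(30) NOT NULL);

CREATE TABLE Dependents
(emp_nbr INTEGER NOT NULL REFERENCES Personnel(emp_nbr),
 child_name CHAR(30) NOT NULL,
 birth_date DATE NOT NULL,
 sex_code CHAR(1) NOT NULL,
 PRIMARY KEY (emp_nbr, child_name));
""")
conn.execute("INSERT INTO Personnel VALUES (1, 'Smith')")
# Any number of children, with no NULL-padded columns:
conn.executemany("INSERT INTO Dependents VALUES (?, ?, ?, ?)",
                 [(1, 'Ann', '2001-05-01', 'F'),
                  (1, 'Bob', '2003-07-12', 'M')])
print(conn.execute("SELECT COUNT(*) FROM Dependents").fetchone()[0])  # → 2
```

Notice that a fifth child needs no schema change, only another row.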

Suppose I have a table with the quantity of a product_name sold in each month of a particular year and I originally built the table to look like this:

CREATE TABLE Abnormal
(product_name CHAR(10) NOT NULL PRIMARY KEY,
month_01 INTEGER,  --  null means no data yet
month_02 INTEGER,
. . .
month_12 INTEGER);

and I wanted to flatten it out into a more normalized form, like this:

CREATE TABLE Normal
(product_name CHAR(10) NOT NULL,
month_nbr INTEGER NOT NULL,
product_qty INTEGER NOT NULL,
PRIMARY KEY (product_name, month_nbr));

I can use the statement

INSERT INTO Normal (product_name, month_nbr, product_qty)
SELECT product_name, 1, month_01
FROM Abnormal
WHERE month_01 IS NOT NULL
UNION ALL
SELECT product_name, 2, month_02
FROM Abnormal
WHERE month_02 IS NOT NULL
. . .
UNION ALL
SELECT product_name, 12, month_12
FROM Abnormal
WHERE month_12 IS NOT NULL;

While a UNION ALL expression is usually slow, this has to be run only once to load the normalized table and then the original table can be dropped.
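As a sanity check, the same unpivot can be run in SQLite from Python; this sketch trims the table to three month columns and invents one sample row:

```python
import sqlite3

# The UNION ALL unpivot from the text, cut down to three months.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Abnormal
(product_name CHAR(10) NOT NULL PRIMARY KEY,
 month_01 INTEGER, month_02 INTEGER, month_03 INTEGER);

CREATE TABLE Normal
(product_name CHAR(10) NOT NULL,
 month_nbr INTEGER NOT NULL,
 product_qty INTEGER NOT NULL,
 PRIMARY KEY (product_name, month_nbr));

INSERT INTO Abnormal VALUES ('widget', 10, NULL, 30);

INSERT INTO Normal (product_name, month_nbr, product_qty)
SELECT product_name, 1, month_01 FROM Abnormal WHERE month_01 IS NOT NULL
UNION ALL
SELECT product_name, 2, month_02 FROM Abnormal WHERE month_02 IS NOT NULL
UNION ALL
SELECT product_name, 3, month_03 FROM Abnormal WHERE month_03 IS NOT NULL;
""")
rows = conn.execute(
    "SELECT month_nbr, product_qty FROM Normal ORDER BY month_nbr").fetchall()
print(rows)  # → [(1, 10), (3, 30)]; the NULL month is skipped
```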

5.2.1.2 Parsing a List in a String

Another popular method is to use a string and fill it with a comma-separated list. The result is a lot of string handling procedures to work around this kludge. Consider this example:

CREATE TABLE InputStrings
(key_col CHAR(10) NOT NULL PRIMARY KEY,
input_string VARCHAR(255) NOT NULL);
INSERT INTO InputStrings VALUES ('first', '12,34,567,896');
INSERT INTO InputStrings VALUES ('second', '312,534,997,896');
. . .

This will be the table that gets the outputs, in the form of the original key column and one parameter per row.

CREATE TABLE Parmlist
(key_col CHAR(10) NOT NULL,
parm INTEGER NOT NULL,
PRIMARY KEY (key_col, parm));

It makes life easier if the lists in the input strings start and end with a comma. You will also need a table called Series, which is a set of integers from 1 to (n).

SELECT I1.key_col,
CAST (SUBSTRING (',' || I1.input_string || ',', MAX(S1.seq) + 1,
(S2.seq - MAX(S1.seq) - 1))
AS INTEGER) AS parm,
COUNT(S2.seq) AS place
FROM InputStrings AS I1, Series AS S1, Series AS S2
WHERE SUBSTRING (',' || I1.input_string || ',', S1.seq, 1) = ','
AND SUBSTRING (',' || I1.input_string || ',', S2.seq, 1) = ','
AND S1.seq < S2.seq
AND S2.seq <= CHAR_LENGTH(I1.input_string) + 2
GROUP BY I1.key_col, I1.input_string, S2.seq;

The S1 and S2 copies of Series are used to locate bracketing pairs of commas, and all of the substrings located between them are extracted and cast as integers in one nonprocedural step.

The trick is to be sure that the left-hand comma of the bracketing pair is the closest one to the second comma. The place column tells you the relative position of the value in the input string.

A very fast version of this trick is due to Ken Henderson. Instead of using a comma to separate the fields within the list, put each value into a fixed length substring and extract them by using a simple multiplication of the length by the desired array index number. This is a direct imitation of how many compilers handle arrays at the hardware level.
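A sketch of that positional trick, run in SQLite from Python; the four-character field width and the packed sample string are invented for illustration:

```python
import sqlite3

# Fixed-length fields instead of commas: element (i) of a list packed in
# 4-character fields starts at position (i - 1) * 4 + 1, so extraction is
# plain arithmetic, with no comma searching at all.
conn = sqlite3.connect(":memory:")
packed = '0012003405670896'  # four 4-digit fields: 12, 34, 567, 896
element = conn.execute(
    "SELECT CAST(SUBSTR(?, (? - 1) * 4 + 1, 4) AS INTEGER)",
    (packed, 3)).fetchone()[0]
print(element)  # third field → 567
```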

Having said all of this, the right way would be to put the list into a single column in a table. This can be done in languages that allow you to pass array elements into SQL parameters, like this:

INSERT INTO Parmlist
VALUES (:a[1]), (:a[2]), (:a[3]), .., (:a[n]);

Or if you want to remove NULLs and duplicates

INSERT INTO Parmlist
SELECT DISTINCT x
FROM (VALUES (:a[1]), (:a[2]), (:a[3]), .., (:a[n])) AS List(x)
WHERE x IS NOT NULL;

5.3 Second Normal Form (2NF)

A table is in Second Normal Form (2NF) if it is in 1NF and has no partial key dependencies. That is, if X and Y are columns and X is a key, then for any Z that is a proper subset of X, it cannot be the case that Z → Y. Informally, the table is in 1NF and it has a key that determines all nonkey attributes in the table.

In the Pascal example, our users tell us that knowing the student_name and course_name is sufficient to determine the class_section (since students cannot sign up for more than one class_section of the same course_name) and the student_grade. This is the same as saying that (student_name, course_name) → (class_section, student_grade).

After more analysis, we also discover from our users that student_name → student_major; students have only one student_major. Since student_name is part of the (student_name, course_name) key, we have a partial key dependency! This leads us to the following decomposition:

CREATE TABLE Classes
(course_name CHAR(7) NOT NULL,
class_section CHAR(1) NOT NULL,
time_period INTEGER NOT NULL,
room_nbr INTEGER NOT NULL,
room_size INTEGER NOT NULL,
professor_name CHAR(25) NOT NULL,
PRIMARY KEY (course_name, class_section));
CREATE TABLE Enrollment
(student_name CHAR(25) NOT NULL,
course_name CHAR(7) NOT NULL,
class_section CHAR(1) NOT NULL,
student_grade CHAR(1) NOT NULL,
PRIMARY KEY (student_name, course_name));
CREATE TABLE Students
(student_name CHAR(25) NOT NULL PRIMARY KEY,
student_major CHAR(10) NOT NULL);

At this point, we are in 2NF. Every attribute depends on the entire key in its table. Now if a student changes majors, it can be done in one place. Furthermore, a student cannot sign up for different sections of the same class, because we have changed the key of Enrollment. Unfortunately, we still have problems.

Notice that while room_size depends on the entire key of Classes, it also depends on room_nbr. If the room_nbr is changed for a course_name and class_section, we may also have to change the room_size, and if the room_size of a room is modified (we knock down a wall), we may have to change the room_size in several rows in Classes for that room_nbr.

5.4 Third Normal Form (3NF)

Another normal form can address these problems. A table is in Third Normal Form (3NF) if it is in 2NF and for all X → Y, where X and Y are columns of a table, X is a key or Y is part of a candidate key. (A candidate key is a unique set of columns that identify each row in a table; you cannot remove a column from the candidate key without destroying its uniqueness.) This implies that the table is in 2NF, since a partial key dependency is a type of transitive dependency. Informally, all the nonkey columns are determined by the key, the whole key, and nothing but the key.

The usual way that 3NF is explained is that there are no transitive dependencies, but this is not quite right. A transitive dependency is a situation where we have a table with columns (A, B, C) and (A → B) and (B → C), so we know that (A → C). In our case, the situation is that (course_name, class_section) → room_nbr and room_nbr → room_size. This is not a simple transitive dependency, since only part of a key is involved, but the principle still holds. To get our example into 3NF and fix the problem with the room_size column, we make the following decomposition:

CREATE TABLE Rooms
(room_nbr INTEGER NOT NULL PRIMARY KEY,
room_size INTEGER NOT NULL);
CREATE TABLE Students
(student_name CHAR(25) NOT NULL PRIMARY KEY,
student_major CHAR(10) NOT NULL);
CREATE TABLE Classes
(course_name CHAR(7) NOT NULL,
class_section CHAR(1) NOT NULL,
PRIMARY KEY (course_name, class_section),
time_period INTEGER NOT NULL,
room_nbr INTEGER NOT NULL
REFERENCES Rooms(room_nbr));
CREATE TABLE Enrollment
(student_name CHAR(25) NOT NULL
REFERENCES Students(student_name),
course_name CHAR(7) NOT NULL,
PRIMARY KEY (student_name, course_name),
class_section CHAR(1) NOT NULL,
FOREIGN KEY (course_name, class_section)
REFERENCES Classes(course_name, class_section),
student_grade CHAR(1) NOT NULL);

A common misunderstanding about relational theory is that 3NF tables have no transitive dependencies. As indicated above, if X → Y, X does not have to be a key if Y is part of a candidate key. We still have a transitive dependency in the example—(room_nbr, time_period) → (course_name, class_section)—but since the right side of the dependency is a key, it is technically in 3NF. The unreasonable behavior that this table structure still has is that several course_names can be assigned to the same room_nbr at the same time.

Another form of transitive dependency is a computed column. For example:

CREATE TABLE Boxes
(width INTEGER NOT NULL,
length INTEGER NOT NULL,
height INTEGER NOT NULL,
volume INTEGER NOT NULL
CHECK (width * length * height = volume),
PRIMARY KEY (width, length, height));

The volume column is determined by the other three columns, so any change to one of the three columns will require a change to the volume column. You can instead declare a computed column; the exact syntax varies by product, but the Standard SQL form looks like this:

(volume INTEGER GENERATED ALWAYS AS (width * length * height))

5.5 Elementary Key Normal Form (EKNF)

Elementary Key Normal Form (EKNF) is a subtle enhancement on 3NF. By definition, EKNF tables are also in 3NF. EKNF matters when there is more than one unique composite key and the keys overlap. Such cases can cause redundant information in the overlapping column(s). For example, in the following table, let’s assume that a course code is also a unique identifier for a given subject:

CREATE TABLE Enrollment
(student_id INTEGER NOT NULL,
course_code CHAR(6) NOT NULL,
course_name VARCHAR(15) NOT NULL,
PRIMARY KEY (student_id, course_name)
-- , UNIQUE (student_id, course_code) alternative key
);
Enrollment
student_id course_code course_name
=======================================
1 'CS-100' 'ER Diagrams'
1 'CS-114' 'Database Design'
2 'CS-114' 'Database Design'

This table, although it is in 3NF, violates EKNF. The primary key of the table is the combination of (student_id, course_name), but (student_id, course_code) is an alternate key as well. This schema can produce update and deletion anomalies, because the pairing of a course_name and course_code is repeated for every student enrolled in the subject.

The following schema is a decomposition of the above table in order to satisfy EKNF:

CREATE TABLE Subjects
(course_code CHAR(6) NOT NULL PRIMARY KEY,
course_name VARCHAR(15) NOT NULL);
CREATE TABLE Enrollment
(student_id INTEGER NOT NULL
REFERENCES Students(student_id),
course_code CHAR(6) NOT NULL
REFERENCES Subjects(course_code),
PRIMARY KEY (student_id, course_code));

For reasons that will become obvious in the following section, ensuring that a table is in EKNF is usually skipped, as most designers will move directly on to Boyce-Codd Normal Form after ensuring that a schema is in 3NF. Thus, EKNF is included here only for reasons of historical accuracy and completeness.

5.6 Boyce-Codd Normal Form (BCNF)

A table is in BCNF when for all nontrivial FDs (X → A), X is a superkey of the table. A superkey is a set of columns that uniquely identifies each row in a table, but from which you may be able to remove some columns and still have a key. Informally, a superkey is carrying extra weight.

BCNF is the normal form that actually removes all transitive dependencies. A table is in BCNF if for all (X → Y), X is a key, period. We can get to this normal form just by adding another candidate key with a UNIQUE (room_nbr, time_period) constraint clause to the table Classes.
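Here is a sketch of that fix, run in SQLite from Python, with the Classes table trimmed to the columns involved; the sample rows are invented:

```python
import sqlite3

# The extra candidate key keeps two classes out of one room at the same
# time: UNIQUE (room_nbr, time_period).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Classes
(course_name CHAR(7) NOT NULL,
 class_section CHAR(1) NOT NULL,
 time_period INTEGER NOT NULL,
 room_nbr INTEGER NOT NULL,
 PRIMARY KEY (course_name, class_section),
 UNIQUE (room_nbr, time_period));
""")
conn.execute("INSERT INTO Classes VALUES ('CS-101', 'A', 1, 222)")
try:
    # Same room, same time period, different course: rejected.
    conn.execute("INSERT INTO Classes VALUES ('CS-102', 'A', 1, 222)")
    double_booked = True
except sqlite3.IntegrityError:
    double_booked = False
print(double_booked)  # → False
```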

There are some other interesting and useful “higher” normal forms, but they are outside of the scope of this discussion. In our example, we have removed all of the important anomalies with BCNF.

Third Normal Form was concerned with the relationship between key and nonkey columns. However, a column can often play both roles. Consider a table for computing each salesman’s bonus gifts that has for each salesman his base salary, the number of gift_points he has won in a contest, and the bonus gift awarded for that combination of salary range and gift_points. For example, we might give a fountain pen to a beginning salesman with a base pay rate between $15,000.00 and $20,000.00 and 100 gift_points, but give a car to a master salesman, whose salary is between $30,000.00 and $60,000.00 and who has 200 gift_points. The functional dependencies are, therefore,

(salary_amt, gift_points) → gift_name

gift_name → gift_points

Let’s start with a table that has all the data in it and normalize it.

Gifts
salary_amt gift_points gift_name
=================================
15000.00 100 'Pencil'
17000.00 100 'Pen'
30000.00 200 'Car'
31000.00 200 'Car'
32000.00 200 'Car'
CREATE TABLE Gifts
(salary_amt DECIMAL(8,2) NOT NULL,
gift_points INTEGER NOT NULL,
PRIMARY KEY (salary_amt, gift_points),
gift_name VARCHAR(10) NOT NULL);

This schema is in 3NF, but it still has problems. You cannot insert a new gift and its point requirement unless we have a salary to go with it. If you delete rows, you can lose facts about the gifts and salaries (e.g., that only people in the $30,000.00 to $32,000.00 range can win a car). And, finally, a change in the gift for a particular point score would have to be repeated in all the rows with that point score. This table needs to be broken apart into two tables:

PayGifts
salary_amt gift_name
=====================
15000.00 'Pencil'
17000.00 'Pen'
30000.00 'Car'
31000.00 'Car'
32000.00 'Car'

CREATE TABLE PayGifts
(salary_amt DECIMAL(8,2) NOT NULL,
gift_name VARCHAR(10) NOT NULL,
PRIMARY KEY (salary_amt, gift_name));

GiftsPoints
gift_name gift_points
======================
'Pencil' 100
'Pen' 100
'Car' 200

CREATE TABLE GiftsPoints
(gift_name VARCHAR(10) NOT NULL PRIMARY KEY,
gift_points INTEGER NOT NULL);

5.7 Fourth Normal Form (4NF)

Fourth Normal Form (4NF) makes use of multivalued dependencies. The problem it solves is that the table has too many of them. For example, consider a table of departments, their projects, and the parts they stock. The MVD’s in the table would be:

dept_name →→ jobs
dept_name →→ parts

Assume that dept_name 'd1' works on jobs 'j1' and 'j2' with parts 'p1' and 'p2'; that dept_name 'd2' works on jobs 'j3', 'j4', and 'j5' with parts 'p2' and 'p4'; and that dept_name 'd3' works on job 'j2' only, with parts 'p5' and 'p6'. The table would look like this:

dept_name job part
===================
'd1' 'j1' 'p1'
'd1' 'j1' 'p2'
'd1' 'j2' 'p1'
'd1' 'j2' 'p2'
'd2' 'j3' 'p2'
'd2' 'j3' 'p4'
'd2' 'j4' 'p2'
'd2' 'j4' 'p4'
'd2' 'j5' 'p2'
'd2' 'j5' 'p4'
'd3' 'j2' 'p5'
'd3' 'j2' 'p6'

If you want to add a part to a dept_name, you must create more than one new row.

Likewise, to remove a part or a job from a row can destroy information. Updating a part or job name will also require multiple rows to be changed.

The solution is to split this table into two tables, one with (dept_name, jobs) in it and one with (dept_name, parts) in it. The definition of 4NF is that we have no more than one MVD in a table. If a table is in 4NF, it is also in BCNF.
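A sketch of that split, run in SQLite from Python; the table and column names (DeptJobs, DeptParts) are my own invention for the example:

```python
import sqlite3

# One table per MVD: (dept_name, job) and (dept_name, part).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DeptJobs
(dept_name CHAR(10) NOT NULL,
 job CHAR(10) NOT NULL,
 PRIMARY KEY (dept_name, job));

CREATE TABLE DeptParts
(dept_name CHAR(10) NOT NULL,
 part CHAR(10) NOT NULL,
 PRIMARY KEY (dept_name, part));

-- 'd1' now needs two rows for its jobs and two for its parts,
-- instead of four combined rows.
INSERT INTO DeptJobs VALUES ('d1', 'j1'), ('d1', 'j2');
INSERT INTO DeptParts VALUES ('d1', 'p1'), ('d1', 'p2');

-- Adding a part is now a single-row change:
INSERT INTO DeptParts VALUES ('d1', 'p3');
""")
print(conn.execute("SELECT COUNT(*) FROM DeptParts").fetchone()[0])  # → 3
```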

5.8 Fifth Normal Form (5NF)

Fifth Normal Form (5NF), also called the Join-Projection Normal Form or the Projection-Join Normal Form, is based on the idea of a lossless JOIN or the lack of a join-projection anomaly. This problem occurs when you have an n-way relationship, where (n > 2). A quick check for 5NF is to see if the table is in 3NF, and all the candidate keys are single columns.

As an example of the problems solved by 5NF, consider a table of house notes that records the buyer, the seller, and the lender:

HouseNotes

buyer seller lender
===============================
'Smith' 'Jones' 'NationalBank'
'Smith' 'Wilson' 'HomeBank'
'Nelson' 'Jones' 'HomeBank'

This table is a three-way relationship, but because older CASE tools allow only binary relationships, it might have to be expressed in an E-R diagram as three binary relationships, which would generate CREATE TABLE statements leading to these tables:

BuyerLender

buyer lender
=======================
'Smith' 'NationalBank'
'Smith' 'HomeBank'
'Nelson' 'HomeBank'

SellerLender

seller lender
=======================
'Jones' 'NationalBank'
'Wilson' 'HomeBank'
'Jones' 'HomeBank'

BuyerSeller

buyer seller
==================
'Smith' 'Jones'
'Smith' 'Wilson'
'Nelson' 'Jones'

The trouble is that when you try to assemble the original information by joining pairs of these three tables together, thus:

SELECT BS.buyer, SL.seller, BL.lender
FROM BuyerLender AS BL,
SellerLender AS SL,
BuyerSeller AS BS
WHERE BL.buyer = BS.buyer
AND BL.lender = SL.lender
AND SL.seller = BS.seller;

you will recreate all the valid rows in the original table, such as ('Smith', 'Jones', 'NationalBank'), but there will also be false rows, such as ('Smith', 'Jones', 'HomeBank'), which were not part of the original table. This is called a join-projection anomaly.
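The anomaly is easy to reproduce; this sketch loads the three binary tables into SQLite from Python and runs the three-way join:

```python
import sqlite3

# Build the three binary projections of HouseNotes and join them back.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BuyerLender (buyer CHAR(10), lender CHAR(15));
CREATE TABLE SellerLender (seller CHAR(10), lender CHAR(15));
CREATE TABLE BuyerSeller (buyer CHAR(10), seller CHAR(10));

INSERT INTO BuyerLender VALUES
 ('Smith', 'NationalBank'), ('Smith', 'HomeBank'), ('Nelson', 'HomeBank');
INSERT INTO SellerLender VALUES
 ('Jones', 'NationalBank'), ('Wilson', 'HomeBank'), ('Jones', 'HomeBank');
INSERT INTO BuyerSeller VALUES
 ('Smith', 'Jones'), ('Smith', 'Wilson'), ('Nelson', 'Jones');
""")
rows = conn.execute("""
 SELECT BS.buyer, SL.seller, BL.lender
   FROM BuyerLender AS BL, SellerLender AS SL, BuyerSeller AS BS
  WHERE BL.buyer = BS.buyer
    AND BL.lender = SL.lender
    AND SL.seller = BS.seller""").fetchall()
print(len(rows))                               # → 4: three true rows plus one
print(('Smith', 'Jones', 'HomeBank') in rows)  # → True: the false row
```

The original HouseNotes table had three rows; the join returns four, and the extra one is the spurious ('Smith', 'Jones', 'HomeBank').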

There are also strong JPNF and overstrong JPNF, which make use of JOIN dependencies (JDs). Unfortunately, there is no systematic way to find a JPNF or 4NF schema, because the problem is known to be NP-complete. This is a mathematical term meaning that as the number of elements in the problem increases, the effort required to solve it grows so quickly that you cannot count on finding a general answer with reasonable resources.

As an aside, Third Normal Form is very popular with CASE tools and most of them can generate a schema where all of the tables are in 3NF. They obtain the FDs from an E-R (entity-relationship) diagram or from a statistical analysis of the existing data, then put them together into tables and check for normal forms.

The bad news is that it is often possible to derive more than one 3NF schema from a set of FDs. Most of the CASE tools that produce an E-R diagram will find only one of them and go no further. However, if you use an ORM (Object Role Model) tool properly, the schema will be in 5NF. I strongly suggest that you get any of the books by Terry Halpin on this technique.

5.9 Domain-Key Normal Form (DKNF)

Ronald Fagin defined Domain/Key Normal Form (DKNF) in 1981 as a schema having all of the domain constraints and functional dependencies enforced. There is not yet a general algorithm that will always generate the DKNF solution given a set of constraints. We can, however, determine DKNF in many special cases, and it is a good guide to writing DDL in the real world.

Let’s back up a bit and look at the mathematical model under normalization. A functional dependency has axioms that can be used in normalization problems. These six axioms, known as Armstrong’s axioms, are given below:

Reflexive: X → X

Augmentation: if X → Y, then XZ → Y

Union: if (X → Y and X → Z) then X → YZ

Decomposition: if X → Y and Z a subset of Y, then X → Z

Transitivity: if (X → Y and Y → Z) then X → Z

Pseudo-transitivity: if (X → Y and YZ → W) then XZ → W

They make good sense if you just look at them, which is something we like in a set of axioms. In the real world, the FDs are the business rules we are trying to model.

In the normalization algorithm for 3NF (developed by P. A. Bernstein, 1976), we use the axioms to get rid of redundant FDs. For example, if we are given:

A → B
A → C
B → C
DB → E
DAF → E

A → C is redundant because it can be derived from A → B and B → C with transitivity. Likewise, DAF → E is redundant because it can be derived from A → B and DB → E with pseudo-transitivity (which gives us DA → E) and augmentation (which then allows DAF → E). What we would like to find is the smallest set of FDs from which we can generate all of the given rules. This is called a nonredundant cover. For the FDs above, one cover would be:

A → B
B → C
DB → E

Once we have done this, Bernstein shows that we can simply create a table for each of the FDs, with A, B, and DB as the respective keys. We have taken it easy so far, but now it is time for a challenge.
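The redundancy elimination above can be sketched in a few lines of Python: an FD is redundant exactly when the remaining FDs already imply it, which the attribute-closure test detects. This is my own sketch of the idea, not Bernstein’s full algorithm:

```python
def closure(attrs, fds):
    # All attributes reachable from `attrs` by firing the FDs.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# The example set: A -> B, A -> C, B -> C, DB -> E, DAF -> E
fds = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('DB', 'E'), ('DAF', 'E')]

cover = list(fds)
for fd in fds:
    lhs, rhs = fd
    rest = [f for f in cover if f is not fd]
    if set(rhs) <= closure(lhs, rest):
        cover = rest  # fd is derivable from the rest, so drop it

print(cover)  # [('A', 'B'), ('B', 'C'), ('DB', 'E')]
```

The three surviving FDs are exactly the nonredundant cover given above, and each left-hand side becomes the key of its own table.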

As an example of a problem with multiple Third Normal Form (3NF) answers, here is an exercise that DBStar Corporation (now Evoke Software) used in a demonstration that came with their CASE tool.

We are given an imaginary and simplified airline that has a database for scheduling flights and pilots. Most of the relationships are obvious things. Flights have only one departure time and one destination. They can get a different pilot and can be assigned to a different gate each day of the week. The functional dependencies for the database are given below:

1) flight → destination
2) flight → hour
3) (day, flight) → gate
4) (day, flight) → pilot
5) (day, hour, pilot) → gate
6) (day, hour, pilot) → flight
7) (day, hour, pilot) → destination
8) (day, hour, gate) → pilot
9) (day, hour, gate) → flight
10) (day, hour, gate) → destination

A purist looking at this collection of FDs will be bothered by the redundancies in the list. But in the real world, when you interview people, they do not speak in minimal sets; they state things that they know to be true in their situation. In fact, they very often leave out relationships that they consider too obvious to mention.

Your problem is to find 3NF or stronger database schemas in these FDs. You have to be careful! You must account for all of the columns, obviously, but your answer could be in 3NF and still ignore some of the FDs. For example, this will not work:

CREATE TABLE PlannedSchedule
(flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE ActualSchedule
(day, flight, gate, pilot, PRIMARY KEY (day, flight));

If we apply the Union axiom to some of the FDs, we get:

(day, hour, gate) → (destination, flight, pilot)

(day, hour, pilot) → (destination, flight, gate)

This says that the user has required that if we are given a day, an hour, and a gate, we should be able to determine a unique flight for that day, hour, and gate. We should also be able to determine a unique flight given a day, hour, and pilot.

Given the PlannedSchedule and ActualSchedule tables, you cannot produce views where either of the two constraints we just mentioned is enforced. If the query “What flight does pilot X have on day Y and hour Z?” gives you more than one answer, it violates the FDs and common sense. Here is an example of data that is allowable in this proposed schema but undesirable given our constraints:

PlannedSchedule

flight  hour   destination
118     05:00  Dallas
123     01:00  Omaha
155     05:00  Los Angeles
171     01:00  New York
666     01:00  Atlanta

ActualSchedule

day  flight  pilot  gate
Wed  118     Tom    12A
Wed  155     Tom    13B
Wed  171     Tom    12A
Thu  123     John   12A
Thu  155     John   12A
Thu  171     John   13B


The constraints mean that we should be able to find a unique answer to each of the following questions and not lose any information when inserting and deleting data.

(1) Which flight is leaving from gate 12A on Thursdays at 13:00 Hrs? This looks fine until you realize that you do not know about flight 666, which was not required to have anything about its day or pilot in the ActualSchedule table. Likewise, I can add a flight to the ActualSchedule table that has no information in the PlannedSchedule table.

(2) Which pilot is assigned to the flight that leaves gate 12A on Thursdays at 13:00 Hrs? This has the same problem as before.

(3) What is the destination of the flight in queries 1 and 2? This has the same problem as before.

(4) What gate is John leaving from on Thursdays at 13:00 Hrs?

(5) Where is Tom flying to on Wednesdays at 17:00 Hrs?

(6) What flight is assigned to Tom on Wednesdays at 17:00 Hrs?

It might help if we give an example of how one of the FDs in the problem can be derived using the axioms of FD calculus, just as you would do a geometry proof:

Given:

1) (day, hour, gate) → pilot
2) (day, hour, pilot) → flight
prove that:
(day, hour, gate) → flight.
3) (day, hour) → (day, hour); Reflexive
4) (day, hour, gate) → (day, hour); Augmentation on 3
5) (day, hour, gate) → (day, hour, pilot); Union 1 & 4
6) (day, hour, gate) → flight; Transitive 2 and 5
Q.E.D.

The answer is to start by attempting to derive each of the functional dependencies from the rest of the set. What we get is several short proofs, each requiring different “given” functional dependencies in order to get to the derived FD.

Here is a list of the proofs used to derive the eight derivable FDs in the problem. With each derivation, we include every step and the FD-calculus operation that justifies it. One additional operation used here, which was not included in the axioms listed earlier, is left reduction: if XX → Y, then X → Y. The reason it was not included is that left reduction is actually a theorem, not one of the basic axioms (side problem: can you derive left reduction?).

Prove: (day, hour, pilot) → gate
a) day → day; Reflexive
b) (day, hour, pilot) → day; Augmentation (a)
c) (day, hour, pilot) → (day, flight); Union (6, b)
d) (day, hour, pilot) → gate; Transitive (c, 3)
Q.E.D.
Prove: (day, hour, gate) → pilot
a) day → day; Reflexive
b) (day, hour, gate) → day; Augmentation (a)
c) (day, hour, gate) → (day, flight); Union (9, b)
d) (day, hour, gate) → pilot; Transitive (c, 4)
Q.E.D.
Prove: (day, flight) → gate
a) (day, flight, pilot) → gate; Pseudotransitivity (2, 5)
b) (day, flight, day, flight) → gate; Pseudotransitivity (a, 4)
c) (day, flight) → gate; Left reduction (b)
Q.E.D.
Prove: (day, flight) → pilot
a) (day, flight, gate) → pilot; Pseudotransitivity (2, 8)
b) (day, flight, day, flight) → pilot; Pseudotransitivity (a, 3)
c) (day, flight) → pilot; Left reduction (b)
Q.E.D.
Prove: (day, hour, gate) → flight
a) (day, hour) → (day, hour); Reflexivity
b) (day, hour, gate) → (day, hour); Augmentation (a)
c) (day, hour, gate) → (day, hour, pilot); Union (b, 8)
d) (day, hour, gate) → flight; Transitivity (c, 6)
Q.E.D.
Prove: (day, hour, pilot) → flight
a) (day, hour) → (day, hour); Reflexivity
b) (day, hour, pilot) → (day, hour); Augmentation (a)
c) (day, hour, pilot) → (day, hour, gate); Union (b, 5)
d) (day, hour, pilot) → flight; Transitivity (c, 9)
Q.E.D.
Prove: (day, hour, gate) → destination
a) (day, hour, gate) → destination; Transitivity (9, 1)
Q.E.D.
Prove: (day, hour, pilot) → destination
a) (day, hour, pilot) → destination; Transitivity (6, 1)
Q.E.D.

Now that we have shown how to derive 8 of the 10 FDs from the others, you can try mixing and matching the FDs into sets so that each set meets the following criteria:

(1) Each attribute must be represented on either the left or the right side of at least one FD in the set.

(2) If a given FD is included in the set, then all of the FDs needed to derive it cannot also be included.

(3) If a given FD is excluded from the set, then the FDs used to derive it must be included.

This produces a set of “nonredundant covers,” which can be found with trial, error, and common sense. For example, if we exclude (day, hour, gate) → flight, we must then include (day, hour, gate) → pilot and vice versa, because each is used in the other’s derivation. If you want to be sure your search is exhaustive, however, you may want to apply a more mechanical method, which is what the CASE tools do for you.

The algorithm for accomplishing this task is basically to generate all of the combinations of sets of FDs. (flight → destination) and (flight → hour) are excluded from the combination generation because they cannot be derived. This gives us 2^8 = 256 combinations of the remaining FDs. Each combination is then tested against the criteria.

Fortunately, a simple spreadsheet can do all of the tedious work. In this problem, criterion #1 eliminates only 15 sets, criterion #2 eliminates 152 sets, and criterion #3 drops another 67. This leaves us with 22 possible covers, five of which are the answers we are looking for (we will explain the other 17 later).

These five nonredundant covers are:

Set I:
flight → destination
flight → hour
(day, hour, gate) → flight
(day, hour, gate) → pilot
(day, hour, pilot) → gate
Set II:
flight → destination
flight → hour
(day, hour, gate) → pilot
(day, hour, pilot) → flight
(day, hour, pilot) → gate
Set III:
flight → destination
flight → hour
(day, flight) → gate
(day, flight) → pilot
(day, hour, gate) → flight
Set IV:
flight → destination
flight → hour
(day, flight) → gate
(day, hour, gate) → pilot
(day, hour, pilot) → flight
Set V:
flight → destination
flight → hour
(day, flight) → pilot
(day, hour, gate) → flight
(day, hour, pilot) → gate
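Nonredundancy can be checked mechanically: a set fails if any one of its FDs is implied by the others. Here is a sketch in Python (my own illustration), using Set I as the example and single letters for the attributes (d = day, h = hour, g = gate, p = pilot, f = flight, t = destination):

```python
def closure(attrs, fds):
    # All attributes reachable from `attrs` by firing the FDs.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_nonredundant(fds):
    for fd in fds:
        lhs, rhs = fd
        rest = [f for f in fds if f is not fd]
        if set(rhs) <= closure(lhs, rest):
            return False  # fd is derivable from the others
    return True

# Set I: f -> t, f -> h, dhg -> f, dhg -> p, dhp -> g
set_1 = [('f', 't'), ('f', 'h'),
         ('dhg', 'f'), ('dhg', 'p'), ('dhp', 'g')]
print(is_nonredundant(set_1))  # True

# Adding a derivable FD such as (day, hour, gate) -> destination breaks it
print(is_nonredundant(set_1 + [('dhg', 't')]))  # False
```

The same test applied to every candidate subset is what the spreadsheet search described earlier is doing by hand.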

At this point, we perform unions on FDs with the same left-hand side and make a table for each grouping, with the left-hand side as the key. We can also eliminate symmetrical FDs (defined as X → Y and Y → X, and written with a two-headed arrow, X ↔ Y) by collapsing them into the same table.

These possible schemas are all in at least 3NF. They are given in shorthand SQL DDL (Data Definition Language) without data type declarations.

Solution 1:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot,
PRIMARY KEY (day, hour, gate),
UNIQUE (day, hour, pilot),
UNIQUE (day, flight),
UNIQUE (flight, hour));
Solution 2:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY
(flight));
CREATE TABLE R2 (day, flight, gate, pilot,
PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight,
PRIMARY KEY (day, hour, gate),
UNIQUE (day, flight),
UNIQUE (flight, hour));
CREATE TABLE R4 (day, hour, pilot, flight,
PRIMARY KEY (day, hour, pilot));
Solution 3:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, pilot,
PRIMARY KEY (day, hour, gate),
UNIQUE (day, hour, pilot));
CREATE TABLE R4 (day, hour, pilot, flight,
PRIMARY KEY (day, hour, pilot),
UNIQUE(day, flight),
UNIQUE (flight, hour));
Solution 4:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, pilot, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight,
PRIMARY KEY (day, hour, gate),
UNIQUE (flight, hour));
CREATE TABLE R4 (day, hour, pilot, gate,
PRIMARY KEY (day, hour, pilot));

Once you look at these solutions, you can see that they are a mess, but they are a Third Normal Form mess! Is there a better answer? Here is one in BCNF with only two tables, proposed by Chris Date (Relational Database Writings 1991-1994, ISBN 0-201-82459-0, p. 224).

CREATE TABLE DailySchedules (flight, destination, hour, PRIMARY KEY (flight));

CREATE TABLE PilotSchedules (day, flight, gate, pilot, PRIMARY KEY (day, flight));

This is a workable schema. But we could expand the constraints to give us better performance and more precise error messages, since schedules are not likely to change:

CREATE TABLE DailySchedules
(flight, hour, destination,
UNIQUE (flight, hour, destination),
UNIQUE (flight, hour),
UNIQUE (flight));
CREATE TABLE PilotSchedules
(day, flight, hour, gate, pilot,
UNIQUE (day, flight, gate),
UNIQUE (day, flight, pilot),
UNIQUE (day, flight),
FOREIGN KEY (flight, hour) REFERENCES DailySchedules (flight, hour));
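Date’s two-table design can be exercised directly. Here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in DBMS, with concrete data types added (the shorthand DDL above omits them) and sample rows taken from the tables earlier in the section:

```python
import sqlite3

# Date's two-table BCNF design, with data types filled in for SQLite.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DailySchedules
 (flight CHAR(3) NOT NULL PRIMARY KEY,
  destination VARCHAR(20) NOT NULL,
  hour CHAR(5) NOT NULL);
CREATE TABLE PilotSchedules
 (day CHAR(3) NOT NULL,
  flight CHAR(3) NOT NULL REFERENCES DailySchedules (flight),
  gate CHAR(3) NOT NULL,
  pilot VARCHAR(20) NOT NULL,
  PRIMARY KEY (day, flight));
""")
con.execute("INSERT INTO DailySchedules VALUES ('118', 'Dallas', '05:00')")
con.execute("INSERT INTO PilotSchedules VALUES ('Wed', '118', '12A', 'Tom')")

# The key (day, flight) forbids scheduling the same flight twice on one day.
try:
    con.execute("INSERT INTO PilotSchedules VALUES ('Wed', '118', '13B', 'John')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

Any SQL product would enforce the same keys; SQLite is used here only because it ships with Python.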

5.10 Practical Hints for Normalization

CASE tools implement formal methods for doing normalization. In particular, E-R (Entity-Relationship) diagrams are very useful for this. However, a few informal hints can help speed up the process and give you a good start.

Broadly speaking, tables represent either entities or relationships, or they are auxiliary tables. This is why E-R diagrams work so well as a design tool. The auxiliary tables do not show up on the diagrams, since they are functions, translations, and look-ups that support a declarative computational model.

The tables that represent entities should have a simple, immediate name suggested by their contents—a table named Students has student data in it, not student data and their bowling scores. It is also a good idea to use plural or collective nouns as the names of such tables to remind you that a table is a set of entities; the rows are the single instances of them.

Tables that represent relationships should be named by their contents and should be as minimal as possible. For example, Students are related to Courses by a third (relationship) table for their attendance. These tables might represent a pure relationship, or they might contain attributes that exist within the relationship, such as a student_grade for the class attended. Since the only way to get a student_grade is to attend the class, the relationship is going to have a compound key made up of references to the entity keys. We will probably name it “ReportCards,” “Grades,” or something similar. Avoid naming tables for many-to-many relationships by gluing the two entity names together; “Students_Courses” is an easy but really bad name for the “Enrollment” entity.

Avoid NULLs whenever possible. If a table has too many NULL-able columns, it is probably not normalized properly. Try to use a NULL only for a value that is missing now but will be resolved later. Even better, put missing values into the encoding schemes for that column. I have a whole book that covers this topic, SQL Programming Style (ISBN 978-0120887972), and mention it in other books.

As a gross generalization, normalized databases tend to have many tables with a small number of columns per table. Do not panic when you see that happen. People who first worked with file systems (particularly on computers that used magnetic tape) tended to design one monster file for an application and do all the work against its records. This made sense in the old days, since there was no reasonable way to JOIN a number of small files together without having the computer operator mount and dismount lots of different magnetic tapes. The habit of designing this way carried over to disk systems, since the procedural programming languages were still the same for databases as they had been for sequential file systems.

The same nonkey attribute in more than one table is probably a normalization problem. This is not a certainty, just a guideline. The key that determines that attribute should be in only one table, and therefore its attributes should be with it. The key attributes will be referenced by related tables.

As a practical matter, you are apt to see the same attribute under different names and will need to make the names uniform across the entire database. The columns “date_of_birth,” “birthdate,” “birthday,” and “dob” are very likely the same attribute of an employee. You now have the ISO-11179 standard for naming guidelines, as discussed in SQL Programming Style (Morgan Kaufmann, 2005; ISBN 978-0120887972).

5.11 Non-Normal Form Redundancy

Department of Redundancy Department

The Firesign Theatre

The goal of databases, not just relational databases, has always been to remove redundancy in the data. When we used punch cards and magnetic tapes, the data were constantly being sorted and duplicated so that batch jobs for various departments could run in parallel.

But any error in one of these data silos meant that they could not be summarized together; a single typing error, timing problems, and other glitches could throw the whole system out of balance. I summarize the goal of a database as “one fact, one way, one time, one place.”

Normalization prevents some redundancy in a table. But not all redundancy is based on Normal Forms. In Section 5.4, we saw how a computed column could be used to replace a base column when the base column is a computation.

5.11.1 Aggregation Level Redundancy

A common example is the “Invoices” and “Invoice_Details” idiom, which puts detail summary data in the order header. This is usually a column for “invoice_total” that has to be recomputed whenever an order item changes. What has happened is a confusion in levels of aggregation.

CREATE TABLE Invoices
(invoice_nbr CHAR(15) NOT NULL PRIMARY KEY,
customer_name VARCHAR(35) NOT NULL,
invoice_terms CHAR(7) NOT NULL
CHECK (invoice_terms IN ('cash', 'credit', 'coupon')),
invoice_amt_tot DECIMAL(12,2) NOT NULL);
CREATE TABLE Invoice_Details
(invoice_nbr CHAR(15) NOT NULL
REFERENCES Invoices (invoice_nbr)
ON DELETE CASCADE,
line_nbr INTEGER NOT NULL
CHECK (line_nbr > 0),
item_gtin CHAR(15) NOT NULL,
-- PRIMARY KEY (invoice_nbr, line_nbr),
-- PRIMARY KEY (invoice_nbr, item_gtin),
invoice_qty INTEGER NOT NULL
CHECK (invoice_qty > 0),
unit_price DECIMAL(12,2) NOT NULL);

There is redundancy in line_nbr and item_gtin as components of a key. The invoice line numbers are physical locations on paper forms or a screen. A line number lets you place one product (item_gtin) in several places on the order form. Line numbers are not part of a logical data model.

But did you notice that Invoices.invoice_amt_tot = SUM (Invoice_Details.invoice_qty * Invoice_Details.unit_price)?
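One way to remove that redundancy is to drop invoice_amt_tot from the header and derive it on demand. Here is a sketch, again using Python’s sqlite3 module as a stand-in DBMS; the sample rows and the Invoice_Totals view name are my own:

```python
import sqlite3

# Derive the invoice total with a VIEW instead of storing it in the
# header, so a changed line item can never leave a stale total behind.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Invoices
 (invoice_nbr CHAR(15) NOT NULL PRIMARY KEY,
  customer_name VARCHAR(35) NOT NULL);
CREATE TABLE Invoice_Details
 (invoice_nbr CHAR(15) NOT NULL REFERENCES Invoices (invoice_nbr),
  line_nbr INTEGER NOT NULL CHECK (line_nbr > 0),
  item_gtin CHAR(15) NOT NULL,
  invoice_qty INTEGER NOT NULL CHECK (invoice_qty > 0),
  unit_price DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (invoice_nbr, line_nbr));
CREATE VIEW Invoice_Totals (invoice_nbr, invoice_amt_tot)
AS SELECT invoice_nbr, SUM(invoice_qty * unit_price)
     FROM Invoice_Details
    GROUP BY invoice_nbr;
""")
con.execute("INSERT INTO Invoices VALUES ('I-1', 'Acme')")
con.execute("INSERT INTO Invoice_Details VALUES ('I-1', 1, 'G-1', 2, 10.00)")
con.execute("INSERT INTO Invoice_Details VALUES ('I-1', 2, 'G-2', 1, 5.00)")
total = con.execute(
    "SELECT invoice_amt_tot FROM Invoice_Totals WHERE invoice_nbr = 'I-1'"
).fetchone()[0]
print(total)  # 25.0
```

The total now exists in exactly one place: the detail rows that define it.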

5.11.2 Entire Table Redundancy

Entire tables can be redundant. This often happens when there are two different ways to identify the same entity.

CREATE TABLE Map
(location_id CHAR(15) NOT NULL PRIMARY KEY,
location_name VARCHAR(35) NOT NULL,
location_longitude DECIMAL(9,5) NOT NULL,
location_latitude DECIMAL(9,5) NOT NULL);

The location_id is the key. This might be an HTM (Hierarchical Triangular Mesh) number or a SAN (Standard Address Number, used in the book trade and other industries). I can use a formula to compute the distance between two locations with this table. But I can also build a table of straight-line distances directly:

CREATE TABLE Paths
(origin_loc CHAR(15) NOT NULL,
dest_loc CHAR(15) NOT NULL,
straight_line_dist DECIMAL(10,3) NOT NULL,
PRIMARY KEY (origin_loc, dest_loc));

This is an actual case from Tom Johnston. The (longitude, latitude) coordinate pairs would get out of alignment with the distance computations because they were maintained by two different people. The solution was a VIEW that constructs Paths when needed.
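If the coordinates are kept as the single source of truth, the Paths data can always be derived instead of stored. Here is a sketch in Python; I use the haversine great-circle formula as one plausible distance function (the column in the table was a straight-line distance, and the location codes and coordinates below are illustrative, not from the text):

```python
import math

# Keep only the Map of coordinates; derive Paths on demand so the two
# can never drift apart.
def distance_km(a, b):
    # Haversine great-circle distance between (lat, lon) pairs in degrees.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # Earth radius ~6371 km

# Hypothetical location codes with (latitude, longitude) coordinates.
locations = {'DFW': (32.8998, -97.0403), 'OMA': (41.3032, -95.8941)}

# The "Paths" table becomes a derived result, not stored data.
paths = {(o, d): distance_km(locations[o], locations[d])
         for o in locations for d in locations if o != d}
```

In SQL, the same idea is a VIEW over a self-join of Map, with the distance formula in the SELECT list.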

5.11.3 Access Path Redundancy

A more subtle redundancy is in the roles an entity plays in a data model. Try this example: a sales team is responsible for every customer that a member of that team (a salesperson) is assigned to, and not responsible for any other customer. Here is the E-R diagram.

Now look at the redundant relationship. We have options.

1. Eliminate the redundancy: remove the is-responsible-for relationship from Sales-Team to Customer. The model is just as expressive as it was before the redundancy was eliminated.

2. Control the redundancy: add DRI actions. Because the two foreign keys that must be kept synchronized are in the same row, only one update is required.

Here is the DDL for the possible solutions:

CREATE TABLE Sales_Teams
(sales_team_id INTEGER NOT NULL PRIMARY KEY,
sales_team_name CHAR(10) NOT NULL);
CREATE TABLE Salespersons
(sales_person_id INTEGER NOT NULL PRIMARY KEY,
sales_person_name CHAR(15) NOT NULL,
sales_team_id INTEGER NOT NULL
REFERENCES Sales_Teams(sales_team_id)
ON UPDATE CASCADE);
CREATE TABLE Customers
(customer_id INTEGER NOT NULL PRIMARY KEY,
sales_team_id INTEGER NOT NULL
REFERENCES Sales_Teams(sales_team_id)
ON UPDATE CASCADE,
sales_person_id INTEGER
REFERENCES Salespersons(sales_person_id)
ON UPDATE CASCADE
ON DELETE SET NULL);

Another possible schema:

CREATE TABLE Sales_Teams
(sales_team_id INTEGER NOT NULL PRIMARY KEY,
sales_team_name CHAR(10) NOT NULL);
CREATE TABLE Salespersons
(sales_person_id INTEGER NOT NULL PRIMARY KEY,
sales_person_name CHAR(15) NOT NULL,
sales_team_id INTEGER NOT NULL
REFERENCES Sales_Teams(sales_team_id)
ON UPDATE CASCADE,
UNIQUE (sales_person_id, sales_team_id));
CREATE TABLE Customers
(customer_id INTEGER NOT NULL PRIMARY KEY,
sales_team_id INTEGER NOT NULL,
sales_person_id INTEGER
REFERENCES Salespersons(sales_person_id)
ON UPDATE CASCADE ON DELETE SET NULL,
FOREIGN KEY (sales_person_id, sales_team_id)
REFERENCES Salespersons (sales_person_id, sales_team_id)
ON UPDATE CASCADE);
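The effect of the overlapping composite FOREIGN KEY in the second schema can be demonstrated with Python’s sqlite3 module. Note that SQLite leaves foreign-key checking off unless you turn it on; the DRI actions are omitted here for brevity, and the sample rows are invented:

```python
import sqlite3

# The composite FOREIGN KEY forces a customer's sales_team_id to match
# the team of its assigned salesperson.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
con.executescript("""
CREATE TABLE Sales_Teams
 (sales_team_id INTEGER NOT NULL PRIMARY KEY,
  sales_team_name CHAR(10) NOT NULL);
CREATE TABLE Salespersons
 (sales_person_id INTEGER NOT NULL PRIMARY KEY,
  sales_person_name CHAR(15) NOT NULL,
  sales_team_id INTEGER NOT NULL
   REFERENCES Sales_Teams (sales_team_id),
  UNIQUE (sales_person_id, sales_team_id));
CREATE TABLE Customers
 (customer_id INTEGER NOT NULL PRIMARY KEY,
  sales_team_id INTEGER NOT NULL,
  sales_person_id INTEGER,
  FOREIGN KEY (sales_person_id, sales_team_id)
   REFERENCES Salespersons (sales_person_id, sales_team_id));
""")
con.execute("INSERT INTO Sales_Teams VALUES (1, 'East'), (2, 'West')")
con.execute("INSERT INTO Salespersons VALUES (10, 'Smith', 1)")
con.execute("INSERT INTO Customers VALUES (100, 1, 10)")  # team matches: OK

try:  # team 2 does not match salesperson 10's team, so the row is rejected
    con.execute("INSERT INTO Customers VALUES (101, 2, 10)")
    mismatch_rejected = False
except sqlite3.IntegrityError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```

Because both foreign-key columns live in the same Customers row, a single UPDATE keeps them synchronized, which is exactly the point of the design.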

5.11.4 Attribute Splitting

Would you have a “Female_Personnel” and a “Male_Personnel” table in a schema? No, of course not! We need one “Personnel” table, not two tables constructed by using the values of a “sex_code” column to split the table names. Making a table per calendar year (or month) is very common, because it mimics the way we handled magnetic tapes. Another common split is by physical data source, such as each store in an enterprise.

Chris Date discusses this under “orthogonal design” and I call it “Attribute Splitting,” but I use the term for more general errors than just split tables.

This is not disk partitioning! Partitioning is a physical vendor feature for accessing the data; the table is still one logical unit in the schema.
