You won’t get too far into most database design books without reading a discussion on normalization and the Normal Forms. The Normal Forms are a series of forms, or table design specifications, that describe the best way to lay out data in a database. The higher the normal form, the more normalized the database is. Each form builds on the previous one, adding additional rules or conditions that must be met. Normalization is the process of removing data duplication, more clearly defining key relationships, and generally moving towards a more idealized database form. It is possible for different tables in the same database to be at different levels.
Most people recognize five normal forms simply referred to as the First Normal Form through the Fifth Normal Form. These are often abbreviated 1NF through 5NF. There are also a few named forms, such as the Boyce-Codd Normal Form (BCNF). Most of these other forms are roughly equivalent to one of the numbered forms. For example, BCNF is a slight extension to the Third Normal Form. Some folks also recognize higher levels of normalization, such as a Sixth Normal Form and beyond, but these extreme levels of normalization are well beyond the practical concerns of most database designers.
The normalization process is useful for two reasons. First, normalization specifies design criteria that can act as a guide in the design process. If you have a set of tables that are proving to be difficult to work with, that often points to a deeper design problem or assumption. The normalization process provides a set of rules and conditions that can help identify trouble spots, as well as provide possible solutions to reorganize the data in a more consistent and clean fashion.
The other advantage, which shows up more at
runtime, is that data integrity is much easier to enforce and maintain
in a normalized database. Although the overall database design is
often more complex (i.e., more tables), the individual parts are
usually much simpler and fill more clearly defined roles. This often
translates to better INSERT
,
UPDATE
, and DELETE
performance, since changes are
often smaller and more localized.
Localizing data is a core goal of data normalization. Most of the normal forms deal with eliminating redundant or replicated data so that each unique token of data is stored once—and only once—in the database. Everything else simply references that definitive copy. This makes updates easier, since there is only one place an update needs to be applied, but it also makes the data more consistent, as it is impossible for multiple copies of the data to become out of sync with each other. When working on a schema design, a question you should constantly ask yourself is, “If this piece of data changes, how many different places will I need to make that change?” If the answer is anything other than one, chances are you’re not in Normal Form.
Normalizing a database and spreading the data out into different tables means that queries usually involve joining several tables back together. This can occasionally lead to performance concerns, especially for complex reports that require data from a large number of tables. These concerns can sometimes lead to the process of denormalization, where duplicate copies of the same data are intentionally introduced to reduce the number of joins required for common queries. This is typically done on systems that are primarily read-only, such as data-warehouse databases, and is often done by computing temporary tables from properly normalized source data.
While a large number of joins can lead to performance concerns, database optimization is just like code optimization—don’t start too early and don’t make assumptions. In general, the advantages of normalization far outweigh the costs. A correct database that runs a tad slower is infinitely more useful than a very fast database that returns incorrect or inconsistent answers.
The First Normal Form, or 1NF, is the lowest level of normalization. It is primarily concerned with making sure a table is in the proper format. There are three conditions that must be met for a table to be in 1NF.
The first condition relates to ordering. To be in 1NF, the individual rows of a table cannot have any meaningful or inherent order. Each row should be an isolated, standalone record. The meaning of a value in one row cannot depend on any of the data values from neighboring rows, either by insertion order, or by some sorted order. This condition is usually easy to meet, as SQL does not guarantee any kind of row ordering.
The second condition is uniqueness. Every row
within a 1NF table must be unique, and must be unique by those columns
that hold meaningful data for the application. For example, if the
only difference between two rows is the database-maintained ROWID
column, then the rows aren’t
really unique. However, it is perfectly fine to consider an arbitrary
sequence ID (such as an INTEGER PRIMARY
KEY
) to be part of the application data. This
condition establishes that the table must have some type of PRIMARY KEY
, consisting of one or more
columns that creates a unique definition of what the table
represents.
The third and final condition for 1NF requires that every column of every row holds one (and only one) logical value that cannot be broken down any further. The concern is not with compound types, such as dates (which might be broken down into integer day, month, and year values) but with arrays or lists of logical values. For example, you shouldn’t be recording a text value that contains a comma-separated list of logical, independent values. Arrays or lists should be broken out into their own one-to-many relationships.
The Second Normal Form, or 2NF, deals with compound keys (multicolumn keys) and how other columns relate to such keys. 2NF has only one condition: every column that is not part of the primary key must be relevant to the primary key as a whole, and not just a sub-part of the key.
Consider a table that lists all the conference
rooms at a large corporate campus. At minimum, the conf_room
table has columns for
building_numb
and room_numb
. Taken together, these two
columns will uniquely identify any conference room across the whole
campus, so that will be our compound primary key.
Next, consider a column like seating_capacity
. The values in this
column are directly dependent on each specific conference room. That,
by definition, makes the column dependent on both the building number
and the room number. Including the seating_capacity
column will not break
2NF.
Now consider a column like building_address
. This column is
dependent on the building_numb
column, but it is not dependent on the
room_numb
column. Since
building_address
is dependent on only part of the
primary key, including this column in the conf_room
table would break 2NF.
Because 2NF is specifically concerned with multicolumn keys, any table with a single-column primary key that is in 1NF is automatically in 2NF.
To recognize a column that might be breaking 2NF,
look for columns that have duplicate values. If the duplicate values
tend to line up with duplicate values in one of the primary key
columns, that is a strong indication of a problem. For example, the
building_address
column will
have a number of duplicate values (assuming most buildings have more
than one conference room). The duplicate address values can be matched
to duplicate values in the building_numb
column. This alignment shows how the
address column is tied to only the building_numb
column specifically, and not the whole
primary key.
The Third Normal Form, or 3NF, extends the 2NF to eliminate transitive key dependencies. A transitive dependency is when A depends on B, and B depends on C, and therefore A depends on C. 3NF requires that each nonprimary key column has a direct (nontransitive) dependency on the primary key.
For example, consider an inventory database that
is used to track laptops at a small business. The laptop
table will have a primary key
that uniquely identifies each laptop, such as an inventory control
number. It is likely the table would have other columns that include
the make and model of the machine, the serial number, and perhaps a
purchase date. For our example, the
laptop
table will also
include a
responsible_
person_id
column. When an employee is
assigned a laptop, their employee ID number is put in this
column.
Within a row, the value of the responsible_person_id
column is
directly dependent on the primary key. In other words, each individual
laptop is assigned a specific responsible person, making the values in
the responsible_person_id
column
directly dependent on the primary key of the laptop
table.
Now consider what happens when we add a column
like responsible_person_email
. This
is a column that holds the email address of the responsible person.
The value of this column is still dependent on the primary key of the
laptop
table. Each individual
laptop has a specific responsible_person_email
field that is just as unique
as the responsible_person_id
field.
The problem is that the values in the responsible_person_email
column are
not
directly
dependent on an individual laptop. Rather, the email
column is tied to the responsible_person_id
, and the responsible_person_id
is, in turn,
dependent on the individual laptop. This transitive dependency breaks
3NF, indicating that the responsible_person_email
column doesn’t belong
there.
In the employee
table, we will also find both a person_id
column and an email
column. This is perfectly acceptable if the
person_id
is the primary key
(likely). That would make the email
column directly dependent on the primary key, keeping the table in
3NF.
A good way to recognize columns that may break 3NF
is to look for pairs or sets of unrelated columns that need to be kept
in sync with each other. Consider the laptop
table. If a system was reassigned to a new
person, you would always update both the responsible_person_id
column and the responsible_person_email
column. The
need to keep columns in sync with each other is a strong indication of
a dependency to each other, rather than to the primary key.
We’re not going to get into the details of BCNF, or the Fourth or Fifth (or beyond) Normal Forms, other than to mention that the Fourth and Fifth Normal Forms start to deal with inter-table relationships and how different tables interact with each other. Most database designers make a solid effort to get everything into 3NF and then stop worrying about it. It turns out that if you get the hang of things and tend to turn out table designs that are in 3NF, chances are pretty good that your tables will also meet the conditions for 4NF and 5NF, if not higher. To a large extent, the higher Normal Forms are formal ways of addressing some edge cases that are somewhat unusual, especially in simpler designs.
Although the conditions of the Normal Forms build on each other, the typical design process doesn’t actually iterate over the individual Forms. You don’t sit down with a new design and alter it until everything is 1NF, just to turn around and muck with the design until everything is 2NF, and so on, in a isolated step-by-step manner. Once you understand the ideas and concepts behind the First, Second, and Third Normal Forms, it becomes second nature to design directly to 3NF. Stepping over the conditions one at a time can help you weed out especially difficult trouble spots, but it doesn’t take long to gain a sense of when a design looks clean and when something “just ain’t right.”
The core concept to remember is that each table should try to represent one and only one thing. The primary key(s) for that table should uniquely and inherently identify the concept behind the table. All other columns should provide supporting data specific to that one concept. When speaking of the first three Normal Forms in a 1982 CACM article, William Kent wrote that each non-key column “ . . . must provide a fact about the key, the whole key, and nothing but the key.” If you incorporate only one formal aspect of database theory into your designs, that would be a great place to start.
3.129.210.17