Deciding on normalization

When starting with databases, one of the first things people learn is data normalization. Normalization makes a lot of sense because it helps to reduce redundancy and it ensures that data is structured in a professional and nice way.

However, in some cases, normalization is taken way too far.

The 7th normal form

As mentioned in many cases, normalization can be easily exaggerated. In such a case, performance will suffer and things will be really complex. Here is an example of normalization taken too far: a few years ago, a professor at a local university confronted me with an interesting example when I was talking to students. He told me that he uses the following example to explain to students that normalization can easily be taken too far. He invented the 7th normal form (which, of course, does not exist in professional database literature).

Here is an example:

test=# CREATE TABLE t_test (name text);
CREATE TABLE
test=# INSERT INTO t_test VALUES ('sealevel'),
INSERT 0 1

There is still some redundancy in sealevel. Some letters show up twice, so we can normalize that:

test=# CREATE TABLE t_char (id int, char char(1));
CREATE TABLE
test=# INSERT INTO t_char VALUES
  (1, 's'), (2, 'e'), (3, 'a'), (4, 'l'), (5, 'v'),
INSERT 0 5

In this table, one entry is made for each letter. In addition to this, a second table can be created to finally form the word:

test=# CREATE TABLE t_word (order_id int, char_id int);
CREATE TABLE

The sealevel can, therefore, be represented as the following table:

test=# INSERT INTO t_word VALUES
  (1, 1), (2, 2), (3, 3), (4, 4), (5, 2),
  (6, 5), (7, 2), (8, 4);
INSERT 0 8

Joining entries together will reveal the word:

test=# SELECT order_id, char
    FROM   t_word AS a, t_char AS b
    WHERE a.char_id = b.id
    ORDER BY order_id;
 order_id | char
----------+------
        1 | s
        2 | e
        3 | a
        4 | l
        5 | e
        6 | v
        7 | e
        8 | l
(8 rows)

What is the point of all this? The point is that you can normalize your data beyond recognition. But does it really make sense? In the example just outlined, everybody will agree that this is clearly useless and way beyond what is justified. Therefore, it is highly recommended if it is really worth introducing new relations, even if there is no chance of ever making changes to them.

Arrays versus normalizations

To avoid excessive normalization, you can turn your attention to arrays. Arrays are a simple, and in many cases, a good way to reduce the number of relations, especially when you don't need hard integrity.

What is the main benefit of an array and of avoiding joins? Keep in mind that when using a join, you have to read data from two tables. This often means additional I/O, and most importantly, you have to join the data, which leads to sorting, hashing, merging, and so on.

If data is stored inline in an array, there is no need to read from two tables and join, merge, and so on. You may be able to enjoy the simplicity of a sequential scan or the speed of a single-index lookup to retrieve the data needed.

Tip

Note that arrays are no Holy Grail! They are just a tool used to speed things up and a good option to improve data structures to prevent I/O problems.

To show how arrays can be used to reduce normalization, the following example may be beneficial to you. The goal is to come up with a data structure for storing which people have been to which countries:

test=# CREATE TABLE t_country (
  id       int,
  country_code   char(2)
);
CREATE TABLE

Then some data can be added to the table:

test=# INSERT INTO t_country VALUES
  (1, 'DE'), (2, 'AT'), (3, 'BE'), (4, 'NL'),
INSERT 0 4

In the next step, a table linking people and countries together is created:

test=# CREATE TABLE t_travel (
  person     text,
  country_id     int);   
CREATE TABLE

Basically, this is the standard way of describing a relation like that. Take a look at this example:

test=# CREATE TABLE t_travel (
  person     text,
  country_list   char(2)[]
);
CREATE TABLE

It is very unlikely that the name of a country will change, and even if it does, you may not care too much. Imagine that somebody has been to the Soviet Union. Even after its breakup, you would still not change the name of the country you have visited in retrospect. Therefore, integrity is really not an issue here.

The beauty is that you can save on all the expenses that might be caused by the join. In fact, in large-scale systems, random I/O can be a major bottleneck, which should be avoided. By keeping data close together, you can avoid a ton of problems.

In fact, too many joins on too much data can be a real source of trouble in a productive system because as soon as your server hits a random I/O limitation imposed on you by your disk, performance will drop dramatically.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.219.71.21