Deciding on data types and structure

After this brief introduction to organizing columns, it makes sense to discuss various other issues, such as choosing the right data type and normalization. Normalization is a standard technique for organizing data, and it has proven essential over the years. The idea is to avoid redundancy so that changes only have to be made in one place. To make your database project a success, deciding on proper types is essential and will prevent a great deal of trouble later on in the project. The goal of this chapter is to make you aware of the potential of using the right types.

Finding the right type

Finding the right data type sounds easy. However, there is a lot to consider, and there are some issues that might actually cause problems.

varchar versus text

The first two types to be compared are varchar and text. Have you ever seen a data structure like this?

test=# CREATE TABLE t_product (
  prodnr       int,
  name         varchar(20),
  description  varchar(4096)
);
CREATE TABLE

A product number defined as an integer can make a lot of sense (some people use alphanumeric product numbers, but never mind). Keeping the name limited to 20 characters is also reasonable. There might be real limitations here. Maybe a field on some printout is limited to 20 characters, or some label somewhere is limited to a 20-character title.

But what about the description? Does 4096 really impose a useful restriction here? I guess not. It is just an arbitrary number. What the person really meant is rather something like this:

description    text

The beauty of text is that it is not limited to a fixed number of characters; the only limit is 1 GB per entry. Hence, text can make life a little easier for software developers.
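If the 20-character limit really matters, varchar(20) will enforce it actively, while text simply accepts whatever comes in. Here is a quick sketch of the kind of error you would get:

test=# INSERT INTO t_product (prodnr, name)
         VALUES (1, 'a name clearly longer than twenty characters');
ERROR:  value too long for type character varying(20)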

numeric versus floating point

While varchar versus text was more a question of convenience, numeric versus floating point is a matter of life and death because it is all about correctness.

What is the difference anyway? Here is an example using float:

test=# SELECT '1000.1'::float - '1000' AS float;
       float       
-------------------
 0.100000000000023
(1 row)

What you see here is that the result is close, but not exact. The reason is the way floating-point numbers work.

Tip

The inner workings of floating-point numbers are defined in the IEEE 754 standard, which is described at:

http://en.wikipedia.org/wiki/IEEE_floating_point.

The main issue is that as the magnitude of the number grows, precision goes down. This might be fine if you are dealing with some sort of sensor data. However, it is totally unacceptable if you happen to work with money. The price of a stamp is the same no matter whether you live in a slum or happen to be a billionaire; it is always an exact, fixed number. With floating point, it would be nearly impossible to achieve that.
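To see this effect, repeat the earlier subtraction with a much larger number. Since double precision only carries around 15 significant digits, the fractional part is simply lost (a small sketch):

test=# SELECT '100000000000000000.1'::float8 - '100000000000000000'::float8;
 ?column? 
----------
        0
(1 row)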

To get around the chip's FPU (short for floating-point unit), developers and designers alike can turn to numeric. The beauty of numeric is that it relies entirely on the CPU's integer unit. It offers fixed, predefined, standard behavior for handling numbers of arbitrary precision. It is perfect for exact calculations, as the following example shows:

test=# SELECT '1000.1'::numeric - '1000';
 ?column?
----------
      0.1
(1 row)

The numeric option should be used for financial data and all other applications requiring precision.

Note

The numeric type is a lot slower than floating-point-based data types. However, it's not about speed; it's about the correctness of your results. A correct result is usually a lot more important than a fast one.
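For financial data, it usually also makes sense to pin down precision and scale explicitly. As a small sketch, numeric(10, 2) allows up to 10 digits in total, 2 of them after the decimal point, and rounds on input:

test=# SELECT '1000.105'::numeric(10, 2);
 numeric 
---------
 1000.11
(1 row)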

boolean fields versus bit fields

In the past, I have seen countless situations in which people used a massive number of boolean columns in a table. This is not especially harmful, but it might not be the best approach either. A table in PostgreSQL can hold up to 1,600 columns; however, keeping 100 boolean columns around for a single scenario is not too handy.

PostgreSQL offers a way out: two data types called bit and varbit can help reduce the number of columns and simplify things. Here is an example:

test=# CREATE TABLE t_test (x bit(10));
CREATE TABLE
test=# INSERT INTO t_test VALUES ('0110000101');
INSERT 0 1
test=# SELECT * FROM t_test;
     x      
------------
 0110000101
(1 row)

In this example, a bit(10) column has been created, that is, a field holding exactly 10 bits.

If you want to retrieve the data, it is possible to simply select the entire column or to retrieve a single bit inside the field (get_bit). Of course, bits can also be modified (set_bit). Here is an example:

test=# SELECT get_bit(x, 5), set_bit(x, 5, 1), x::int FROM t_test;
 get_bit |  set_bit   |  x  
---------+------------+-----
       0 | 0110010101 | 389
(1 row)

If you are not sure how many bits you will need, there is a data type called varbit. Basically, it works just like varchar. Here is an example:

test=# ALTER TABLE t_test ADD COLUMN y varbit(20);
ALTER TABLE

A column that can hold up to 20 bits is added to t_test.
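In contrast to bit(10), the new column accepts bit strings of any length up to 20, as this small sketch shows:

test=# INSERT INTO t_test (y) VALUES ('101'), ('0110000101');
INSERT 0 2
test=# SELECT y FROM t_test WHERE y IS NOT NULL;
     y      
------------
 101
 0110000101
(2 rows)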

As mentioned before, bit and varbit are just options. It is not necessary to use them, but they can help you improve your data structures and prevent trouble.

text versus cidr/inet/circle/point

One common mistake made by many people is to choose a generic data type instead of a specific, optimized data type. Of course, everything can be stored as text somehow, but in most cases, it makes a lot more sense to use the proper data type.

Some of those specialized, optimized types are dealt with in this section.

Let's focus on IP addresses first. To represent an IP address, you have two choices: inet and cidr. Here is an example:

test=# SELECT '192.168.0.34/32'::cidr, '192.168.0.34/32'::inet;
      cidr       |     inet     
-----------------+--------------
 192.168.0.34/32 | 192.168.0.34
(1 row)

The cidr type represents a network specification including its netmask. The inet type stores a host address; the default /32 netmask is implicit and therefore not displayed.
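A practical consequence of this difference is that cidr refuses a network specification that still has host bits set, while inet accepts it. A short sketch:

test=# SELECT '192.168.0.34/24'::inet;
      inet       
-----------------
 192.168.0.34/24
(1 row)

test=# SELECT '192.168.0.34/24'::cidr;
ERROR:  invalid cidr value: "192.168.0.34/24"
DETAIL:  Value has bits set to right of mask.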

By the way, both data types will also work for IPv6:

test=# SELECT '::1'::cidr, '::1'::inet;
  cidr   | inet
---------+------
 ::1/128 | ::1
(1 row)

The beauty here is that a special set of operators is available for these types, and you can even do math with them, as the following example shows:

test=# SELECT '::1'::cidr + 43242342;
    ?column?     
-----------------
 ::2.147.211.103
(1 row)
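Another useful operator is <<, which checks whether an address is contained in a given network:

test=# SELECT '192.168.0.34'::inet << '192.168.0.0/24'::cidr;
 ?column? 
----------
 t
(1 row)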

Besides data types related to IPs, there are a couple of types related to geometric objects. A common mistake is that people store geographical positions in two columns:

test=# CREATE TABLE t_location (
  loc   text,
  x   numeric,
  y   numeric
);
CREATE TABLE

Of course, you can always use two fields, but querying this kind of data structure is a really nasty task. Just imagine searching for all objects within 50 kilometers of a given location. With two separate columns, this is a complicated and inefficient query to write.

The standard PostgreSQL distribution, even without PostGIS (a popular GIS extension) installed, offers geometric data types such as points, circles, rectangles, paths, and so on. To make the previous table more efficient, the following can be done:

test=# DROP TABLE t_location;
DROP TABLE
test=# CREATE TABLE t_location (loc text, coord point);
CREATE TABLE

Points can be used easily. Here is an example:

test=# SELECT '(10, 30)'::point;
  point  
---------
 (10,30)
(1 row)

The really nice thing here is that there is quite a lot of useful functionality around. If you want to determine the distance between two points, you can make use of the distance operator (<->):

test=# SELECT '(10, 30)'::point <-> '(20, 20)'::point;
    ?column?     
-----------------
 14.142135623731
(1 row)

You could have done that with a simple formula, but why bother? PostgreSQL already provides all the functionality for you. Therefore, using an optimized, predefined data type is highly beneficial.
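To come back to the 50-kilometer example from before: with a point column, such a radius search boils down to a single distance comparison. Keep in mind that <-> returns the plain Euclidean distance, so for real latitude/longitude data the result is in degrees rather than kilometers; the following is merely a sketch of the pattern (the table is still empty here):

test=# SELECT loc FROM t_location WHERE coord <-> '(20, 20)'::point < 50;
 loc 
-----
(0 rows)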
