PostgreSQL database components

A PostgreSQL database can be considered a container for database schemas; the database must contain at least one schema. A database schema is used to organize database objects in a manner similar to namespaces in high-level programming languages.

Schema

Object names can be reused in different schemas without conflict. A schema contains all the named database objects, including tables, views, functions, aggregates, indexes, sequences, triggers, data types, domains, and ranges.

PostgreSQL schema as a database object container

By default, there is a schema called public in the template databases; therefore, all newly created databases also contain this schema. By default, all users can access this schema implicitly; again, this is inherited from the template databases. Allowing this access pattern simulates a situation where the server is not schema-aware. This is useful in small companies where there is no need for complex security. It also enables a smooth transition from non-schema-aware databases.

Warning

In a multiuser, multidatabase environment, remember to revoke the ability of all users to create objects in the public schema. This is done with the following command in the newly created database, or in the template1 database:

REVOKE CREATE ON SCHEMA public FROM PUBLIC;

When a user wants to access a certain object, the schema name and the object name need to be specified, separated by a period (.). If the database search_path setting does not contain the schema name, or if the developer prefers to use fully qualified names (for example, to select all the entries of pg_database in the pg_catalog schema), one needs to write the following command:

SELECT * FROM pg_catalog.pg_database;

Alternatively, you can use the following command:

TABLE pg_catalog.pg_database;

Qualified database object names are sometimes tedious to write, so many developers prefer to use unqualified object names, which consist of only the object name without the schema. PostgreSQL provides the search_path setting, which is similar to the using directive in the C++ language. The search path is composed of the schemas that the server uses to look up objects. The default search path, as shown in the following code, is $user, public. If there is a schema with the same name as the user, it is searched first when looking up objects or creating new objects. If the object is not found in any of the schemas listed in search_path, an error is thrown:

SHOW search_path;
--------------
 $user,public
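
The search path can also be changed. For example, to resolve objects in a specific application schema without qualification, one could run something like the following; this is only a sketch, and the schema name car_portal_app is the one created later in this chapter:

SET search_path TO car_portal_app, public;
-- Make the setting persistent for a specific role (takes effect in new sessions)
ALTER ROLE car_portal_app SET search_path = car_portal_app, public;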

Schema usages

Schemas are used for the following reasons:

  • Control authorization: In a multi-user database environment, one can use schemas to group objects based on roles.
  • Organize database objects: One can organize the database objects in groups based on the business logic. For example, historical and auditing data could be logically grouped and organized in a specific schema.
  • Maintain third-party SQL code: The extensions available in the contribution package can be used with several applications. Maintaining these extensions in separate schemas enables the developer to reuse these extensions, and to update them easily.

In the car web portal, let us assume that we would like to create a schema named car_portal_app, owned by car_portal_app role. This can be done as follows:

CREATE SCHEMA car_portal_app AUTHORIZATION car_portal_app;
--If the schema name is not given, the name of the role is used as the schema name
CREATE SCHEMA AUTHORIZATION car_portal_app;

For more information about the syntax of the CREATE SCHEMA command, one can use the psql \h meta-command, which displays the inline help of the psql client tool, or take a look at the PostgreSQL manual at http://www.postgresql.org/docs/current/static/sql-createschema.html.

Table

The CREATE TABLE SQL statement is very rich. It can be used for several purposes, such as cloning a table, which is handy during database refactoring for creating an uninstallation script to roll back changes. It can also be used to materialize the result of a SELECT SQL statement to boost performance, or to store data temporarily for later use.

Tip

PostgreSQL uses tables internally to model views and sequences.

In PostgreSQL, tables can be of different types:

  • Permanent table: The table life cycle starts with table creation and ends with table dropping.
  • Temporary table: The table life cycle is the user session. This is used often with procedural languages to model some business logic.
  • Unlogged table: Operations on unlogged tables are much faster than on permanent tables, because data is not written into the WAL files. Unlogged tables are not crash-safe. Also, since streaming replication is based on shipping the log files, unlogged tables cannot be replicated to the slave node.
  • Child table: A child table is a table that inherits from one or more tables. Inheritance is often used with constraint exclusion to physically partition the data on the hard disk and to gain performance when retrieving a subset of data that has a certain value (see the sketch below).

The CREATE TABLE syntax is quite long; the full syntax can be found at http://www.postgresql.org/docs/current/static/sql-createtable.html. The CREATE TABLE SQL command normally requires the following input:

  • The name of the created table.
  • The table type.
  • The table storage parameters. These parameters are used to control the table storage allocation and several other administrative tasks.
  • The table columns, including the data types, default values, and constraints.
  • The cloned table name and the options to clone the table.
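
To illustrate the temporary, unlogged, and child table types listed above, here is a minimal sketch; the table and column names are made up for demonstration:

-- Temporary table: dropped automatically at the end of the session
CREATE TEMPORARY TABLE temp_prices (
  price numeric
);
-- Unlogged table: faster writes, not crash-safe, not replicated
CREATE UNLOGGED TABLE page_visits (
  page text,
  visited_at timestamptz
);
-- Child table: inherits all the columns of the parent table
CREATE TABLE archived_page_visits () INHERITS (page_visits);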

PostgreSQL native data types

When designing a database table, one should take care to pick the appropriate data type. Once the database is in production, changing the data type of a column might become a very costly operation, especially for heavily loaded tables. The cost often comes from locking the table and, in some cases, rewriting it. When picking a data type, consider the balance between the following factors:

  • Extensibility: Can the maximum length of a type be increased or decreased without a full table rewrite and a full table scan?
  • Data type size: Going for a safe option, such as choosing big integers instead of integers, causes more storage consumption.
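
To make the cost concrete, widening a column later is done with a statement such as the following sketch (the table and column names are hypothetical); a type change like integer to bigint takes an ACCESS EXCLUSIVE lock and rewrites the whole table:

ALTER TABLE invoice ALTER COLUMN amount TYPE bigint;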

PostgreSQL provides a very extensive set of data types. Some of the native data type categories are:

  • Numeric type
  • Character type
  • Date and time types

These data types are almost common for all relational databases. Moreover, they are often sufficient for modeling traditional applications.

Numeric types

The following table shows various numeric types:

Name               | Comments                                          | Size     | Range
-------------------|---------------------------------------------------|----------|-------------------------------------------------
smallint           | SQL equivalent: int2                              | 2 bytes  | -32768 to +32767
int                | SQL equivalent: int4; integer is an alias for int | 4 bytes  | -2147483648 to +2147483647
bigint             | SQL equivalent: int8                              | 8 bytes  | -9223372036854775808 to +9223372036854775807
numeric or decimal | No difference in PostgreSQL                       | Variable | Up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
real               | Special values: Infinity, -Infinity, NaN          | 4 bytes  | Platform dependent, at least 6-digit precision; often 1E-37 to 1E+37
double precision   | Special values: Infinity, -Infinity, NaN          | 8 bytes  | Platform dependent, at least 15-digit precision; often 1E-307 to 1E+308

Numeric data types

PostgreSQL supports various mathematical operators and functions, such as geometric functions and bitwise operations. The smallint data type can be used to save disk space, while bigint can be used if the integer range is not sufficient.

Serial types, namely smallserial, serial, and bigserial, are wrappers on top of smallint, int, and bigint respectively. Serial types are often used as surrogate keys, and by default they are not allowed to have a null value. Serial types utilize sequences behind the scenes. A sequence is a database object that generates sequential numbers according to the specified minimum, maximum, and increment values.

For example, the following code creates a table customer with a column customer_id:

CREATE TABLE customer (
    customer_id SERIAL
);

This will generate the following code behind the scenes:

CREATE SEQUENCE customer_customer_id_seq;
CREATE TABLE customer (
  customer_id integer NOT NULL DEFAULT nextval('customer_customer_id_seq')
);
ALTER SEQUENCE customer_customer_id_seq OWNED BY customer.customer_id;

When creating a column with type serial, remember the following things:

  • A sequence will be created with the name tableName_columnName_seq. In the preceding example, the sequence name is customer_customer_id_seq.
  • The column will have a NOT NULL constraint.
  • The column will have a default value generated by the function nextval().
  • The sequence will be owned by the column, which means that the sequence will be dropped automatically if the column is dropped.

Tip

The preceding example shows how PostgreSQL names an object if the object name is not specified explicitly. PostgreSQL names objects using the {tablename}_{columnname(s)}_{suffix} pattern, where the suffixes pkey, key, excl, idx, fkey, and check stand for a primary key constraint, a unique constraint, an exclusion constraint, an index, a foreign key constraint, and a check constraint respectively.

A common mistake when using the serial type is forgetting to grant proper permissions to the generated sequence.
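
For example, a role that inserts into the customer table shown earlier also needs privileges on the backing sequence; a minimal sketch, assuming a role named car_portal_app:

GRANT USAGE, SELECT ON SEQUENCE customer_customer_id_seq TO car_portal_app;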

Similar to the C language, the result of an integer expression is also an integer, so the results of the mathematical operations 3/2 and 1/3 are 1 and 0 respectively; the fractional part is always truncated. Unlike the C language, PostgreSQL rounds the number when casting a fractional value to int:

postgres=# SELECT CAST (5.9 AS INT) AS rounded_up, CAST(5.1 AS INTEGER) AS rounded_down, 5.5::INT AS another_syntax;
 rounded_up | rounded_down | another_syntax
------------+--------------+----------------
          6 |            5 |              6
(1 row)

postgres=# SELECT 2/3 AS "2/3", 1/3 AS "1/3", 3/2 AS "3/2";
 2/3 | 1/3 | 3/2
-----+-----+-----
   0 |   0 |   1
(1 row)
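
To obtain a fractional result, at least one operand must be of a fractional type, for example by casting:

SELECT 1/3.0 AS "1/3.0", 1::numeric/3 AS "1::numeric/3"; -- both return a numeric result close to 0.333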

The numeric and decimal types are recommended for storing monetary and other amounts where precision is required. There are three forms for defining a numeric or a decimal value:

  • Numeric (precision, scale)
  • Numeric (precision)
  • Numeric

Precision is the total number of digits, while scale is the number of digits of the fraction part. For example, the number 12.344 has a precision of five and a scale of three. If a numeric type is used to define a column type without precision or scale, then the column can store any number with any precision and scale.
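
For instance, a value of type numeric(5,3) accepts at most two digits before the decimal point and rounds the fractional part to three digits; the following is a quick illustration:

SELECT 12.3456::numeric(5,3);  -- returns 12.346
SELECT 1234.5::numeric(5,3);   -- fails with a numeric field overflow error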

Tip

If precision is not required, do not use the numeric and decimal types; operations on numeric types are slower than on real and double precision.

The real and double precision types are inexact; this means that some values cannot be represented exactly in the internal binary format and are stored as approximations.

Character types

The following table shows various character types:

Name       | Comments                                                                                                                     | Trailing spaces            | Maximum length
-----------|------------------------------------------------------------------------------------------------------------------------------|----------------------------|------------------------------
"char"     | Equivalent to char(1); it must be quoted as shown in the name                                                                | Semantically insignificant | 1
name       | Equivalent to varchar(64); used by PostgreSQL for object names                                                               | Semantically significant   | 64
char(n)    | Alias: character(n); fixed-length character string where the length is n; internally called bpchar (blank-padded character) | Semantically insignificant | 1 to 10485760 (10*1024*1024)
varchar(n) | Alias: character varying(n); variable-length character string where the maximum length is n                                 | Semantically significant   | 1 to 10485760
text       | Variable-length character string                                                                                             | Semantically significant   | Unlimited

Character data types

PostgreSQL provides two general text types, the char(n) and varchar(n) data types, where n is the number of characters allowed. In the char data type, if a value is shorter than the specified length, trailing spaces are padded at the end of the value. Operations on char data types ignore the trailing spaces. Take a look at the following example:

postgres=# SELECT 'a'::CHAR(2) = 'a '::CHAR(2);
 ?column?
----------
 t
(1 row)

postgres=# SELECT length('a         '::CHAR(10));
 length
--------
      1
(1 row)

Due to the different treatment of trailing spaces, it is not recommended to perform comparison operations that mix char strings with varchar or text strings.

For both char and varchar data types, if the string is longer than the maximum allowed length, then:

  • An error will be raised in the case of insert or update unless the extra characters are all spaces. In the latter case, the string will be truncated.
  • In the case of casting, extra characters will be truncated automatically without raising an error.
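
The following sketch demonstrates both behaviors; the table name is made up:

CREATE TABLE truncation_test (val VARCHAR(2));
INSERT INTO truncation_test VALUES ('abc');  -- ERROR: value too long for type character varying(2)
INSERT INTO truncation_test VALUES ('ab  '); -- accepted: the extra characters are spaces and are truncated
SELECT 'abc'::VARCHAR(2);                    -- cast: silently truncated to 'ab'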

The following example shows how mixing different data types might cause problems:

postgres=# SELECT 'a '::VARCHAR(2)='a '::text;
 ?column?
----------
 t
(1 row)
postgres=# SELECT 'a '::CHAR(2)='a '::text;
 ?column?
----------
 f
(1 row)
postgres=# SELECT 'a '::CHAR(2)='a '::VARCHAR(2);
 ?column?
----------
 t
(1 row)
postgres=# SELECT length ('a '::CHAR(2));
 length
--------
      1
(1 row)
postgres=# SELECT length ('a '::VARCHAR(2));
 length
--------
      2
(1 row)

The preceding example shows that 'a '::CHAR(2) equals 'a '::VARCHAR(2), yet the two have different lengths, which is not logical. It also shows that 'a '::CHAR(2) is not equal to 'a '::text, while 'a '::VARCHAR(2) is equal to 'a '::text. This is confusing because it breaks the transitivity of equality: if a equals b and b equals c, then a should equal c.

The PostgreSQL text storage size depends on several factors, namely the length of the text value, and the text encoding and compression. The text data type can be considered an unlimited varchar type. The maximum text size that can be stored is 1 GB, which is the maximum column size.

For fixed-length strings, the character and character varying data types consume the same amount of hard disk space. For variable-length strings, the character varying data type consumes less space, because the character type pads the string with spaces. The following code shows the storage consumption for fixed-length and variable-length texts for the character and character varying data types. It simply creates two tables, populates them with randomly generated strings of fixed and variable lengths, and finally gets the table sizes in a human-readable form:

-- Create the tables
CREATE TABLE char_size_test (
  size CHAR(10)
);
CREATE TABLE varchar_size_test (
  size VARCHAR(10)
);
-- Insert one million fixed-length strings into both tables
WITH test_data AS (
  SELECT substring(md5(random()::text), 1, 10) FROM generate_series(1, 1000000)
),
char_data_insert AS (
  INSERT INTO char_size_test SELECT * FROM test_data
)
INSERT INTO varchar_size_test SELECT * FROM test_data;
-- Get the table sizes in human readable form
SELECT pg_size_pretty(pg_relation_size('char_size_test')) AS char_size_test, pg_size_pretty(pg_relation_size('varchar_size_test')) AS varchar_size_test;
-- Delete the tables' data
TRUNCATE char_size_test;
TRUNCATE varchar_size_test;
-- Insert one million variable-length strings into both tables
WITH test_data AS (
  SELECT substring(md5(random()::text), 1, (random() * 10)::int) FROM generate_series(1, 1000000)
),
char_data_insert AS (
  INSERT INTO char_size_test SELECT * FROM test_data
)
INSERT INTO varchar_size_test SELECT * FROM test_data;

SELECT pg_size_pretty(pg_relation_size('char_size_test')) AS char_size_test, pg_size_pretty(pg_relation_size('varchar_size_test')) AS varchar_size_test;

The varchar data type can be emulated by the text data type and a check constraint to check the text length. For example, the following code snippets are semantically equivalent:

CREATE TABLE emulate_varchar(
test VARCHAR(4)
);
--semantically equivalent to
CREATE TABLE emulate_varchar (
test TEXT,
CONSTRAINT test_length CHECK (length(test) <= 4)
);

In PostgreSQL, there is no performance difference between the different character types, so it is recommended to use the text data type. It allows the developer to react quickly to changes in business requirements. For example, a common business case is changing the text length, such as changing the length of a customer ticket number from six to eight characters due to a length limitation, or changing how certain information is stored in the database. In such a scenario, if the data type is text, this can be done by amending the check constraint without altering the table structure, as sketched below.
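
For example, relaxing the length limit of the emulated varchar above from four to eight characters could be sketched as follows:

ALTER TABLE emulate_varchar DROP CONSTRAINT test_length;
ALTER TABLE emulate_varchar ADD CONSTRAINT test_length CHECK (length(test) <= 8);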

Date and time types

The date and time data types are commonly used to describe the occurrence of events such as birth date. PostgreSQL supports the following date and time types:

Name                                                  | Storage size | Description                      | Low value        | High value
------------------------------------------------------|--------------|----------------------------------|------------------|-----------------
timestamp without time zone (equivalent to timestamp) | 8 bytes      | Date and time, without time zone | 4713 BC          | 294276 AD
timestamp with time zone (equivalent to timestamptz)  | 8 bytes      | Date and time, with time zone    | 4713 BC          | 294276 AD
date                                                  | 4 bytes      | Date only                        | 4713 BC          | 5874897 AD
time without time zone                                | 8 bytes      | Time of day                      | 00:00:00         | 24:00:00
time with time zone                                   | 12 bytes     | Time of day only, with time zone | 00:00:00+1459    | 24:00:00-1459
interval [ fields ]                                   | 16 bytes     | Time interval                    | -178000000 years | 178000000 years

Date/time data types

PostgreSQL stores timestamps, with and without time zone, internally in Coordinated Universal Time (UTC); only the point in time is stored, not the time zone itself. This explains the identical storage size for timestamp with time zone and timestamp without time zone.

There are two approaches for handling timestamps correctly. The first approach is to use timestamp without time zone and let the client side handle the time zone differences. This is useful for in-house development, for applications with only one time zone, and when the clients know the time zone differences.

The other approach is to use timestamp with time zone. The following are some of the best practices to avoid the common pitfalls when using timestamptz:

  • Make sure to set the default time zone for all connections. This is done by setting the time zone configuration in the postgresql.conf file. Since PostgreSQL stores the timestamp with the time zone in UTC format internally, it is a good practice to set the default connection to UTC as well. Also, UTC helps in overcoming the potential problems due to Daylight Saving Time (DST).
  • The time zone should be specified in each CRUD operation (see the sketch after this list).
  • Do not mix timestamp without time zone and timestamp with time zone in expressions; this will normally lead to wrong results due to implicit conversion.
  • Do not invent your own conversion; instead, use the database server to convert between the different time zones.
  • Investigate the data types of high level languages to determine which type could be used with PostgreSQL without extra handling.
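
The following sketch illustrates the first two points; the meeting table and its starts_at timestamptz column are hypothetical:

SET TIME ZONE 'UTC';
INSERT INTO meeting (starts_at) VALUES (TIMESTAMPTZ '2014-09-01 10:00:00+02');
SELECT starts_at FROM meeting;  -- rendered in the session time zone: 2014-09-01 08:00:00+00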

PostgreSQL has two important settings: timezone and datestyle. The datestyle has two purposes:

  • Setting the display format: The datestyle setting specifies the rendering style of timestamp and timestamptz values.
  • Interpreting ambiguous data: The datestyle setting specifies how ambiguous date input is interpreted (see the example after this list).
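
For example, the same literal is interpreted differently depending on the datestyle setting; with the default 'ISO, MDY', the literal below would instead be read as January 2:

postgres=# SET datestyle TO 'ISO, DMY';
SET
postgres=# SELECT '01/02/2014'::date;
    date
------------
 2014-02-01
(1 row)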

The views pg_timezone_names and pg_timezone_abbrevs provide a list of the time zone names and abbreviations respectively. They also provide information about the offset from UTC, and whether the time zone observes daylight saving time (DST). For example, the following code snippet sets the timezone setting to Jerusalem, and then retrieves the local date and time in Jerusalem:

postgres=> SET timezone TO 'Asia/Jerusalem';
SET
postgres=> SELECT now();
              now              
-------------------------------
 2014-08-27 23:49:49.611633+03
(1 row)

The PostgreSQL AT TIME ZONE statement converts the timestamp with or without the timezone to a specified time zone; its behavior depends on the converted type. The following example clarifies this construct:

postgres=> SHOW timezone;
 TimeZone
----------
 UTC
(1 row)

postgres=> \x
Expanded display is on.
postgres=> SELECT now(), now()::timestamp, now() AT TIME ZONE 'CST', now()::timestamp AT TIME ZONE 'CST';
-[ RECORD 1 ]--------------------------
now      | 2014-08-27 21:00:18.36009+00
now      | 2014-08-27 21:00:18.36009
timezone | 2014-08-27 15:00:18.36009
timezone | 2014-08-28 03:00:18.36009+00

The function now() returns the current timestamp with time zone, rendered in UTC per the session setting. Notice that the time zone offset is +00. When casting the timestamp with time zone to timestamp, as in now()::timestamp, the time zone offset is truncated. The expression now() AT TIME ZONE 'CST' converts the timestamp with time zone from UTC to a timestamp in the specified time zone, CST. Since the central standard time offset is -6, six hours are deducted. The last expression, now()::timestamp AT TIME ZONE 'CST', interprets the timestamp as being in the CST time zone and converts it to the session's default time zone, UTC. So, the last expression is equivalent to the following:

postgres=> SELECT ('2014-08-27 21:00:18.36009'::timestamp AT time zone 'CST' AT TIME ZONE 'UTC')::timestamptz;
-[ RECORD 1 ]--------------------------
timezone | 2014-08-28 03:00:18.36009+00

One can summarize the conversion between the timestamp with and without the time zone, as follows:

  • The expression timestamp without time zone AT TIME ZONE x is interpreted as follows: the timestamp will be converted from the time zone x to the session time zone.
  • The expression timestamp with time zone AT TIME ZONE x converts a timestamptz into a timestamp at the specified time zone x. The final result type is timestamp.

The date data type is recommended when there is no need to specify the time, such as for birth dates, holidays, absence days, and so on.

The time with time zone storage size is 12 bytes: 8 bytes are used to store the time, and 4 bytes are used to store the time zone. The time without time zone type consumes only 8 bytes. Conversions between time zones can be made using the AT TIME ZONE construct.
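
For example, converting a time with time zone value to UTC can be sketched as follows:

SELECT TIME WITH TIME ZONE '10:00:00+02' AT TIME ZONE 'UTC';  -- returns 08:00:00+00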

Finally, the interval data type is very important for handling timestamp operations as well as for describing some business cases. From the point of view of functional requirements, the interval data type can represent a period of time, such as the estimated time to complete a certain task. The result type of the basic arithmetic operations, such as + and -, on timestamp, timestamptz, time, and time with time zone is of the type interval. The result of the same operations on the date type is an integer (a number of days). The following example shows timestamptz and date subtraction. Notice the format used to specify intervals:

SELECT '2014-09-01 23:30:00.000000+00'::timestamptz -'2014-09-01 22:00:00.000000+00'::timestamptz = Interval '1 hour, 30 minutes';
 ?column?
----------
 t
(1 row)
postgres=> SELECT '11-10-2014'::date -'10-10-2014'::date = 1;
 ?column?
----------
 t
(1 row)

The car web portal database

At this stage, one can convert the logical model of the car web portal presented in Chapter 1, Relational Databases to a physical model. To help the developer to create a table, one can follow this minimal check list:

  • What is the primary key?
  • What is the default value for each column?
  • What is the type of each column?
  • What are the constraints on each column or set of columns?
  • Are permissions set correctly on tables, sequences, and schemas?
  • Are foreign keys specified with the proper actions?
  • What is the data life cycle?
  • What are the operations allowed on the data?

For creating the car web portal schema, the formal relational model will not be applied strictly. Also, surrogate keys will be used instead of natural keys for the following reasons:

  • Natural keys can change; one can change the current e-mail address to another one. Using a surrogate key guarantees that if a row is referenced by another row, then this reference is not lost, because the surrogate key has not changed.
  • Incorrect assumptions about natural keys: let us take the e-mail address as an example. The general assumption about an e-mail address is that it identifies a person uniquely. This is not true; some e-mail service providers set policies such as e-mail expiration based on activity. Private companies might have general e-mail addresses such as contact@.., support@..., and so on. The same applies to phone and mobile numbers.
  • Surrogate keys can be used to support a temporal database design within the relational database world. For example, some companies have a very strict security requirement, and data should be versioned for each operation.
  • Surrogate keys often use compact data types such as integers. This allows for better performance than composite natural keys.
  • Surrogate keys can be used in PostgreSQL to eliminate the effect of cross column statistic limitation. PostgreSQL collects statistics per single column. In some cases, this is not convenient because columns might be correlated. In this case, PostgreSQL gives a wrong estimation to the planner, and thus, imperfect execution plans are generated.
  • Surrogate keys are better supported than natural keys by object-relational mappers such as Hibernate.

Despite all these advantages, surrogate keys also have a few disadvantages:

  • A surrogate key is auto-generated, and the generation might give different results in different environments. For example, if the same data is inserted into a test database and a staging database, the generated key values might differ, so a comparison of the two databases reports that the data is not identical.
  • A surrogate key is not descriptive. From the communication point of view, it is easier to refer to a person by a name instead of an auto generated number.

In the car web portal ER diagram, there is an entity with the name user. Since user is a reserved keyword, the name account will be used for the table. Note that to create a database object whose name is a PostgreSQL keyword, the name should be quoted. The following example shows what happens when creating a table named user:

postgres=# \set VERBOSITY verbose
postgres=# CREATE TABLE user AS SELECT 1;
ERROR:  42601: syntax error at or near "user"
LINE 1: CREATE TABLE user AS SELECT 1;
                     ^
LOCATION:  scanner_yyerror, src\backend\parser\scan.l:1053
postgres=# CREATE TABLE "user" AS SELECT 1;
SELECT 1
postgres=#

You can find the full list of reserved words at http://www.postgresql.org/docs/current/static/sql-keywords-appendix.html.

Tip

In the preceding example, the VERBOSITY setting for psql can be used to show error codes. Error codes are useful in detecting errors and trapping exceptions.

To create a table account, one can execute the following command:

CREATE TABLE account (
  account_id SERIAL PRIMARY KEY,
  first_name TEXT NOT NULL,
  last_name TEXT NOT NULL,
  email TEXT NOT NULL UNIQUE,
  password TEXT NOT NULL,
  CHECK(first_name !~ '\s' AND last_name !~ '\s'),
  CHECK (email ~* '^\w+@\w+[.]\w+$'),
  CHECK (char_length(password)>=8)
);

To summarize the user table:

  • The account_id is defined as the primary key with type serial. The account_id is naturally unique and not null.
  • The attributes first_name, last_name, email, and password are not allowed to have null values.
  • The first_name and the last_name attributes are not allowed to have spaces.
  • The password should be at least eight characters in length. In reality, the password length is handled in business logic, since passwords should not be stored in a plain text format in the database. For more information about securing the data, have a look at Chapter 8, PostgreSQL Security.
  • The e-mail should match a certain regular expression. Note that the e-mail regular expression used here is really simplistic.

Behind the scene, the following objects are created:

  • A sequence to emulate the serial type
  • Two unique indexes: the first enforces the primary key account_id, and the second enforces the uniqueness of the e-mail address (see the query after this list).
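
One way to verify this is to query the pg_indexes catalog view; the generated index names follow the naming pattern described earlier in this chapter:

SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'account';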

To create the seller account, one can execute the following statement:

CREATE TABLE seller_account (
  seller_account_id SERIAL PRIMARY KEY,
  account_id INT UNIQUE NOT NULL REFERENCES account(account_id),
  number_of_advertisement INT DEFAULT 0,
  user_ranking float,  
  total_rank float
);

As we can see, the seller account has a one-to-one relationship with the account. This is enforced by the NOT NULL and UNIQUE constraints on account_id. Alternatively, one can model the seller account as follows, by marking account_id as the primary key:

CREATE TABLE seller_account (
  account_id INT PRIMARY KEY REFERENCES account(account_id)
...
  );

The first design is more flexible and less ambiguous. First of all, the requirements might change, and the relation between the user account and the seller account might change from one-to-one to one-to-many, for example, if the user concept is generalized to handle companies, where a company has several seller accounts.

Also, if a table references both the account and seller_account tables using a common column name such as account_id, then it will be difficult to distinguish which table is being referenced.
