When a big database is shared by many applications, it is sometimes hard to understand who is using what, and what would happen if the database schema changes. In that case, it makes sense to build the whole system using layered architecture. The physical data structure is located at the bottom-most layer. Applications do not access it directly.
Moving upwards from the bottom, the second last layer contains structures that abstract logical entities from their physical implementation. These structures play the role of data abstraction interfaces. There are several ways to implement them. They can be created in the database as functions. In that case, applications will work with the data by invoking them. Another approach is by using updatable views. In that case, applications can access the logical entities with conventional SQL statements. Additionally, this interface can be implemented outside the database as a lightweight service processing the requests of high level systems, performing queries, and making changes to the database. Each approach has its own set of benefits and drawbacks.
At the top are the applications that implement business logic. They do not care about the physical structure of the data, and interact with the database through data abstraction interfaces.
This approach reduces the number of agents that access the physical data structures, and is clearly able to show how the database is used or can be used. The database documentation contains the specification of those interfaces. So the database developer, when working on refactoring the database schema, should only make sure that the interfaces follow the specification, and until they do, the rest of the database is "free to change".
The existence of these interfaces makes its easier to develop unit tests: it is clear what to test and how, since the specification is given.
The easiest way to create a database abstraction interface is by using views to access the data. In this case, if one wants to change the table structure, it can be done without changing the code of the external applications: the only thing necessary is updating the definitions of the interface views. However, it is important to check if the new view returns the same data as the old one.
The situation is the easiest when it is possible to implement a view in the same database. One just needs to create a copy of the production database in the test environment, or prepare a test database containing all possible combinations of the attributes of the business entities. Then the new version of the view can be deployed with a different name. The following query can then be used to see if the new view returns the same data:
WITH n AS (SELECT * FROM new_view), o AS (SELECT * FROM old_view) SELECT 'new', * FROM (SELECT * FROM n EXCEPT ALL SELECT * FROM o) a UNION ALL SELECT 'old', * FROM (SELECT * FROM o EXCEPT ALL SELECT * FROM n) b;
new_view
and old_view
refer to the names of the respective relations. The query returns no rows if both the views return the same result.
However, this works only when both views are in the same database, and the old view works as it worked before the refactoring. In case the structure of underlying tables changes, the old view cannot work as it did before, and the comparison in that case is not applicable. This problem can be solved by creating a temporary table from the data returned by the old view before refactoring, and then comparing that temporary table with the new view.
This can also be done by comparing the data from different databases: the old one before refactoring and the new one. One can use external tools to do so. For example, data from both the databases can be dumped into files using psql, and then those files can be compared using diff
(this will work only if the rows have the same order). There are some commercial tools as well, which provide this functionality.
Another approach is connecting two databases, making queries, and making the comparison inside the database. This might seem complicated, but in fact, it is the fastest and most reliable way. There are a couple of methods to connect two databases: through the extensions dblink
(database link) or postgres_fdw
(foreign data wrapper).
Using the dblink
extension may seem easier than using postgres_fdw,
and it allows performing different queries for different objects. However, this technology is older, it uses a syntax that is not standard-compliant, and has performance issues, especially when big tables or views are queried.
On the other hand, postgres_fdw
requires creating an object in the local database for each object in the remote database that is going to be accessed, which is not that convenient. However, that makes it easy to use the remote tables together with the local tables in queries, and it is faster.
In the example in the previous section, another database was created from the original database car_portal
, and another field was added to the table car_portal_app.car
. Let's try to find out if that operation caused changes in the data.
First, connect to the new database as a super user:
user@host:~$ psql -h localhost -U postgres car_portal_new psql (9.4.0) Type "help" for help. car_portal_new=#
Then create an extension for foreign data wrapper. The binaries for the extension are included in the PostgreSQL server package.
car_portal_new=# CREATE EXTENSION postgres_fdw ;
Once the extension has been created, create a server object and a user mapping:
car_portal_new=# CREATE SERVER car_portal_original FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'car_portal'); car_portal_new=# CREATE USER MAPPING FOR CURRENT_USER SERVER car_portal_original;
As the last step, create a foreign table:
car_portal_new=# CREATE FOREIGN TABLE car_portal_app.car_orignal (car_id int, number_of_owners int, registration_number text, manufacture_year int, number_of_doors int, car_model_id int, mileage int) SERVER car_portal_original OPTIONS (table_name 'car'); car_portal_new=# SELECT * FROM car_portal_app.car_orignal ; car_id | number_of_owners | registration_number | manufacture_year | number_of_doors | car_model_id | mileage --------+------------------+---------------------+------------------+-----------------+--------------+--------- 1 | 3 | MUWH4675 | 2008 | 5 | 65 | 67756 2 | 1 | VSVW4565 | 2014 | 3 | 61 | 6616 3 | 1 | BKUN9615 | 2014 | 5 | 19 | 48221 ... (229 rows)
Now the table is ready, and it can be queried. To compare the data, the same query can be used as an example for the old and new views:
WITH n AS (SELECT car_id, number_of_owners, registration_number, manufacture_year, number_of_doors, car_model_id, mileage FROM car_portal_app.car), o AS (SELECT * FROM car_portal_app.car_orignal) SELECT 'new', * FROM (SELECT * FROM n EXCEPT ALL SELECT * FROM o) a UNION ALL SELECT 'old', * FROM (SELECT * FROM o EXCEPT ALL SELECT * FROM n) b; ?column? | car_id | number_of_owners | registration_number | manufacture_year | number_of_doors | car_model_id | mileage ----------+--------+------------------+---------------------+------------------+-----------------+--------------+--------- (0 rows)
The result is zero rows, which means the data in both the tables is the same.
3.15.203.124