Cleaning up data in the database

Cleaning up data is an important topic; often, the data life cycle is not defined when a database application is created, which leads to large amounts of outdated data. Unclean data hinders several processes, such as database refactoring, and can have side effects across the whole company, such as incorrect report results, billing issues, unauthorized access, and so on.

Getting ready

Several recipes have already been introduced to determine unused objects, but that is not the whole story: the data itself should also be cleaned, and the data life cycle should be defined.

There are several scenarios for unclean data; however, let's focus here only on duplicated rows that result from missing unique and primary key constraints.

How to do it…

The first step is to identify the tables that have neither a primary key nor a unique constraint. This is quite easy using the information schema, as follows:

SELECT
  table_catalog, table_schema, table_name
FROM
  information_schema.tables
WHERE
  table_schema NOT IN ('information_schema', 'pg_catalog')
EXCEPT
SELECT
  table_catalog, table_schema, table_name
FROM
  information_schema.table_constraints
WHERE
  constraint_type IN ('PRIMARY KEY', 'UNIQUE') AND
  table_schema NOT IN ('information_schema', 'pg_catalog');
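
For example, on a test database where only the duplicate table created in the next step lacks such constraints, the result might look as follows (the catalog name depends on your database):

 table_catalog | table_schema | table_name
---------------+--------------+------------
 postgres      | public       | duplicate
(1 row)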

The second step is to identify the tables that actually contain duplicates; this can be done by grouping the rows and counting the occurrences of each combination. To check this, let's create a table with some duplicates in it, as follows (the generated values are random, so your output will differ):

CREATE TABLE duplicate AS
SELECT
  (random() * 9 + 1)::int AS f1,
  (random() * 9 + 1)::int AS f2
FROM generate_series(1, 20);

SELECT count(*), f1, f2 FROM duplicate GROUP BY f1, f2 HAVING count(*) > 1 LIMIT 3;
 count | f1 | f2
-------+----+----
     3 |  7 |  7
     2 |  7 |  2
     2 |  7 |  6
(3 rows)

The tricky part is deleting the duplicates, as the rows are identical. To delete them, one row in each group of duplicates needs to be marked to stay, and the rest need to be deleted. This can be achieved using the ctid column. In PostgreSQL, each row has a header, and the ctid column gives the physical location of the row version within the table. The ctid column can therefore serve as a row identifier, but only temporarily, because this identifier may change after running maintenance commands, such as CLUSTER.
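
For example, one can inspect the ctid values alongside the data; the values shown here are illustrative and will differ on every run:

SELECT ctid, f1, f2 FROM duplicate LIMIT 3;
 ctid  | f1 | f2
-------+----+----
 (0,1) |  7 |  7
 (0,2) |  7 |  2
 (0,3) |  7 |  7
(3 rows)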

To delete the duplicates, one can perform the following query:

WITH should_not_delete AS (
  SELECT min(ctid) AS ctid FROM duplicate GROUP BY f1, f2
)
DELETE FROM duplicate
WHERE ctid NOT IN (SELECT ctid FROM should_not_delete);
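
To verify that the cleanup worked, one can rerun the aggregation from the previous step; it should no longer return any groups:

SELECT count(*), f1, f2 FROM duplicate GROUP BY f1, f2 HAVING count(*) > 1;
 count | f1 | f2
-------+----+----
(0 rows)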

There's more…

There are several other approaches to cleaning up duplicate rows. For example, one can use the CREATE TABLE and SELECT DISTINCT statements to create a table with a unique set of rows, then drop the original table and rename the new table to the original table's name, as shown in the following example:

CREATE TABLE <tmp> AS SELECT DISTINCT * FROM <orig_tbl>;
DROP TABLE <orig_tbl>;
ALTER TABLE <tmp> RENAME TO <orig_tbl>;

Note that this approach might be faster than the approach presented in the preceding example; however, this technique may not work if there are other objects depending on the table that needs to be dropped, such as views, indexes, and so on.
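
A variation that keeps such dependent objects intact is to deduplicate into a temporary table and then reload the original table in place. This is a minimal sketch, assuming the table can be locked for the duration of the operation:

BEGIN;
-- Copy the distinct rows aside.
CREATE TEMP TABLE tmp_dedup AS SELECT DISTINCT * FROM <orig_tbl>;
-- Empty the original table; its indexes, views, and privileges survive.
TRUNCATE <orig_tbl>;
-- Reload the deduplicated rows.
INSERT INTO <orig_tbl> SELECT * FROM tmp_dedup;
COMMIT;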

If the table that contains the duplicate records has a primary key, one can delete the duplicates using the DELETE…USING statement, as follows:

DELETE FROM dup_table a USING dup_table b
WHERE a.att1 = b.att1 AND ... AND a.attn = b.attn
AND a.pk < b.pk;

The list of attributes, att1, ..., attn, is used to join the table with itself, and the primary key, indicated in the code as pk, determines which record in each pair is deleted. Note that this approach is often faster than the first approach, as it does not require aggregation.
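
For illustration, the following self-contained sketch applies this technique to a hypothetical dup_pk table; the row with the highest pk in each group of duplicates survives:

CREATE TABLE dup_pk (pk serial PRIMARY KEY, f1 int, f2 int);
INSERT INTO dup_pk (f1, f2)
SELECT (random() * 9 + 1)::int, (random() * 9 + 1)::int
FROM generate_series(1, 20);

-- Delete every row for which a duplicate with a larger pk exists.
DELETE FROM dup_pk a USING dup_pk b
WHERE a.f1 = b.f1 AND a.f2 = b.f2 AND a.pk < b.pk;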
