Select the unique rows from the table into a second table that
you use to replace the original one. Or add a unique index to the
table using ALTER
TABLE
, which will remove duplicates as it builds the index. Or
use
DELETE
... LIMIT
n
to remove all but one instance of a specific set of
duplicate rows.
Preventing Duplicates from Occurring in a Table discusses how to
prevent duplicates from being added to a table by creating it with a
unique index. However, if you forget to include a unique index when
you create a table, you may discover later that it contains duplicates
and that it’s necessary to apply some sort of duplicate-removal
technique. The catalog_list
table
used earlier is an example of this, because it contains several
instances in which the same person is listed multiple times:
mysql>SELECT * FROM catalog_list ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name | street |
+-----------+-------------+--------------------------+
| Baxter | Wallace | 57 3rd Ave. |
| BAXTER | WALLACE | 57 3rd Ave. |
| Baxter | Wallace | 57 3rd Ave., Apt 102 |
| Brown | Bartholomew | 432 River Run |
| Isaacson | Jim | 515 Fordam St., Apt. 917 |
| McTavish | Taylor | 432 River Run |
| Pinter | Marlene | 9 Sunset Trail |
| Pinter | Marlene | 9 Sunset Trail |
+-----------+-------------+--------------------------+
The table contains redundant entries and it would be a good idea to remove them, to eliminate duplicate mailings and reduce postage costs. To do this, you have several options:
Select the table’s unique rows into another table, and then use that table to replace the original one. The result is to remove the table’s duplicates. This works when “duplicate” means “the entire row is the same as another.”
Add a unique index to the table using ALTER
TABLE
. This operation turns duplicate
rows into unique rows, where “duplicate” means
“the index values are the same.”
You can remove duplicates for a specific set of duplicate
rows by using DELETE
...
LIMIT
n
to remove all but one row.
This recipe discusses each of these duplicate-removal method. When you consider which of them to choose under various circumstances, the applicability of a given method to a specific problem is often determined by several factors:
If a row is considered to duplicate another only if the entire row is the same, one way to eliminate duplicates from a table is to select its unique rows into a new table that has the same structure, and then replace the original table with the new one. To perform table replacement, use the following procedure:
Create a new table that has the same structure as the
original one. CREATE
TABLE
... LIKE
is useful for this (see Cloning a Table):
mysql>CREATE TABLE tmp LIKE catalog_list;
Use INSERT
INTO
... SELECT
DISTINCT
to select the unique rows
from the original table into the new one:
mysql>INSERT INTO tmp SELECT DISTINCT * FROM catalog_list;
Select rows from the tmp
table to verify that the new table
contains no duplicates:
mysql>SELECT * FROM tmp ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name | street |
+-----------+-------------+--------------------------+
| Baxter | Wallace | 57 3rd Ave. |
| Baxter | Wallace | 57 3rd Ave., Apt 102 |
| Brown | Bartholomew | 432 River Run |
| Isaacson | Jim | 515 Fordam St., Apt. 917 |
| McTavish | Taylor | 432 River Run |
| Pinter | Marlene | 9 Sunset Trail |
+-----------+-------------+--------------------------+
After creating the new tmp
table that contains unique rows,
use it to replace the original catalog_list
table:
mysql>DROP TABLE catalog_list;
mysql>RENAME TABLE tmp TO catalog_list;
The effective result of this procedure is that catalog_list
no longer contains
duplicates.
This table-replacement method works in the absence of an index
(although it might be slow for large tables). For tables that
contain duplicate NULL
values, it
removes those duplicates. It does not prevent the occurrence of
duplicates in the future.
This method requires rows to be completely identical for rows
to be considered duplicates. Thus, it treats as distinct those rows
for Wallace Baxter that have slightly different street
values.
If duplicates are defined only with respect to a subset of the
columns in the table, create a new table that has a unique index for
those columns, select rows into it using INSERT
IGNORE
, and then replace the original
table with the new one:
mysql>CREATE TABLE tmp LIKE catalog_list;
mysql>ALTER TABLE tmp ADD PRIMARY KEY (last_name, first_name);
mysql>INSERT IGNORE INTO tmp SELECT * FROM catalog_list;
mysql>SELECT * FROM tmp ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+ | last_name | first_name | street | +-----------+-------------+--------------------------+ | Baxter | Wallace | 57 3rd Ave. | | Brown | Bartholomew | 432 River Run | | Isaacson | Jim | 515 Fordam St., Apt. 917 | | McTavish | Taylor | 432 River Run | | Pinter | Marlene | 9 Sunset Trail | +-----------+-------------+--------------------------+ mysql>DROP TABLE catalog_list;
mysql>RENAME TABLE tmp TO catalog_list;
The unique index prevents rows with duplicate key values from
being inserted into tmp
, and
IGNORE
tells MySQL not to stop
with an error if a duplicate is found. One shortcoming of this
method is that if the indexed columns can contain NULL
values, you must use a UNIQUE
index rather than a PRIMARY
KEY
, in which case the index will not
remove duplicate NULL
keys.
(UNIQUE
indexes allow multiple
NULL
values.) This method does
prevent occurrence of duplicates in the future.
To remove duplicates from a table “in place,” add a unique
index to the table with ALTER
TABLE
, using the IGNORE
keyword to tell it to discard rows
with duplicate key values during the index construction process. The
original catalog_list
table looks
like this without an index:
mysql>SELECT * FROM catalog_list ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name | street |
+-----------+-------------+--------------------------+
| Baxter | Wallace | 57 3rd Ave. |
| BAXTER | WALLACE | 57 3rd Ave. |
| Baxter | Wallace | 57 3rd Ave., Apt 102 |
| Brown | Bartholomew | 432 River Run |
| Isaacson | Jim | 515 Fordam St., Apt. 917 |
| McTavish | Taylor | 432 River Run |
| Pinter | Marlene | 9 Sunset Trail |
| Pinter | Marlene | 9 Sunset Trail |
+-----------+-------------+--------------------------+
Add a unique index, and then check what effect doing so has on the table contents:
mysql>ALTER IGNORE TABLE catalog_list
->ADD PRIMARY KEY (last_name, first_name);
mysql>SELECT * FROM catalog_list ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+ | last_name | first_name | street | +-----------+-------------+--------------------------+ | Baxter | Wallace | 57 3rd Ave. | | Brown | Bartholomew | 432 River Run | | Isaacson | Jim | 515 Fordam St., Apt. 917 | | McTavish | Taylor | 432 River Run | | Pinter | Marlene | 9 Sunset Trail | +-----------+-------------+--------------------------+
If the indexed columns can contain NULL
, you must use a UNIQUE
index rather than a PRIMARY
KEY
. In that case, the index will not
remove duplicate NULL
key values.
In addition to removing existing duplicates, the method prevents the
occurrence of duplicates in the future.
You can use LIMIT
to
restrict the effect of a DELETE
statement to a subset of the rows that it otherwise would delete.
This makes the statement applicable to removing duplicate rows.
Suppose that you have a table t
with the following contents:
+-------+ | color | +-------+ | blue | | green | | blue | | blue | | red | | green | | red | +-------+
The table lists blue
three
times, and green
and red
twice each. To remove the extra
instances of each color, do this:
mysql>DELETE FROM t WHERE color = 'blue' LIMIT 2;
mysql>DELETE FROM t WHERE color = 'green' LIMIT 1;
mysql>DELETE FROM t WHERE color = 'red' LIMIT 1;
mysql>SELECT * FROM t;
+-------+ | color | +-------+ | blue | | green | | red | +-------+
This technique works in the absence of a unique index, and it
eliminates duplicate NULL
values.
It’s handy if you want to remove duplicates only for a specific set
of rows within a table. However, if there are many different sets of
duplicates that you want to remove, this is not a procedure you’d
want to carry out by hand. The process can be automated by using the
techniques discussed earlier in Counting and Identifying Duplicates for determining which values are
duplicated. There, we wrote a make_dup_count_query()
function to
generate the statement needed to count the number of duplicate
values in a given set of columns in a table. The result of that
statement can be used to generate a set of DELETE
... LIMIT
n
statements that remove duplicate rows and leave only unique rows.
The dups directory of the
recipes
distribution contains
code that shows how to generate these statements.
In general, using DELETE
... LIMIT
n
is likely to be slower than removing
duplicates by using a second table or by adding a unique index.
Those methods keep the data on the server side and let the server do
all the work. DELETE
... LIMIT
n
involves a lot of client-server interaction because it uses a
SELECT
statement to retrieve
information about duplicates, followed by several DELETE
statements to remove instances of
duplicated rows. Also, this technique does not prevent duplicates
from occurring in the future.
3.141.47.51