Efficiently Removing Duplicate Rows from Large SQL Tables
Adding a unique constraint to a large table (millions of rows) requires careful handling of duplicate entries. Directly deleting duplicates with a standard SQL query can be extremely slow and resource-intensive.
One straightforward method is to back up only the table's unique rows, add the unique constraint, and then restore the table from that deduplicated backup. Duplicates must be filtered out during the backup step, since a plain restore would violate the new constraint.
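A minimal sketch of how this could look in PostgreSQL, assuming a hypothetical events table deduplicated on (field1, field2), keeping the row with the highest max_field (the table and constraint names here are illustrative):

```sql
-- Back up one row per (field1, field2), preferring the highest max_field.
-- DISTINCT ON is PostgreSQL-specific.
CREATE TABLE events_backup AS
SELECT DISTINCT ON (field1, field2) *
FROM events
ORDER BY field1, field2, max_field DESC;

TRUNCATE events;

-- With duplicates gone, the constraint can be added up front.
ALTER TABLE events
    ADD CONSTRAINT events_field1_field2_key UNIQUE (field1, field2);

INSERT INTO events SELECT * FROM events_backup;
DROP TABLE events_backup;
```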
However, a more refined approach optimizes the deletion process using a concise SQL query:
```sql
DELETE FROM my_table
USING my_table AS alias
WHERE my_table.field1 = alias.field1
  AND my_table.field2 = alias.field2
  AND my_table.max_field < alias.max_field;
```

(Note that `table` is a reserved word in PostgreSQL, so a placeholder name such as `my_table` is used here.)
This query removes duplicates by joining the table to itself on `field1` and `field2`. Any row whose `max_field` is lower than that of a matching row is deleted, so only the row with the highest `max_field` in each group survives. For example:
```sql
DELETE FROM user_accounts
USING user_accounts AS ua2
WHERE user_accounts.email = ua2.email
  AND user_accounts.id < ua2.id;
```
This example removes duplicate email addresses, preserving the account with the highest `id` (presumably the most recently created).
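Since the usual goal of deduplication is to add a unique constraint, one sketch is to run the delete and the constraint change in a single transaction so no new duplicates can slip in between the two steps (the constraint name is illustrative):

```sql
BEGIN;

-- Remove duplicate emails, keeping the highest id per address.
DELETE FROM user_accounts
USING user_accounts AS ua2
WHERE user_accounts.email = ua2.email
  AND user_accounts.id < ua2.id;

-- Now that emails are unique, the constraint can be added safely.
ALTER TABLE user_accounts
    ADD CONSTRAINT user_accounts_email_key UNIQUE (email);

COMMIT;
```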
Important note: the `USING` clause in `DELETE`, which improves efficiency by enabling this direct row comparison, is a PostgreSQL extension rather than standard SQL. Alternative methods may be needed for other database systems.
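For engines without `DELETE ... USING`, a more portable sketch (assuming the same hypothetical user_accounts table) keeps the highest `id` per email using a grouped subquery; note that MySQL will not reference the target table in such a subquery directly and requires wrapping it in a derived table first:

```sql
-- Standard-SQL alternative: delete every row that is not the
-- highest-id row for its email address.
DELETE FROM user_accounts
WHERE id NOT IN (
    SELECT MAX(id)
    FROM user_accounts
    GROUP BY email
);
```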