How to Remove Duplicate Records from SQL Server, PostgreSQL and MySQL
Removing duplicate records from a relational database is one of the most common data quality tasks. This article covers both the manual SQL approach (for simple cases or one-off cleanups) and the systematic tool-based approach (for production databases with ongoing needs).
The Manual SQL Approach
Step 1: Identify Duplicates
Before deleting anything, find out what you’re dealing with. This query identifies groups of duplicate rows with identical values across specified columns:
```sql
-- Works in SQL Server, PostgreSQL, and MySQL
SELECT
    email,
    first_name,
    last_name,
    COUNT(*) AS duplicate_count
FROM contacts
GROUP BY email, first_name, last_name
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;
```

Replace the column names and table with your actual schema.
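If you want to try the pattern before pointing it at production, the same GROUP BY / HAVING query can be exercised against a throwaway in-memory database. A minimal sketch using Python's standard-library sqlite3 module (the table and sample rows are invented for illustration):

```python
import sqlite3

# Scratch database with one deliberate duplicate.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        id INTEGER PRIMARY KEY,
        email TEXT, first_name TEXT, last_name TEXT
    )
""")
conn.executemany(
    "INSERT INTO contacts (email, first_name, last_name) VALUES (?, ?, ?)",
    [
        ("ada@example.com", "Ada", "Lovelace"),
        ("ada@example.com", "Ada", "Lovelace"),  # exact duplicate
        ("alan@example.com", "Alan", "Turing"),  # unique
    ],
)

# Same GROUP BY / HAVING pattern as above.
rows = conn.execute("""
    SELECT email, first_name, last_name, COUNT(*) AS duplicate_count
    FROM contacts
    GROUP BY email, first_name, last_name
    HAVING COUNT(*) > 1
    ORDER BY duplicate_count DESC
""").fetchall()

print(rows)  # one group: ('ada@example.com', 'Ada', 'Lovelace', 2)
```

Because this query only reads, it is also safe to run against the real table; the scratch run just confirms your column list captures what you consider "the same record".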
Step 2: Choose Which Row to Keep
Use a CTE with ROW_NUMBER() to rank duplicates and mark which ones to delete. In this example, we keep the oldest record (lowest id) and mark later ones for removal:
SQL Server / PostgreSQL:

```sql
WITH ranked AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY email, first_name, last_name
            ORDER BY id ASC
        ) AS rn
    FROM contacts
)
DELETE FROM contacts
WHERE id IN (
    SELECT id FROM ranked WHERE rn > 1
);
```

MySQL (requires a join-based workaround, because a MySQL DELETE cannot select from its own target table in a subquery):
```sql
DELETE c FROM contacts c
INNER JOIN (
    SELECT MIN(id) AS keep_id, email, first_name, last_name
    FROM contacts
    GROUP BY email, first_name, last_name
    HAVING COUNT(*) > 1
) dup ON c.email = dup.email
    AND c.first_name = dup.first_name
    AND c.last_name = dup.last_name
    AND c.id != dup.keep_id;
```

One caveat: equality joins never match NULL values, so the MySQL version skips rows where a matched column is NULL, while the ROW_NUMBER() version groups NULLs together via PARTITION BY. Run either DELETE inside a transaction and check the affected row count before committing.

Limitations of the Manual Approach
The SQL approach above handles exact duplicates — rows where specified columns are character-for-character identical. It does not handle:
- Fuzzy duplicates: “Acme Corp” and “ACME Corporation” — different strings, same company
- Partial duplicates: same person, different email addresses in different records
- Multi-table scenarios: duplicate contacts linked to different accounts that should be merged
- Related record handling: what happens to the orders, notes, or activities attached to the deleted row?
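To make the fuzzy-duplicate case concrete, here is a minimal sketch using Python's standard-library difflib. The exact comparison the SQL above relies on sees two unrelated strings, while even a naive similarity ratio flags the pair for review (real matchers also normalize punctuation and legal suffixes such as "Corp" vs "Corporation"):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Exact comparison (what GROUP BY sees): no match at all.
print("Acme Corp" == "ACME Corporation")                      # False

# Similarity scoring: clearly related, worth a human review.
print(round(similarity("Acme Corp", "ACME Corporation"), 2))  # 0.72
```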
For exact-match deduplication of a simple table, the SQL approach is adequate. For anything more complex, a dedicated tool is more reliable and significantly safer.
The Systematic Tool-Based Approach
For production deduplication — especially with fuzzy matching, related records, or recurring runs — a dedicated tool provides workflow guardrails that raw SQL doesn’t.
Using DeDuplica
DeDuplica connects directly to SQL Server, PostgreSQL and MySQL without requiring any data export. The workflow:
1. Connect your database
In DeDuplica, go to Connections and add a new connection for your database engine. You’ll provide the host, port, database name, and user credentials. DeDuplica needs only read access for the Find step; write access is required for the Process step.
See: SQL Server connection guide, PostgreSQL guide, MySQL guide.
2. Create a job with a Source Definition
Define which table to scan and which fields to use in matching. For a contacts table, you might configure:
- email: Exact Comparator (high confidence signal)
- last_name: Person Name Comparator, threshold 85%
- first_name: Person Name Comparator, threshold 80%
Add filters to restrict scope if needed (e.g., WHERE created_date > '2024-01-01' for incremental processing).
See: source definition documentation.
3. Run a Find Duplicates job
The Find job scans your table, applies the matching rules, and stores identified duplicate groups. No records are modified. Review the output — you’ll see every identified pair or group with their similarity scores.
4. Review and optionally test first
If you’re not sure about your matching rule thresholds, use the Testing a Job feature to run against a 5,000-row sample first. Inspect the results. Adjust thresholds if you’re seeing too many false positives or too many missed duplicates.
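To see what moving a threshold does, consider a hypothetical sketch (the names and scores are invented, and DeDuplica's own comparators are not reproduced here; a generic difflib ratio stands in to illustrate the trade-off). Raising the threshold drops borderline pairs, trading missed duplicates for fewer false positives:

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical candidate pairs from a sample run.
pairs = [
    ("Jon Smith", "John Smith"),  # ~0.95: same person, typo'd name
    ("Bob Stone", "Rob Stone"),   # ~0.89: borderline
    ("Mark Hall", "Mary Hale"),   # ~0.78: probably different people
]

for threshold in (0.80, 0.90):
    matched = [p for p in pairs if score(*p) >= threshold]
    # 0.80 keeps the borderline pair; 0.90 drops it.
    print(threshold, len(matched))
```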
5. Run a Process Duplicates job
Once satisfied with the Find results, run a Process Duplicates job. Configure:
- Which record becomes the base (e.g., the oldest by creation date, or the one with the most related records)
- How conflicting field values are resolved
- What happens to related records attached to subordinates
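Whatever tool executes the merge, the underlying survivorship logic looks roughly like this. A hedged sketch in plain Python (the field names, the keep-oldest rule, and the re-parenting step are illustrative assumptions, not DeDuplica's actual implementation):

```python
# Illustrative duplicate group; each dict stands in for a database row.
group = [
    {"id": 7, "email": "ada@example.com", "phone": "555-0137", "created": "2024-03-01"},
    {"id": 3, "email": "ada@example.com", "phone": None,       "created": "2023-11-20"},
]

# Base record: oldest by creation date (one common survivorship rule).
base = min(group, key=lambda r: r["created"])
subordinates = [r for r in group if r["id"] != base["id"]]

# Conflict resolution: keep the base's values, fill its gaps from subordinates.
merged = dict(base)
for sub in subordinates:
    for field, value in sub.items():
        if merged.get(field) is None and value is not None:
            merged[field] = value

# Re-parent related records that pointed at a subordinate.
orders = [{"order_id": 101, "contact_id": 7}]
subordinate_ids = {r["id"] for r in subordinates}
for order in orders:
    if order["contact_id"] in subordinate_ids:
        order["contact_id"] = base["id"]

print(merged["id"], merged["phone"], orders[0]["contact_id"])  # 3 555-0137 3
```

In production this happens as UPDATE and DELETE statements inside a transaction; the point is that base selection, field resolution, and re-parenting are explicit decisions rather than side effects of a DELETE.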
The Process job executes the merges or deletions (directly where supported, or via webhooks), logs every action, and stores a full run report.
The Key Advantage: Separation of Find and Process
The most important difference from running a DELETE statement is that you see what you’re deleting before it happens. A raw SQL delete is immediate and permanent (unless you’re in a transaction). The DeDuplica workflow gives you a structured review step between identification and action.
For production databases with live users, this separation is not optional — it is how you avoid data loss.
Setting Up Recurring Deduplication
A one-off cleanup is a starting point. For databases in active use, duplicates will re-accumulate. Scheduling a recurring weekly or daily Find + Process job keeps the data clean continuously without manual intervention.
DeDuplica supports this through job scheduling. Schedule the Find job to run (e.g., every Monday at 03:00), and optionally schedule the Process job to automatically resolve high-confidence matches without waiting for manual review.
Quick Reference: Which Approach for Which Scenario
| Scenario | Approach |
|---|---|
| One-time cleanup, exact match only, no related records | Manual SQL |
| One-time cleanup with fuzzy matching | DeDuplica Find + Process job |
| Recurring automated deduplication | DeDuplica with scheduling |
| Production database with related records to preserve | DeDuplica (handles re-parenting) |
| Compliance-sensitive data that can’t leave network | DeDuplica with local agent |
Connect your SQL Server, PostgreSQL, or MySQL database to DeDuplica and run your first deduplication scan for free. Start here — no credit card required.