How to Remove Duplicate Records from SQL Server, PostgreSQL and MySQL
Removing duplicate records from a relational database is one of the most common data quality tasks. This article covers both the manual SQL approach (for simple cases or one-off cleanups) and the systematic tool-based approach (for production databases with ongoing needs).
The Manual SQL Approach
Step 1: Identify Duplicates
Before deleting anything, find out what you’re dealing with. This query identifies groups of duplicate rows with identical values across specified columns:
```sql
-- Works in SQL Server, PostgreSQL, and MySQL
SELECT
    email,
    first_name,
    last_name,
    COUNT(*) AS duplicate_count
FROM contacts
GROUP BY email, first_name, last_name
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;
```

Replace the column names and table with your actual schema.
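If you want to try the pattern before pointing it at production, the same GROUP BY / HAVING query can be exercised against a throwaway in-memory database. A minimal sketch using Python's standard-library sqlite3 module (the table and sample rows are invented for illustration):

```python
import sqlite3

# Scratch database with one deliberate duplicate.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        id INTEGER PRIMARY KEY,
        email TEXT, first_name TEXT, last_name TEXT
    )
""")
conn.executemany(
    "INSERT INTO contacts (email, first_name, last_name) VALUES (?, ?, ?)",
    [
        ("ada@example.com", "Ada", "Lovelace"),
        ("ada@example.com", "Ada", "Lovelace"),  # exact duplicate
        ("alan@example.com", "Alan", "Turing"),  # unique
    ],
)

# Same GROUP BY / HAVING pattern as above.
rows = conn.execute("""
    SELECT email, first_name, last_name, COUNT(*) AS duplicate_count
    FROM contacts
    GROUP BY email, first_name, last_name
    HAVING COUNT(*) > 1
    ORDER BY duplicate_count DESC
""").fetchall()

print(rows)  # one group: ('ada@example.com', 'Ada', 'Lovelace', 2)
```

Because this query only reads, it is also safe to run against the real table; the scratch run just confirms your column list captures what you consider "the same record".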
Step 2: Choose Which Row to Keep
Use a CTE with ROW_NUMBER() to rank duplicates and mark which ones to delete. In this example, we keep the oldest record (lowest id) and mark later ones for removal:
SQL Server / PostgreSQL:

```sql
WITH ranked AS (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY email, first_name, last_name
            ORDER BY id ASC
        ) AS rn
    FROM contacts
)
DELETE FROM contacts
WHERE id IN (
    SELECT id FROM ranked WHERE rn > 1
);
```

MySQL (requires a join-based workaround, because a MySQL DELETE cannot select from its own target table in a subquery):
```sql
DELETE c FROM contacts c
INNER JOIN (
    SELECT MIN(id) AS keep_id, email, first_name, last_name
    FROM contacts
    GROUP BY email, first_name, last_name
    HAVING COUNT(*) > 1
) dup ON c.email = dup.email
    AND c.first_name = dup.first_name
    AND c.last_name = dup.last_name
    AND c.id != dup.keep_id;
```

One caveat: equality joins never match NULL values, so the MySQL version skips rows where a matched column is NULL, while the ROW_NUMBER() version groups NULLs together via PARTITION BY. Run either DELETE inside a transaction and check the affected row count before committing.

Limitations of the Manual Approach
The SQL approach above handles exact duplicates — rows where specified columns are character-for-character identical. It does not handle:
- Fuzzy duplicates: “Acme Corp” and “ACME Corporation” — different strings, same company
- Partial duplicates: same person, different email addresses in different records
- Multi-table scenarios: duplicate contacts linked to different accounts that should be merged
- Related record handling: what happens to the orders, notes, or activities attached to the deleted row?
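To make the fuzzy-duplicate case concrete, here is a minimal sketch using Python's standard-library difflib. The exact comparison the SQL above relies on sees two unrelated strings, while even a naive similarity ratio flags the pair for review (real matchers also normalize punctuation and legal suffixes such as "Corp" vs "Corporation"):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Exact comparison (what GROUP BY sees): no match at all.
print("Acme Corp" == "ACME Corporation")                      # False

# Similarity scoring: clearly related, worth a human review.
print(round(similarity("Acme Corp", "ACME Corporation"), 2))  # 0.72
```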
For exact-match deduplication of a simple table, the SQL approach is adequate. For anything more complex, a dedicated tool is more reliable and significantly safer.
The Systematic Tool-Based Approach
For production deduplication — especially with fuzzy matching, related records, or recurring runs — a dedicated tool provides workflow guardrails that raw SQL doesn’t.
Using DeDuplica
DeDuplica connects directly to SQL Server, PostgreSQL and MySQL without requiring any data export. The workflow:
1. Connect your database
In DeDuplica, go to Connections and add a new connection for your database engine. You’ll provide the host, port, database name, and user credentials. DeDuplica needs only read access for the Find step; write access is required for the Process step.
See: SQL Server connection guide, PostgreSQL guide, MySQL guide.
2. Create a job with a Source Definition
Define which table to scan and which fields to use in matching. For a contacts table, you might configure:
- email: Exact Comparator (high confidence signal)
- last_name: Person Name Comparator, threshold 85%
- first_name: Person Name Comparator, threshold 80%
Add filters to restrict scope if needed (e.g., WHERE created_date > '2024-01-01' for incremental processing).
See: source definition documentation.
3. Run a Find Duplicates job
The Find job scans your table, applies the matching rules, and stores identified duplicate groups. No records are modified. Review the output — you’ll see every identified pair or group with their similarity scores.
4. Review and optionally test first
If you’re not sure about your matching rule thresholds, use the Testing a Job feature to run against a 5,000-row sample first. Inspect the results. Adjust thresholds if you’re seeing too many false positives or too many missed duplicates.
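To see what moving a threshold does, consider a hypothetical sketch (the names and scores are invented, and DeDuplica's own comparators are not reproduced here; a generic difflib ratio stands in to illustrate the trade-off). Raising the threshold drops borderline pairs, trading missed duplicates for fewer false positives:

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical candidate pairs from a sample run.
pairs = [
    ("Jon Smith", "John Smith"),  # ~0.95: same person, typo'd name
    ("Bob Stone", "Rob Stone"),   # ~0.89: borderline
    ("Mark Hall", "Mary Hale"),   # ~0.78: probably different people
]

for threshold in (0.80, 0.90):
    matched = [p for p in pairs if score(*p) >= threshold]
    # 0.80 keeps the borderline pair; 0.90 drops it.
    print(threshold, len(matched))
```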
5. Run a Process Duplicates job
Once satisfied with the Find results, run a Process Duplicates job. Configure:
- Which record becomes the base (e.g., the oldest by creation date, or the one with the most related records)
- How conflicting field values are resolved
- What happens to related records attached to subordinates
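Whatever tool executes the merge, the underlying survivorship logic looks roughly like this. A hedged sketch in plain Python (the field names, the keep-oldest rule, and the re-parenting step are illustrative assumptions, not DeDuplica's actual implementation):

```python
# Illustrative duplicate group; each dict stands in for a database row.
group = [
    {"id": 7, "email": "ada@example.com", "phone": "555-0137", "created": "2024-03-01"},
    {"id": 3, "email": "ada@example.com", "phone": None,       "created": "2023-11-20"},
]

# Base record: oldest by creation date (one common survivorship rule).
base = min(group, key=lambda r: r["created"])
subordinates = [r for r in group if r["id"] != base["id"]]

# Conflict resolution: keep the base's values, fill its gaps from subordinates.
merged = dict(base)
for sub in subordinates:
    for field, value in sub.items():
        if merged.get(field) is None and value is not None:
            merged[field] = value

# Re-parent related records that pointed at a subordinate.
orders = [{"order_id": 101, "contact_id": 7}]
subordinate_ids = {r["id"] for r in subordinates}
for order in orders:
    if order["contact_id"] in subordinate_ids:
        order["contact_id"] = base["id"]

print(merged["id"], merged["phone"], orders[0]["contact_id"])  # 3 555-0137 3
```

In production this happens as UPDATE and DELETE statements inside a transaction; the point is that base selection, field resolution, and re-parenting are explicit decisions rather than side effects of a DELETE.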
The Process job executes the merges or deletions (directly where supported, or via webhooks), logs every action, and stores a full run report.
The Key Advantage: Separation of Find and Process
The most important difference from running a DELETE statement is that you see what you’re deleting before it happens. A raw SQL delete is immediate and permanent (unless you’re in a transaction). The DeDuplica workflow gives you a structured review step between identification and action.
For production databases with live users, this separation is not optional — it is how you avoid data loss.
Setting Up Recurring Deduplication
A one-off cleanup is a starting point. For databases in active use, duplicates will re-accumulate. Scheduling a recurring weekly or daily Find + Process job keeps the data clean continuously without manual intervention.
DeDuplica supports this through job scheduling. Schedule the Find job to run (e.g., every Monday at 03:00), and optionally schedule the Process job to automatically resolve high-confidence matches without waiting for manual review.
Quick Reference: Which Approach for Which Scenario
| Scenario | Approach |
|---|---|
| One-time cleanup, exact match only, no related records | Manual SQL |
| One-time cleanup with fuzzy matching | DeDuplica Find + Process job |
| Recurring automated deduplication | DeDuplica with scheduling |
| Production database with related records to preserve | DeDuplica (handles re-parenting) |
| Compliance-sensitive data that can’t leave network | DeDuplica with local agent |
Connect your SQL Server, PostgreSQL, or MySQL database to DeDuplica and run your first deduplication scan for free. Start here — no credit card required.