Duplicates
Duplicates Overview
- When duplicates are found, they are added to the Duplicates/Pending duplicates table.
- Duplicates can be added as Pending (for manual review) or completed automatically if the rules and the connection support automatic merges.
- Each duplicate entry represents a cluster: one base record and one or more subordinate records found to be duplicates of it.
Duplicate Record Fields
Each duplicate entry exposes the following fields:
| Field | Description |
|---|---|
| BaseRecordId | Identifier of the base (master) record in the cluster |
| ClusterSize | Total number of records in the cluster (base + all subordinates) |
| MaxProbability | The highest probability score among all subordinate-to-base pairings in the cluster |
| Subordinates | List of subordinate records, each with their own record ID, data snapshot, and probability score relative to the base |
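As a rough illustration of how these fields relate, here is a minimal Python sketch of a cluster. The class and attribute names are hypothetical and only mirror the table above; they are not DeDuplica's actual data model.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Subordinate:
    record_id: str
    data: dict[str, Any]   # snapshot of the subordinate record's data
    probability: float     # score relative to the base record


@dataclass
class DuplicateCluster:
    base_record_id: str
    subordinates: list[Subordinate] = field(default_factory=list)

    @property
    def cluster_size(self) -> int:
        # Base record plus all subordinates.
        return 1 + len(self.subordinates)

    @property
    def max_probability(self) -> float:
        # Highest subordinate-to-base score; a cluster always has
        # at least one subordinate, so the list is never empty.
        return max(s.probability for s in self.subordinates)


# Hypothetical example: base record "A" with two subordinates.
cluster = DuplicateCluster(
    base_record_id="A",
    subordinates=[
        Subordinate("B", {"name": "Acme Ltd"}, 0.91),
        Subordinate("C", {"name": "ACME Limited"}, 0.87),
    ],
)
```

In this sketch, ClusterSize and MaxProbability are derived from the subordinate list rather than stored separately, which keeps the two values consistent with the list by construction.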
Duplicate Actions
Understanding what happens to duplicates in different states helps you manage your deduplication workflow effectively.
Pending
When a duplicate cluster is waiting for your review, it stays in “Pending” status. If a later job produces a cluster with the same base record ID, DeDuplica updates the existing cluster’s details (probability, cluster size, and merge data) and replaces all of its subordinate rows with the new set, but keeps the original execution ID. As a result, you never see multiple Pending clusters for the same base record, only one that reflects the latest cluster information.
Note: the subordinates don’t need to match the previous set at all. The base record ID is the only key used when deciding whether to update a Pending cluster.
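The update rule described above could look roughly like the following sketch. The `PendingCluster` class, its attribute names, and the `update_pending` function are assumptions made for illustration, not part of DeDuplica's API.

```python
from dataclasses import dataclass, field


@dataclass
class PendingCluster:
    base_record_id: str
    execution_id: str
    probability: float
    cluster_size: int
    merge_output_json: str
    subordinate_ids: list[str] = field(default_factory=list)


def update_pending(existing: PendingCluster, incoming: PendingCluster) -> PendingCluster:
    """Refresh a Pending cluster in place; the base record ID is the only key."""
    if existing.base_record_id != incoming.base_record_id:
        raise ValueError("base record IDs differ; not the same Pending cluster")
    existing.probability = incoming.probability
    existing.cluster_size = incoming.cluster_size
    existing.merge_output_json = incoming.merge_output_json
    # Replace the subordinate rows wholesale; no overlap with the old set is required.
    existing.subordinate_ids = list(incoming.subordinate_ids)
    # existing.execution_id is deliberately left untouched.
    return existing


# Hypothetical example: a later job finds a completely different subordinate for base "A".
old = PendingCluster("A", "job-1", 0.80, 3, '{"name":"Acme"}', ["B", "C"])
new = PendingCluster("A", "job-2", 0.95, 2, '{"name":"Acme Ltd"}', ["D"])
updated = update_pending(old, new)
```

After the call, the cluster carries the new probability, size, merge data, and subordinates, while the execution ID still points at the job that first created it.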
Completed
When you complete a duplicate, the actions you’ve configured are carried out:
- Automatic merge (if your connection supports it)
- Webhook triggered to notify other systems (if enabled for the job)
Once a duplicate is completed, it is kept unchanged for your records (this is separate from the Locked status described below). If a future job finds the same cluster again, DeDuplica creates a fresh duplicate record, because your data may have changed and it might be worth reviewing again.
Cancelled
This means you’ve decided these aren’t really duplicates. Maybe they looked similar but turned out to be different records.
When the next job runs:
- If it finds the exact same cluster — meaning: same base record ID, same probability (rounded to 5 decimal places), same set of subordinate IDs, and same merge output — it won’t create a new duplicate (we trust your judgment).
- If any of those four properties has changed, it will create a new duplicate for you to review — because something changed and it might be worth a second look.
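One way to picture the "exact same cluster" comparison is as a fingerprint built from the four properties. The sketch below is an assumption about how such a fingerprint might be computed; the function name and the use of SHA-256 over a JSON payload are illustrative choices, not DeDuplica's documented implementation.

```python
import hashlib
import json


def cancelled_fingerprint(base_record_id, probability, subordinate_ids, merge_output_json):
    """Combine the four compared properties into one comparable string."""
    payload = json.dumps([
        base_record_id,
        f"{probability:.5f}",          # probability rounded to 5 decimal places
        sorted(set(subordinate_ids)),  # set of subordinate IDs, order-independent
        merge_output_json,             # merge output compared as a plain string
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same cluster: subordinate order and sub-rounding noise in the score do not matter.
a = cancelled_fingerprint("A", 0.9123461, ["B", "C"], '{"name":"Acme"}')
b = cancelled_fingerprint("A", 0.9123489, ["C", "B"], '{"name":"Acme"}')

# Different merge output: treated as a new cluster worth reviewing.
c = cancelled_fingerprint("A", 0.9123461, ["B", "C"], '{"name":"Acme Ltd"}')
```

Here `a` and `b` are equal, so the second cluster would be suppressed, while `c` differs and would produce a new duplicate for review.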
Locked
Locking a duplicate permanently excludes those records from future matching. This tells DeDuplica: “No subordinate in this cluster will ever be linked to this base record again.”
Any incoming cluster whose subordinate record IDs appear in a Locked cluster with the same base record ID will have those subordinates silently removed before processing. If all subordinates are removed, the entire incoming cluster is dropped.
Even if the data in those records changes completely in future jobs, DeDuplica will not flag them as duplicates — the lock stays in place. This is useful when you know two or more records look similar but should always remain separate.
The only way to allow them to be detected as duplicates again is to delete the locked duplicate record.
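The filtering behaviour can be sketched as follows. The `filter_locked` function and the shape of the `locked` lookup (base record ID mapped to a set of subordinate IDs) are hypothetical and only illustrate the rule described in this section.

```python
def filter_locked(base_record_id, subordinate_ids, locked):
    """Remove subordinates that are locked against this base record.

    `locked` maps a base record ID to the set of subordinate IDs that
    appear in Locked clusters for that base. Returns the remaining
    subordinate IDs, or None when the whole incoming cluster is dropped.
    """
    blocked = locked.get(base_record_id, set())
    remaining = [s for s in subordinate_ids if s not in blocked]
    return remaining or None


# Hypothetical example: "B" is locked against base "A".
locked = {"A": {"B"}}

partly_filtered = filter_locked("A", ["B", "C"], locked)  # "B" is removed, "C" stays
fully_dropped = filter_locked("A", ["B"], locked)         # nothing left, cluster dropped
other_base = filter_locked("Z", ["B"], locked)            # lock applies to base "A" only
```

Note that the lock is keyed on the base record: in this sketch, "B" can still be reported as a subordinate of a different base record.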
Deleted
Sometimes things go wrong, or you want to clear out a mistake. Deleting a duplicate removes it from your list.
DeDuplica treats it like it never existed. If a future job finds the same cluster again, it will create a fresh duplicate record just like the first time.
Cluster Processing Decision Flow
When an incoming cluster is received, DeDuplica applies the following steps in order:
Step 1 — Locked check
Any incoming subordinate whose record ID appears in a Locked cluster with the same base record ID is silently removed from the list. If all subordinates are filtered out, the whole cluster is dropped.
Step 2 — Pending update check
Matching is done solely on the base record ID:
- If a Pending cluster with that base record ID exists:
  - All its existing subordinate rows are deleted and replaced with the incoming ones
  - Probability, ClusterSize, and MergeOutputJson are overwritten
  - No new Duplicate row is created
The subordinates don’t need to match the previous set — a Pending cluster for base A is updated regardless of how many subordinates changed or even if entirely different subordinates arrive.
Step 3 — Cancelled suppression check
If no Pending match is found, DeDuplica checks for an exact Cancelled match using a fingerprint of all four of:
- Same base record ID
- Same probability (rounded to 5 decimal places)
- Same set of subordinate IDs (order-independent)
- Same MergeOutputJson string
If all four match, the cluster is silently dropped.
Step 4 — Create new
If nothing matched above, a brand new Duplicate row is created with status Pending or Completed depending on the job’s addToPending action configuration.
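The four steps above can be sketched as a single function that reports which outcome applies. Everything here is illustrative: the `Outcome` enum, the `process_cluster` function, the in-memory lookups, and the treatment of `add_to_pending` as a boolean are assumptions, as is the choice to compute the Cancelled fingerprint from the subordinates that remain after Step 1.

```python
from enum import Enum


class Outcome(Enum):
    DROPPED_LOCKED = "dropped: all subordinates are locked"
    UPDATED_PENDING = "existing Pending cluster updated"
    SUPPRESSED_CANCELLED = "dropped: matches a Cancelled cluster"
    CREATED_PENDING = "new duplicate created as Pending"
    CREATED_COMPLETED = "new duplicate created as Completed"


def process_cluster(base_id, sub_ids, probability, merge_output_json, *,
                    locked, pending_base_ids, cancelled_keys, add_to_pending):
    # Step 1: remove subordinates locked against this base record.
    blocked = locked.get(base_id, set())
    sub_ids = [s for s in sub_ids if s not in blocked]
    if not sub_ids:
        return Outcome.DROPPED_LOCKED

    # Step 2: a Pending cluster is matched on the base record ID alone.
    if base_id in pending_base_ids:
        return Outcome.UPDATED_PENDING

    # Step 3: an exact Cancelled match requires all four properties to agree.
    key = (base_id, round(probability, 5), frozenset(sub_ids), merge_output_json)
    if key in cancelled_keys:
        return Outcome.SUPPRESSED_CANCELLED

    # Step 4: nothing matched, so a new Duplicate row is created.
    return Outcome.CREATED_PENDING if add_to_pending else Outcome.CREATED_COMPLETED


# Hypothetical state: "B" is locked against "A", base "P" has a Pending cluster,
# and the cluster ("C", 0.9, {"D"}, "{}") was previously cancelled.
state = dict(
    locked={"A": {"B"}},
    pending_base_ids={"P"},
    cancelled_keys={("C", 0.9, frozenset({"D"}), "{}")},
)

r1 = process_cluster("A", ["B"], 0.9, "{}", add_to_pending=True, **state)
r2 = process_cluster("P", ["X"], 0.7, "{}", add_to_pending=True, **state)
r3 = process_cluster("C", ["D"], 0.9, "{}", add_to_pending=True, **state)
r4 = process_cluster("C", ["D", "E"], 0.9, "{}", add_to_pending=False, **state)
```

The order of the checks matters: a cluster that would match a Cancelled fingerprint still updates a Pending cluster first if one exists for the same base record ID.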