How Duplicate Detection Works
This page explains what happens internally when DeDuplica runs a deduplication job — from raw pair scoring by the matching engine through to the encrypted cluster message delivered to the processing queue.
From Pairs to Clusters: Union-Find Grouping
The underlying matching engine compares records and produces a list of pairs — two record IDs and a probability score representing how likely those two records are the same real-world entity.
DeDuplica does not store or process these as individual pairs. Instead, the agent groups them into clusters using a Union-Find (disjoint set) algorithm:
- If record A matches record B, and record B matches record C, then A, B, and C all belong to the same cluster — even if A and C were never directly compared.
- Every edge (pair) in the cluster retains its individual probability score.
- The cluster’s overall maxProbability is the single highest edge probability across all edges in the cluster.
Why clusters instead of pairs? A pair-based model creates ambiguity about which record is canonical and can generate redundant entries for large matching groups. Clustering produces a single, coherent group with one designated base record and a clear set of subordinates to merge.
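The grouping step can be sketched with a standard Union-Find structure. This is an illustrative Python sketch of the technique, not DeDuplica's actual implementation; the record IDs and probabilities are made up:

```python
class UnionFind:
    """Disjoint-set with path compression, for grouping matched pairs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

# Pairs from the matching engine: (record_a, record_b, probability)
pairs = [("A", "B", 0.97), ("B", "C", 0.91), ("D", "E", 0.88)]

uf = UnionFind()
for a, b, _ in pairs:
    uf.union(a, b)

# Group records under their root; track maxProbability per cluster
clusters = {}
for a, b, p in pairs:
    c = clusters.setdefault(uf.find(a), {"members": set(), "maxProbability": 0.0})
    c["members"] |= {a, b}
    c["maxProbability"] = max(c["maxProbability"], p)

# A, B, C land in one cluster even though A and C were never compared
print(sorted(tuple(sorted(c["members"])) for c in clusters.values()))
```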
Base Record Selection
Once a cluster is formed, one member is designated the base record — the record that survives after a merge and acts as the merge target. The selection strategy is configurable per job:
| Strategy | Behaviour |
|---|---|
| Latest | The record with the most recent value in the configured date field is selected as base |
| Oldest | The record with the oldest value in the configured date field is selected as base |
| Random / Fallback | The record with the highest number of direct edges in the cluster (most connected) is selected |
If the date strategy is configured but a record has no value in the date field, that record is skipped for base selection. If all records lack a date value, the highest-connectivity fallback applies.
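The Latest strategy and its connectivity fallback can be sketched as follows. This is a minimal illustration of the selection rules described above, with invented record IDs and date values, not DeDuplica's internal code:

```python
from collections import Counter

def select_base(members, dates, edges):
    """Pick the base record: the most recent date wins; records without a
    date value are skipped; if no record has a date, fall back to the
    record with the most direct edges in the cluster (most connected)."""
    dated = [m for m in members if dates.get(m) is not None]
    if dated:
        return max(dated, key=lambda m: dates[m])
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(members, key=lambda m: degree[m])

members = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]

# Latest-date strategy: C has the most recent value in the date field
print(select_base(members, {"A": "2023-01-05", "B": "2024-06-12", "C": "2024-09-30"}, edges))

# No record has a date value: the highest-connectivity fallback picks B
print(select_base(members, {}, edges))
```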
The remaining records become subordinates. Each subordinate carries a path probability to the base, computed using a max-min (weakest-link) algorithm: the probability of reaching the base is the highest achievable path probability where each path’s strength is limited by its weakest edge.
Subordinates are sorted by probability descending.
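The max-min computation is a widest-path problem, solvable with a Dijkstra-style search that maximises the weakest edge instead of minimising distance. A sketch of the idea, using invented edges rather than DeDuplica's actual code:

```python
import heapq

def path_probabilities(base, edges):
    """Max-min (weakest-link) path strength from every record to the base:
    a path's strength is its weakest edge, and each record takes the
    strongest such path. Dijkstra-style widest-path search."""
    adj = {}
    for a, b, p in edges:
        adj.setdefault(a, []).append((b, p))
        adj.setdefault(b, []).append((a, p))
    best = {base: 1.0}
    heap = [(-1.0, base)]
    while heap:
        neg, node = heapq.heappop(heap)
        strength = -neg
        if strength < best.get(node, 0.0):
            continue  # stale heap entry
        for nxt, p in adj.get(node, []):
            s = min(strength, p)  # path strength is limited by its weakest edge
            if s > best.get(nxt, 0.0):
                best[nxt] = s
                heapq.heappush(heap, (-s, nxt))
    del best[base]
    # subordinates sorted by probability descending
    return dict(sorted(best.items(), key=lambda kv: -kv[1]))

edges = [("A", "B", 0.95), ("B", "C", 0.80), ("A", "C", 0.70)]
# C's best path to base A runs through B: min(0.95, 0.80) = 0.80 beats
# the direct 0.70 edge
print(path_probabilities("A", edges))
```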
Merge Output Computation
Before the cluster message is queued, the agent computes a merge output — a JSON object representing the ideal merged record given the base and all subordinates. This is driven by per-field merge rules configured on the job’s Process Duplicates tab.
Available Merge Strategies
| Strategy | Behaviour |
|---|---|
| TakeFromBase | Always use the base record’s value |
| TakeFromHighestProbSubordinate | Use the value from the highest-probability subordinate |
| TakeFirstMostPopular | Modal value across all cluster members; tie-break favours the base record’s value |
| KeepNonNull | First non-null, non-empty value: base record first, then subordinates in probability order |
| Append | Concatenate all distinct non-empty values with a comma separator |
| Sum | Numeric sum across all cluster members |
| Higher | Maximum numeric value |
| Lower | Minimum numeric value |
| Longer | Longest string (by character count) |
| Shorter | Shortest string (by character count) |
| Latest | Most recent date/time value |
| Oldest | Oldest date/time value |
Type coercion: Values are auto-cast where possible — booleans (true/false/yes/no/1/0), integers, and floats are preserved as their native JSON types. Dates support a wide range of ISO 8601 and SQL formats including timezone offsets.
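A few of the strategies can be sketched to make the ordering rules concrete. This is an illustrative Python approximation of the table above, not DeDuplica's implementation; the field values are invented, and subordinate values are assumed to arrive in probability order, highest first:

```python
from collections import Counter

def merge_field(strategy, base_value, sub_values):
    """Apply one per-field merge rule. sub_values are ordered by
    subordinate probability, highest first."""
    all_values = [base_value] + sub_values
    if strategy == "TakeFromBase":
        return base_value
    if strategy == "TakeFromHighestProbSubordinate":
        return sub_values[0] if sub_values else base_value
    if strategy == "KeepNonNull":
        # base record first, then subordinates in probability order
        return next((v for v in all_values if v not in (None, "")), None)
    if strategy == "TakeFirstMostPopular":
        counts = Counter(all_values)
        top = max(counts.values())
        # iterating from the base first means a tie favours the base's value
        return next(v for v in all_values if counts[v] == top)
    if strategy == "Append":
        seen, parts = set(), []
        for v in all_values:
            if v not in (None, "") and v not in seen:
                seen.add(v)
                parts.append(str(v))
        return ",".join(parts)
    if strategy == "Sum":
        return sum(v for v in all_values if v is not None)
    raise ValueError(f"unknown strategy: {strategy}")

print(merge_field("KeepNonNull", None, ["", "alice@example.com"]))
print(merge_field("TakeFirstMostPopular", "Acme", ["Acme Ltd", "Acme Ltd"]))
print(merge_field("TakeFirstMostPopular", "Acme", ["Acme Ltd"]))  # tie: base wins
```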
The merge output JSON is encrypted before being placed on the queue (see Encryption below).
Queue Message Size Limit
Azure Storage Queue messages have a hard 64 KB limit. DeDuplica uses a conservative 45 KB target to leave overhead. When a cluster is large and its payload would exceed this limit, subordinates are trimmed from the lowest-probability end until the payload fits. The base record and the highest-probability subordinates are always retained.
The ClusterSize field in the stored duplicate reflects the trimmed count, not the original full cluster size.
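The trimming loop can be sketched as follows. The payload shape and field names here are illustrative assumptions, not DeDuplica's actual message schema; only the trimming rule itself (drop from the lowest-probability end until the serialized size fits) reflects the behaviour described above:

```python
import json

QUEUE_TARGET_BYTES = 45 * 1024  # conservative target under Azure's 64 KB limit

def trim_to_fit(base, subordinates, limit=QUEUE_TARGET_BYTES):
    """Drop subordinates from the lowest-probability end until the payload
    fits. Subordinates are assumed sorted by probability descending, so the
    base and the highest-probability subordinates are always retained."""
    subs = list(subordinates)
    while subs:
        payload = json.dumps({"base": base, "subordinates": subs})
        if len(payload.encode("utf-8")) <= limit:
            return payload, subs
        subs.pop()  # remove the current lowest-probability subordinate
    return json.dumps({"base": base, "subordinates": []}), []

# Illustrative: 2000 padded subordinates will not fit in 45 KB
subs = [{"id": f"rec-{i}", "probability": 1 - i / 10000, "data": "x" * 100}
        for i in range(2000)]
payload, kept = trim_to_fit({"id": "rec-base"}, subs)
print(len(kept), "subordinates kept;", len(payload.encode("utf-8")), "bytes")
```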
Chunked Processing for Large Jobs
To support jobs with large numbers of duplicate clusters without exhausting memory, the agent processes clusters in chunks of 500:
- All member record IDs across the chunk are collected.
- A single batch database query fetches all required field values for those IDs in one round trip.
- Clusters in the chunk are processed and messages queued.
- Memory for that chunk is explicitly released before the next chunk begins.
- A progress log is written after each chunk (e.g. "Processed 500 of 3200 duplicate clusters").
A single JDBC database connection is opened once at the start of a job and shared across all chunks.
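The chunked loop can be sketched as below. This is an illustrative Python outline of the pattern, not the agent's JDBC-based code; `fetch_fields` and `enqueue` are hypothetical stand-ins for the batch database query and the queue client:

```python
def process_in_chunks(clusters, fetch_fields, enqueue, chunk_size=500):
    """Process clusters in fixed-size chunks: one batch DB query per chunk,
    queue the cluster messages, release the chunk's memory, log progress."""
    total = len(clusters)
    for start in range(0, total, chunk_size):
        chunk = clusters[start:start + chunk_size]
        # One round trip for every record ID in the chunk
        ids = {rid for cluster in chunk for rid in cluster}
        fields = fetch_fields(ids)
        for cluster in chunk:
            enqueue({rid: fields[rid] for rid in cluster})
        del fields  # release chunk memory before the next iteration
        print(f"Processed {min(start + chunk_size, total)} of {total} duplicate clusters")

# Toy stand-ins for the database fetch and the queue client
messages = []
process_in_chunks(
    [["A", "B"], ["C", "D"], ["E"]],
    fetch_fields=lambda ids: {rid: {"name": rid.lower()} for rid in ids},
    enqueue=messages.append,
    chunk_size=2,
)
```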
Deduplication of Overlapping Clusters
Because the matching engine can produce overlapping pairs (the same record ID appearing across multiple pre-grouping candidate pairs), DeDuplica handles overlaps on the receiving side:
When a cluster arrives and any of its record IDs already appear in a Pending duplicate from the same job execution, those existing Pending duplicates are deleted before the new cluster is created. This ensures no two pending duplicates from the same execution can ever share a record ID, preventing double-processing.
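The overlap rule reduces to a set-intersection check. A minimal sketch of the replace-on-overlap behaviour, using an invented in-memory list in place of the duplicate store:

```python
def replace_overlapping(pending, incoming_ids, incoming):
    """Delete Pending duplicates from the same execution that share any
    record ID with the incoming cluster, then create the incoming cluster."""
    kept = [d for d in pending if not (set(d["ids"]) & set(incoming_ids))]
    kept.append({"ids": list(incoming_ids), **incoming})
    return kept

pending = [
    {"ids": ["A", "B"], "status": "Pending"},
    {"ids": ["C", "D"], "status": "Pending"},
]
# The incoming cluster shares "B" with the first Pending duplicate,
# so that duplicate is deleted before the new cluster is created
result = replace_overlapping(pending, ["B", "E"], {"status": "Pending"})
print([d["ids"] for d in result])
```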
Locked Cluster Suppression
If a cluster arrives where the base record was previously part of a Locked duplicate (a duplicate the user decided never to merge), any subordinates from the incoming cluster that overlap with the locked cluster’s subordinates are silently removed. If all subordinates are removed, the entire incoming cluster is discarded.
See Duplicates — Locked status for more detail.
Cancelled Cluster Suppression
If an identical cluster was already created and then Cancelled by the user, it is not re-created on subsequent job runs. The fingerprint check compares:
- Same base record ID
- Same probability (rounded to 5 decimal places)
- Same set of subordinate IDs (order-independent)
- Same MergeOutputJson string
If all four match, the cluster is silently dropped.
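The four-part check behaves like a fingerprint over the cluster. A sketch of the comparison, assuming invented IDs and probabilities (order-independence comes from the set of subordinate IDs, rounding from the five-decimal probability):

```python
def fingerprint(base_id, probability, subordinate_ids, merge_output_json):
    """Order-independent identity of a cluster, used to skip re-creating
    clusters the user has already Cancelled."""
    return (
        base_id,
        round(probability, 5),          # probability rounded to 5 decimals
        frozenset(subordinate_ids),     # subordinate set, order-independent
        merge_output_json,              # exact MergeOutputJson string
    )

cancelled = {fingerprint("A", 0.95123, ["B", "C"], '{"name":"Acme"}')}

# Same cluster on a later run: subordinates reordered, probability differs
# only past the 5th decimal place, so it still matches and is dropped
incoming = fingerprint("A", 0.951234, ["C", "B"], '{"name":"Acme"}')
print(incoming in cancelled)  # True
```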
Test Run Behaviour
When a job is run in test mode, no duplicates are created in the database. Instead, a log entry is written for each detected cluster showing:
- The base record ID
- A numbered list of all subordinates with their IDs and probabilities
- The max cluster probability
- The full merge output JSON
This lets you verify deduplication configuration and merge rules before committing to a live run. See Testing a Job.
Double Encryption of Cluster Data
All record field data flowing through the cluster pipeline is encrypted in transit. DeDuplica uses a two-layer encryption model:
Outer Layer — SUBSCRIPTION_ENCRYPTION_KEY (mandatory)
Applied by the agent before placing the message on the Azure Storage queue. Stripped by DeDuplica’s C# backend when the message is received. This ensures data in the queue is never visible to the queue infrastructure itself.
Inner Layer — CLIENT_ENCRYPTION_KEY (optional, Enterprise)
When configured, this is applied first (before the outer layer). DeDuplica’s backend does not strip this layer — the still-encrypted value passes through to the action agent and into the MergeOutputJson field of webhook payloads. Only your receiving infrastructure (which holds CLIENT_ENCRYPTION_KEY) can decrypt and read the plaintext field values. This means DeDuplica itself never sees the record data.
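The wrapping order can be demonstrated with stand-in cipher functions. The XOR-and-base64 "encryption" below is a toy placeholder used only to show which layer is applied and stripped where; the real layers use AES, and the key names are taken from the configuration above:

```python
import base64

# Toy stand-in cipher for illustration only; the real layers use AES
def encrypt(key: bytes, data: bytes) -> bytes:
    return base64.b64encode(bytes(b ^ key[i % len(key)] for i, b in enumerate(data)))

def decrypt(key: bytes, data: bytes) -> bytes:
    raw = base64.b64decode(data)
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(raw))

CLIENT_KEY = b"client-key"              # held only by your infrastructure
SUBSCRIPTION_KEY = b"subscription-key"  # shared with DeDuplica's backend

merge_output = b'{"name": "Acme Ltd"}'

# Agent side: inner (client) layer first, then the outer (subscription) layer
queued = encrypt(SUBSCRIPTION_KEY, encrypt(CLIENT_KEY, merge_output))

# DeDuplica backend: strips only the outer layer; the inner layer remains,
# so the backend never sees the plaintext field values
in_backend = decrypt(SUBSCRIPTION_KEY, queued)
assert in_backend != merge_output

# Your webhook receiver: strips the inner layer with CLIENT_ENCRYPTION_KEY
assert decrypt(CLIENT_KEY, in_backend) == merge_output
```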
| Configuration | DeDuplica can read merge data | Webhook MergeOutputJson |
|---|---|---|
| CLIENT_ENCRYPTION_KEY not set | Yes (after transit decryption) | Plaintext JSON string — parse directly |
| CLIENT_ENCRYPTION_KEY set (valid AES key) | No — inner layer remains | Encrypted — must decrypt before parsing |
See Local Agent — Client-Controlled Encryption for key format requirements and generation instructions.
Do not change encryption keys after duplicates have been stored. Existing records are encrypted with the key that was active at creation time. Rotating a key without first resolving all pending duplicates will make those records permanently unreadable in webhooks and the DeDuplica UI.