How Duplicate Detection Works

This page explains what happens internally when DeDuplica runs a deduplication job — from raw pair scoring by the matching engine through to the encrypted cluster message delivered to the processing queue.

From Pairs to Clusters: Union-Find Grouping

The underlying matching engine compares records and produces a list of pairs — two record IDs and a probability score representing how likely those two records are the same real-world entity.

DeDuplica does not store or process these as individual pairs. Instead, the agent groups them into clusters using a Union-Find (disjoint set) algorithm:

  • If record A matches record B, and record B matches record C, then A, B, and C all belong to the same cluster — even if A and C were never directly compared.
  • Every edge (pair) in the cluster retains its individual probability score.
  • The cluster’s overall maxProbability is the single highest edge probability across all edges in the cluster.

Why clusters instead of pairs? A pair-based model creates ambiguity about which record is canonical and can generate redundant entries for large matching groups. Clustering produces a single, coherent group with one designated base record and a clear set of subordinates to merge.
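The grouping step above can be sketched in a few lines. This is an illustrative Python sketch, not DeDuplica's actual implementation; the function name `build_clusters` and the data shapes are ours:

```python
from collections import defaultdict

def build_clusters(pairs):
    """Group scored pairs into clusters with Union-Find.

    pairs: list of (record_a, record_b, probability) tuples.
    Returns {root_id: {"members": set, "edges": list, "max_probability": float}}.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:                  # path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b, _ in pairs:
        union(a, b)

    clusters = defaultdict(lambda: {"members": set(), "edges": [],
                                    "max_probability": 0.0})
    for a, b, p in pairs:
        c = clusters[find(a)]
        c["members"].update((a, b))
        c["edges"].append((a, b, p))           # every edge keeps its own score
        c["max_probability"] = max(c["max_probability"], p)
    return dict(clusters)
```

For example, pairs A-B at 0.97 and B-C at 0.91 collapse into one cluster {A, B, C} with maxProbability 0.97, even though A and C were never directly compared.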

Base Record Selection

Once a cluster is formed, one member is designated the base record — the record that survives after a merge and acts as the merge target. The selection strategy is configurable per job:

  • Latest: the record with the most recent value in the configured date field is selected as base.
  • Oldest: the record with the oldest value in the configured date field is selected as base.
  • Random / Fallback: the record with the highest number of direct edges in the cluster (the most connected record) is selected.

If the date strategy is configured but a record has no value in the date field, that record is skipped for base selection. If all records lack a date value, the highest-connectivity fallback applies.
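The selection rules, including the skip-and-fallback behaviour, can be sketched as follows. This is an illustrative Python sketch under our own assumed data shapes (`date_of`, `degree_of`), not DeDuplica's API:

```python
def select_base(members, strategy, date_of, degree_of):
    """Pick the base record for a cluster.

    members:   list of record IDs in the cluster
    strategy:  "Latest", "Oldest", or anything else -> connectivity fallback
    date_of:   dict record_id -> comparable date value, or None if missing
    degree_of: dict record_id -> number of direct edges in the cluster
    """
    if strategy in ("Latest", "Oldest"):
        # Records with no value in the date field are skipped for base selection.
        dated = [m for m in members if date_of.get(m) is not None]
        if dated:
            pick = max if strategy == "Latest" else min
            return pick(dated, key=lambda m: date_of[m])
    # Random / Fallback, or every record missing the date field:
    # most-connected record wins.
    return max(members, key=lambda m: degree_of.get(m, 0))
```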

The remaining records become subordinates. Each subordinate carries a path probability to the base, computed using a max-min (weakest-link) algorithm: the probability of reaching the base is the highest achievable path probability where each path’s strength is limited by its weakest edge.

Subordinates are sorted in descending order of path probability.
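The max-min path probability is a "widest path" computation, which can be sketched with a Dijkstra-style search. Illustrative Python only; the function name and data shapes are ours:

```python
import heapq

def path_probabilities(edges, base):
    """Max-min ("weakest-link") probability from every record to the base.

    edges: list of (a, b, probability); the cluster graph is undirected.
    Returns {record_id: best achievable weakest-link probability to base}.
    """
    adj = {}
    for a, b, p in edges:
        adj.setdefault(a, []).append((b, p))
        adj.setdefault(b, []).append((a, p))

    best = {base: 1.0}
    heap = [(-1.0, base)]                    # max-heap via negated bottleneck
    while heap:
        neg, node = heapq.heappop(heap)
        bottleneck = -neg
        if bottleneck < best.get(node, 0.0):
            continue                         # stale entry
        for nbr, p in adj.get(node, []):
            cand = min(bottleneck, p)        # path strength = weakest edge
            if cand > best.get(nbr, 0.0):
                best[nbr] = cand
                heapq.heappush(heap, (-cand, nbr))
    return best
```

For example, with edges A-B at 0.97, B-C at 0.91, and A-C at 0.5, record C's path probability to base A is 0.91 (via B, limited by the 0.91 edge), not the weaker direct 0.5 edge.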

Merge Output Computation

Before the cluster message is queued, the agent computes a merge output — a JSON object representing the ideal merged record given the base and all subordinates. This is driven by per-field merge rules configured on the job’s Process Duplicates tab.

Available Merge Strategies

  • TakeFromBase: always use the base record’s value.
  • TakeFromHighestProbSubordinate: use the value from the highest-probability subordinate.
  • TakeFirstMostPopular: the modal value across all cluster members; ties favour the base record’s value.
  • KeepNonNull: the first non-null, non-empty value, checking the base record first, then subordinates in probability order.
  • Append: concatenate all distinct non-empty values, separated by commas.
  • Sum: the numeric sum across all cluster members.
  • Higher: the maximum numeric value.
  • Lower: the minimum numeric value.
  • Longer: the longest string (by character count).
  • Shorter: the shortest string (by character count).
  • Latest: the most recent date/time value.
  • Oldest: the oldest date/time value.

Type coercion: Values are auto-cast where possible — booleans (true/false/yes/no/1/0), integers, and floats are preserved as their native JSON types. Dates support a wide range of ISO 8601 and SQL formats including timezone offsets.
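A few of the strategies can be sketched as below. This is an illustrative Python sketch, not DeDuplica's implementation, and it covers only a subset of the rules:

```python
def merge_field(rule, base_val, sub_vals):
    """Apply one per-field merge rule (subset of the strategies above).

    base_val: the base record's value for this field
    sub_vals: subordinate values, already in descending probability order
    """
    everyone = [base_val] + list(sub_vals)
    if rule == "TakeFromBase":
        return base_val
    if rule == "KeepNonNull":
        # Base first, then subordinates in probability order.
        return next((v for v in everyone if v not in (None, "")), None)
    if rule == "Append":
        seen, out = set(), []
        for v in everyone:
            if v not in (None, "") and v not in seen:
                seen.add(v)
                out.append(str(v))
        return ",".join(out)
    if rule == "Sum":
        # Numeric coercion: string values like "1" are cast before summing.
        return sum(float(v) for v in everyone if v not in (None, ""))
    raise ValueError(f"unsupported rule: {rule}")
```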

The merge output JSON is encrypted before being placed on the queue (see Encryption below).

Queue Message Size Limit

Azure Storage Queue messages have a hard 64 KB limit. DeDuplica targets a conservative 45 KB payload size to leave headroom for encoding and message overhead. When a cluster is large and its payload would exceed this limit, subordinates are trimmed from the lowest-probability end until the payload fits. The base record and the highest-probability subordinates are always retained.

The ClusterSize field in the stored duplicate reflects the trimmed count, not the original full cluster size.
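The trimming loop can be sketched as follows. Illustrative Python only; the 45 KB constant matches the documented target, but the function and payload shape are our assumptions:

```python
import json

TARGET_BYTES = 45 * 1024   # conservative target under the 64 KB queue limit

def trim_to_fit(base, subordinates, encode=lambda msg: json.dumps(msg).encode()):
    """Drop lowest-probability subordinates until the payload fits.

    subordinates: list of dicts, sorted by probability descending.
    Returns (payload_bytes, kept_subordinates).
    """
    kept = list(subordinates)
    while True:
        payload = encode({"base": base, "subordinates": kept})
        if len(payload) <= TARGET_BYTES or not kept:
            return payload, kept
        kept.pop()   # remove the weakest-probability subordinate from the end
```

Because subordinates are already sorted by probability descending, popping from the end always sacrifices the weakest match first, so the kept list is a prefix of the original.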

Chunked Processing for Large Jobs

To support jobs with large numbers of duplicate clusters without exhausting memory, the agent processes clusters in chunks of 500:

  1. All member record IDs across the chunk are collected.
  2. A single batch database query fetches all required field values for those IDs in one round trip.
  3. Clusters in the chunk are processed and messages queued.
  4. Memory for that chunk is explicitly released before the next chunk begins.
  5. A progress log is written after each chunk (e.g. "Processed 500 of 3200 duplicate clusters").

A single JDBC database connection is opened once at the start of a job and shared across all chunks.
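The chunking loop can be sketched as below. This is an illustrative Python sketch of the control flow (the agent itself is not Python); `fetch_fields` and `handle_cluster` stand in for the batch query and the queueing step:

```python
def process_in_chunks(clusters, fetch_fields, handle_cluster, chunk_size=500):
    """Process clusters in fixed-size chunks with one batch query per chunk."""
    total = len(clusters)
    for start in range(0, total, chunk_size):
        chunk = clusters[start:start + chunk_size]
        # Steps 1-2: collect every member ID, then one batch fetch per chunk.
        ids = {rid for cluster in chunk for rid in cluster["members"]}
        fields = fetch_fields(ids)             # single database round trip
        # Step 3: process each cluster and queue its message.
        for cluster in chunk:
            handle_cluster(cluster, fields)
        # Step 4: chunk-local data goes out of scope here. Step 5: progress log.
        done = min(start + chunk_size, total)
        print(f"Processed {done} of {total} duplicate clusters")
```

A job with 1,200 clusters therefore issues three batch queries (500, 500, 200 IDs) over the single shared connection rather than one query per cluster.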

Deduplication of Overlapping Clusters

Because the matching engine can produce overlapping pairs (the same record ID appearing across multiple pre-grouping candidate pairs), DeDuplica handles overlaps on the receiving side:

When a cluster arrives and any of its record IDs already appear in a Pending duplicate from the same job execution, those existing Pending duplicates are deleted before the new cluster is created. This ensures no two pending duplicates from the same execution can ever share a record ID, preventing double-processing.
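The overlap rule amounts to a delete-then-insert, which can be sketched as below. Illustrative Python under our own assumed shapes; `delete` stands in for the actual database deletion:

```python
def replace_overlapping_pending(new_cluster, pending, delete):
    """Admit a new cluster, deleting Pending duplicates (from the same job
    execution) that share any record ID with it.

    pending: list of {"id": ..., "record_ids": set} for this execution.
    Returns the updated pending list including the new cluster.
    """
    incoming = set(new_cluster["record_ids"])
    survivors = []
    for dup in pending:
        if dup["record_ids"] & incoming:
            delete(dup["id"])        # overlapping Pending duplicate removed first
        else:
            survivors.append(dup)
    return survivors + [new_cluster]
```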

Locked Cluster Suppression

If a cluster arrives where the base record was previously part of a Locked duplicate (a duplicate the user decided never to merge), any subordinates from the incoming cluster that overlap with the locked cluster’s subordinates are silently removed. If all subordinates are removed, the entire incoming cluster is discarded.
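The suppression is a simple set subtraction, sketched below in illustrative Python (our shapes, not DeDuplica's API):

```python
def suppress_locked(cluster, locked_subordinate_ids):
    """Remove subordinates that overlap a Locked cluster with the same base.

    Returns the filtered cluster, or None if no subordinates remain
    (in which case the whole incoming cluster is discarded).
    """
    kept = [s for s in cluster["subordinates"]
            if s not in locked_subordinate_ids]
    if not kept:
        return None
    return {**cluster, "subordinates": kept}
```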

See Duplicates — Locked status for more detail.

Cancelled Cluster Suppression

If an identical cluster was already created and then Cancelled by the user, it is not re-created on subsequent job runs. The fingerprint check compares:

  • Same base record ID
  • Same probability (rounded to 5 decimal places)
  • Same set of subordinate IDs (order-independent)
  • Same MergeOutputJson string

If all four match, the cluster is silently dropped.
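The four-part fingerprint can be sketched as a hashable tuple. Illustrative Python only; the field names are our assumptions:

```python
def fingerprint(cluster):
    """Order-independent identity used for the Cancelled-cluster check."""
    return (
        cluster["base_id"],
        round(cluster["probability"], 5),        # rounded to 5 decimal places
        frozenset(cluster["subordinate_ids"]),   # set: order-independent
        cluster["merge_output_json"],            # compared as a string
    )

def already_cancelled(cluster, cancelled_fingerprints):
    return fingerprint(cluster) in cancelled_fingerprints
```

Two clusters that differ only in subordinate ordering, or in probability beyond the fifth decimal place, produce the same fingerprint and are treated as identical.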

Test Run Behaviour

When a job is run in test mode, no duplicates are created in the database. Instead, a log entry is written for each detected cluster showing:

  • The base record ID
  • A numbered list of all subordinates with their IDs and probabilities
  • The max cluster probability
  • The full merge output JSON

This lets you verify deduplication configuration and merge rules before committing to a live run. See Testing a Job.

Double Encryption of Cluster Data

All record field data flowing through the cluster pipeline is encrypted in transit. DeDuplica uses a two-layer encryption model:

Outer Layer — SUBSCRIPTION_ENCRYPTION_KEY (mandatory)

Applied by the agent before placing the message on the Azure Storage queue. Stripped by DeDuplica’s C# backend when the message is received. This ensures data in the queue is never visible to the queue infrastructure itself.

Inner Layer — CLIENT_ENCRYPTION_KEY (optional, Enterprise)

When configured, this is applied first (before the outer layer). DeDuplica’s backend does not strip this layer — the still-encrypted value passes through to the action agent and into the MergeOutputJson field of webhook payloads. Only your receiving infrastructure (which holds CLIENT_ENCRYPTION_KEY) can decrypt and read the plaintext field values. This means DeDuplica itself never sees the record data.

  • CLIENT_ENCRYPTION_KEY not set: DeDuplica can read merge data (after transit decryption), and the webhook MergeOutputJson is a plaintext JSON string you can parse directly.
  • CLIENT_ENCRYPTION_KEY set (valid AES key): DeDuplica cannot read merge data (the inner layer remains), and the webhook MergeOutputJson is encrypted and must be decrypted before parsing.

See Local Agent — Client-Controlled Encryption for key format requirements and generation instructions.

Do not change encryption keys after duplicates have been stored. Existing records are encrypted with the key that was active at creation time. Rotating a key without first resolving all pending duplicates will make those records permanently unreadable in webhooks and the DeDuplica UI.