Find Duplicates

DeDuplica helps you identify duplicate records in your data using an intelligent, probability-based matching engine. While the underlying technology is advanced, configuring it is intentionally simple and intuitive.

This guide explains how field ranking and strictness work together, how to choose the right settings, and what to avoid.


🔍 How Duplicate Detection Works (Conceptually)

DeDuplica evaluates records by comparing selected fields and estimating how likely two records represent the same real-world entity.

Each comparison produces a match score between 0 and 1:

  • 0 = definitely not the same
  • 1 = definitely the same

A record pair is considered a duplicate if its score exceeds the threshold, which is determined automatically based on your settings.

You control this behavior using:

  • Field Rank (importance of each field)
  • Strictness (overall conservativeness of the matching)

🧩 Fields and Rank (1–10)

Each field you add can be assigned a rank from 1 to 10.

What Rank Means

Rank defines how strongly a field influences duplicate detection.

Rank | Meaning          | Typical Use
-----|------------------|----------------------------------------
1–2  | Very weak signal | Country, language, currency
3–4  | Weak signal      | City, region
5–6  | Medium signal    | Postal code, company type
7–8  | Strong signal    | Street address, phone number
9–10 | Decisive signal  | Email, national ID, registration number

A Better Example (Realistic Scenario)

Imagine you are deduplicating customer records using:

Field        | Rank
-------------|-----
Email        | 9
Phone number | 8
Last name    | 6
City         | 3

How DeDuplica interprets this:

  • Matching email almost guarantees the same person
  • Matching phone is strong but not absolute
  • Matching last name helps, but is common
  • Matching city alone means very little

Two records that share only a city and a last name will generally not be flagged as duplicates. Two records with the same email almost certainly will be.

This reflects how humans naturally reason about identity.
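
This guide does not describe DeDuplica's actual scoring model, so the following Python sketch is only one plausible way to picture how ranks could turn into a match score. The quadratic rank-to-weight curve, the noisy-OR combination, the exact-equality comparison, and the names `field_weight` and `match_score` are all assumptions made for illustration, not the product's implementation.

```python
# Illustrative sketch only: NOT DeDuplica's real algorithm.
# Assumptions: quadratic rank-to-weight curve, noisy-OR combination,
# exact-equality field comparison.

RANKS = {"email": 9, "phone": 8, "last_name": 6, "city": 3}


def field_weight(rank: int) -> float:
    """Map a 1-10 rank to a 0-1 evidence weight (assumed curve)."""
    if not 1 <= rank <= 10:
        raise ValueError("rank must be between 1 and 10")
    return (rank / 10) ** 2


def match_score(a: dict, b: dict, ranks: dict) -> float:
    """Combine the weights of all agreeing fields with a noisy-OR."""
    remaining_doubt = 1.0
    for field, rank in ranks.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a is not None and value_a == value_b:
            remaining_doubt *= 1.0 - field_weight(rank)
    return 1.0 - remaining_doubt


anna = {"email": "anna@example.com", "phone": "555-0100",
        "last_name": "Smith", "city": "Leeds"}
same_email = {"email": "anna@example.com", "phone": "555-0199",
              "last_name": "Smyth", "city": "York"}
same_city_and_name = {"email": "bob@example.com", "phone": "555-0142",
                      "last_name": "Smith", "city": "Leeds"}

print(round(match_score(anna, same_email, RANKS), 2))          # 0.81
print(round(match_score(anna, same_city_and_name, RANKS), 2))  # 0.42
```

Under these assumed weights, a shared email alone scores about 0.81, while a shared city plus a shared last name only reaches about 0.42. That matches the intuition in the example: one strong field outweighs several weak ones.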


🎚 Strictness (1–10)

Strictness controls how conservative the algorithm is overall.

It answers the question:

“How sure should the system be before calling something a duplicate?”

Strictness Levels

Strictness | Behavior
-----------|-----------------------------------------------
1–2        | Very lenient – finds many possible duplicates
3–4        | Lenient – useful for exploration and review
5–6        | Balanced – recommended starting point
7–8        | Strict – high confidence matches
9–10       | Very strict – near-certain duplicates only

What Strictness Actually Does

Strictness adjusts the decision threshold:

  • Low strictness → lower threshold → more matches
  • High strictness → higher threshold → fewer matches

With Strictness = 10, DeDuplica will only flag duplicates when strong fields agree convincingly.
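
To make the relationship concrete, here is a minimal Python sketch of a strictness-to-threshold mapping. The linear formula and its endpoints (0.40 at strictness 1, 0.895 at strictness 10) are invented for illustration. DeDuplica calibrates its real threshold automatically, and this guide does not state the formula it uses.

```python
# Illustrative sketch only: the linear mapping and its constants are
# assumptions, not DeDuplica's calibrated threshold.

def threshold_for(strictness: int) -> float:
    """Return an assumed score threshold for a 1-10 strictness level."""
    if not 1 <= strictness <= 10:
        raise ValueError("strictness must be between 1 and 10")
    return 0.40 + 0.055 * (strictness - 1)


def is_duplicate(score: float, strictness: int) -> bool:
    """Flag a pair only when its score exceeds the threshold."""
    return score > threshold_for(strictness)


score = 0.70  # hypothetical match score for one record pair
for level in (3, 6, 9):
    print(level, round(threshold_for(level), 3), is_duplicate(score, level))
```

With this assumed mapping, the same pair (score 0.70) is flagged at strictness 3 and 6 but not at 9: the evidence has not changed, only the amount of evidence required.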


🧠 How Rank and Strictness Work Together

Think of the process as two layers:

  1. Field Rank decides what evidence matters
  2. Strictness decides how much evidence is enough

Example: Business Deduplication

Field        | Rank
-------------|-----
Company name | 7
VAT number   | 10
Country      | 2

Strictness: 9

Results:

  • Same country → ignored
  • Similar company name → not enough
  • Same VAT number → duplicate
  • Same name + same country → likely duplicate

This mirrors real-world decision making.


✅ Good Practices

✔ Use High Ranks for Truly Identifying Fields

Good candidates for rank 7–10:

  • Email
  • Phone number
  • Legal identifiers
  • Registration numbers

✔ Use Low Ranks for Common or Expected Values

Good candidates for rank 1–4:

  • Country
  • Language
  • Status
  • Category

✔ Increase Strictness for Automation

If duplicates will be merged automatically, use:

  • Strictness 8–10

✔ Lower Strictness for Review Workflows

If duplicates are reviewed by humans, use:

  • Strictness 4–6

⚠️ What to Avoid

❌ Don’t Give Weak Fields High Rank

Example:

  • Country = rank 8
    This can cause false positives because many records naturally share the same country.

❌ Don’t Set Every Field to Rank 10

If all fields are “decisive”, none of them truly are. Use high ranks sparingly.

❌ Don’t Use Low Strictness for High-Risk Actions

Lenient settings can surface many possible duplicates but should not be used for automatic merges.
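
The pitfalls above are easy to check mechanically before a job runs. The following Python sketch shows one way to do that. `check_config`, the `WEAK_FIELDS` set, and the numeric cut-offs are hypothetical helpers derived from the advice in this guide, not part of DeDuplica itself.

```python
# Hypothetical pre-flight check; not a DeDuplica API.
# Cut-offs follow this guide: weak fields belong at rank 1-4,
# and automatic merges call for strictness 8-10.

WEAK_FIELDS = {"country", "language", "status", "category"}


def check_config(ranks: dict, strictness: int, auto_merge: bool) -> list:
    """Return a list of warnings for risky rank/strictness choices."""
    warnings = []
    for field, rank in ranks.items():
        if field.lower() in WEAK_FIELDS and rank > 4:
            warnings.append(f"'{field}' is a weak field but has rank {rank}")
    if len(ranks) > 1 and all(rank == 10 for rank in ranks.values()):
        warnings.append("every field is rank 10, so no field stands out")
    if auto_merge and strictness < 8:
        warnings.append(
            f"strictness {strictness} is too lenient for automatic merges"
        )
    return warnings


risky = check_config({"email": 9, "country": 8}, strictness=4, auto_merge=True)
for message in risky:
    print(message)
```

For the sample configuration, the check reports two warnings: the over-ranked country field and the lenient strictness combined with automatic merging.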


🧪 Recommended Starting Templates

Person Deduplication

  • Email: rank 9–10
  • Phone: rank 8
  • Name: rank 6
  • Strictness: 6–8

Company Deduplication

  • Legal name: rank 7–8
  • Registration number: rank 9–10
  • Country: rank 2–3
  • Strictness: 8–10

Address Deduplication

  • Full address: rank 7–8
  • Postal code: rank 5
  • City: rank 3
  • Country: rank 2
  • Strictness: 7–9
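
If you keep settings in scripts or version control, the templates above can be written down as plain data. This Python sketch is one hypothetical representation: the `TEMPLATES` structure, the field keys, and the `starting_config` helper are assumptions for illustration, and picking the low end of each range is an arbitrary choice rather than a DeDuplica default.

```python
# Hypothetical representation of the templates in this guide.
# Each value is an inclusive (low, high) range taken from the lists above.

TEMPLATES = {
    "person": {
        "ranks": {"email": (9, 10), "phone": (8, 8), "name": (6, 6)},
        "strictness": (6, 8),
    },
    "company": {
        "ranks": {"legal_name": (7, 8), "registration_number": (9, 10),
                  "country": (2, 3)},
        "strictness": (8, 10),
    },
    "address": {
        "ranks": {"full_address": (7, 8), "postal_code": (5, 5),
                  "city": (3, 3), "country": (2, 2)},
        "strictness": (7, 9),
    },
}


def starting_config(kind: str) -> dict:
    """Build a concrete config from the low end of each range."""
    template = TEMPLATES[kind]
    return {
        "ranks": {field: low
                  for field, (low, _high) in template["ranks"].items()},
        "strictness": template["strictness"][0],
    }


print(starting_config("company"))
```

The result is a starting point only. Adjust the ranks and strictness after reviewing a sample of the matches they produce.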

✨ Why DeDuplica Is Powerful

  • Advanced probabilistic matching
  • Automatic threshold calibration
  • Flexible per-field importance
  • Safe defaults with expert-level control

All of this is delivered through simple sliders and dropdowns, so you can focus on your data, not algorithms.