Find Duplicates
DeDuplica helps you identify duplicate records in your data using an intelligent, probability-based matching engine. While the underlying technology is advanced, configuring it is intentionally simple and intuitive.
This guide explains how field ranking and strictness work together, how to choose the right settings, and what to avoid.
How Duplicate Detection Works (Conceptually)
DeDuplica evaluates records by comparing selected fields and estimating how likely two records represent the same real-world entity.
Each comparison produces a match score between 0 and 1:
- 0 = definitely not the same
- 1 = definitely the same
A record pair is considered a duplicate if its score exceeds the threshold, which is determined automatically based on your settings.
You control this behavior using:
- Field Rank (importance of each field)
- Strictness (overall conservativeness of the matching)
Fields and Rank (1–10)
Each field you add can be assigned a rank from 1 to 10.
What Rank Means
Rank defines how strongly a field influences duplicate detection.
| Rank | Meaning | Typical Use |
|---|---|---|
| 1–2 | Very weak signal | Country, language, currency |
| 3–4 | Weak signal | City, region |
| 5–6 | Medium signal | Postal code, company type |
| 7–8 | Strong signal | Street address, phone number |
| 9–10 | Decisive signal | Email, national ID, registration number |
A Better Example (Realistic Scenario)
Imagine you are deduplicating customer records using:
| Field | Rank |
|---|---|
| Email | 9 |
| Phone number | 8 |
| Last name | 6 |
| City | 3 |
How DeDuplica interprets this:
- Matching email almost guarantees the same person
- Matching phone is strong but not absolute
- Matching last name helps, but is common
- Matching city alone means very little
Two records with the same city and last name will not be considered duplicates. Two records with the same email will almost certainly be considered duplicates.
This reflects how humans naturally reason about identity.
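To make this intuition concrete, here is a minimal sketch of one way rank-weighted evidence could be combined into a score. It is not DeDuplica's actual algorithm, which this guide does not describe. The cubic rank-to-strength mapping, the noisy-OR combination, and the names `field_strength` and `match_score` are all assumptions chosen for illustration.

```python
# Illustrative only: NOT DeDuplica's real engine. The cubic mapping and the
# noisy-OR combination below are assumptions made for this sketch.

RANKS = {"email": 9, "phone": 8, "last_name": 6, "city": 3}


def field_strength(rank: int) -> float:
    """Map a rank (1-10) to an assumed evidence strength in (0, 1]."""
    if not 1 <= rank <= 10:
        raise ValueError("rank must be between 1 and 10")
    # Cubing makes low ranks contribute very little and high ranks a lot.
    return (rank / 10) ** 3


def match_score(ranks: dict[str, int], agreeing: set[str]) -> float:
    """Combine the fields that agree into a single score between 0 and 1."""
    remaining_doubt = 1.0
    for field in agreeing:
        # Each agreeing field removes a share of the remaining doubt.
        remaining_doubt *= 1.0 - field_strength(ranks[field])
    return 1.0 - remaining_doubt


same_email = match_score(RANKS, {"email"})
same_name_and_city = match_score(RANKS, {"last_name", "city"})
print(f"same email:            {same_email:.2f}")
print(f"same last name + city: {same_name_and_city:.2f}")
```

Under these assumptions, a matching email alone scores about 0.73, while a matching last name and city together score about 0.24. A single strong field outweighs two weak ones, which is the behavior described above.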
Strictness (1–10)
Strictness controls how conservative the algorithm is overall.
It answers the question:
"How sure should the system be before calling something a duplicate?"
Strictness Levels
| Strictness | Behavior |
|---|---|
| 1–2 | Very lenient: finds many possible duplicates |
| 3–4 | Lenient: useful for exploration and review |
| 5–6 | Balanced: recommended starting point |
| 7–8 | Strict: high-confidence matches |
| 9–10 | Very strict: near-certain duplicates only |
What Strictness Actually Does
Strictness adjusts the decision threshold:
- Low strictness → lower threshold → more matches
- High strictness → higher threshold → fewer matches
With Strictness = 10, DeDuplica will only flag duplicates when strong fields agree convincingly.
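The relationship between strictness and the threshold can be sketched as follows. DeDuplica calibrates its threshold automatically and the real mapping is not documented here, so the linear formula, its coefficients, and the function names `threshold_for` and `is_duplicate` are assumptions for illustration.

```python
# Illustrative only: the real threshold calibration is automatic and is not
# documented in this guide. The linear mapping below is an assumption.


def threshold_for(strictness: int) -> float:
    """Map strictness (1-10) to an assumed decision threshold."""
    if not 1 <= strictness <= 10:
        raise ValueError("strictness must be between 1 and 10")
    return 0.20 + 0.07 * strictness


def is_duplicate(score: float, strictness: int) -> bool:
    """A pair is flagged when its score exceeds the threshold."""
    return score > threshold_for(strictness)


score = 0.70  # a hypothetical match score for one record pair
for strictness in (2, 6, 10):
    threshold = threshold_for(strictness)
    flagged = is_duplicate(score, strictness)
    print(f"strictness {strictness:>2}: threshold {threshold:.2f}, flagged={flagged}")
```

With this assumed mapping, the same pair is flagged at strictness 2 and 6 but not at strictness 10, because the threshold rises from 0.34 to 0.62 to 0.90 while the score stays at 0.70.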
How Rank and Strictness Work Together
Think of the process as two layers:
- Field Rank decides what evidence matters
- Strictness decides how much evidence is enough
Example: Business Deduplication
| Field | Rank |
|---|---|
| Company name | 7 |
| VAT number | 10 |
| Country | 2 |
| Strictness | 9 |
Results:
- Same country → ignored
- Similar company name → not enough
- Same VAT number → duplicate
- Same name + same country → likely duplicate
This mirrors real-world decision making.
Good Practices
Use High Ranks for Truly Identifying Fields
Good candidates for rank 7–10:
- Phone number
- Legal identifiers
- Registration numbers
Use Low Ranks for Common or Expected Values
Good candidates for rank 1–4:
- Country
- Language
- Status
- Category
Increase Strictness for Automation
If duplicates will be merged automatically, use:
- Strictness 8–10
Lower Strictness for Review Workflows
If duplicates are reviewed by humans, use:
- Strictness 4–6
What to Avoid
Don't Give Weak Fields High Rank
Example:
- Country = rank 8
This can cause false positives because many records naturally share the same country.
Don't Set Every Field to Rank 10
If all fields are "decisive", none of them truly are. Use high ranks sparingly.
Don't Use Low Strictness for High-Risk Actions
Lenient settings can surface many possible duplicates but should not be used for automatic merges.
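The three pitfalls above can be expressed as a simple settings check. This is a hypothetical sketch: the function `check_settings`, the `auto_merge` flag, the `WEAK_FIELDS` set, and the cut-off values are assumptions derived from the guidance in this guide, not part of DeDuplica itself.

```python
# Hypothetical configuration check; the field names, cut-offs, and the
# `auto_merge` flag are assumptions for this sketch, not DeDuplica settings.

# Fields this guide lists as weak or common signals.
WEAK_FIELDS = {"country", "language", "currency", "city", "region", "status", "category"}


def check_settings(ranks: dict[str, int], strictness: int, auto_merge: bool) -> list[str]:
    """Return one warning for each pitfall found in the given settings."""
    warnings = []
    for field, rank in ranks.items():
        # Pitfall 1: a weak field with a strong (7+) rank.
        if field.lower() in WEAK_FIELDS and rank >= 7:
            warnings.append(f"'{field}' is a weak signal but has rank {rank}")
    # Pitfall 2: every field marked as decisive.
    if len(ranks) > 1 and all(rank == 10 for rank in ranks.values()):
        warnings.append("every field has rank 10, so none stands out")
    # Pitfall 3: automatic merging below the recommended strictness of 8.
    if auto_merge and strictness < 8:
        warnings.append(f"strictness {strictness} is too low for automatic merges")
    return warnings


for message in check_settings({"email": 10, "country": 8}, strictness=4, auto_merge=True):
    print("warning:", message)
```

For the settings shown, the check reports two warnings: one for the high-ranked country field and one for using strictness 4 with automatic merging.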
Recommended Starting Templates
Person Deduplication
- Email: rank 9–10
- Phone: rank 8
- Name: rank 6
- Strictness: 6–8
Company Deduplication
- Legal name: rank 7–8
- Registration number: rank 9–10
- Country: rank 2–3
- Strictness: 8–10
Address Deduplication
- Full address: rank 7–8
- Postal code: rank 5
- City: rank 3
- Country: rank 2
- Strictness: 7–9
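If you keep your settings in a script or notes, the three templates can be written down as plain data. The sketch below is only one possible layout: the `TEMPLATES` dictionary, its key names, and the `starting_values` helper are hypothetical, and choosing the low end of each range is an arbitrary starting point to tune from.

```python
# The templates above expressed as data. Ranges are (low, high) tuples.
# The dictionary layout and the `starting_values` helper are hypothetical.

TEMPLATES = {
    "person": {
        "ranks": {"email": (9, 10), "phone": (8, 8), "name": (6, 6)},
        "strictness": (6, 8),
    },
    "company": {
        "ranks": {"legal_name": (7, 8), "registration_number": (9, 10), "country": (2, 3)},
        "strictness": (8, 10),
    },
    "address": {
        "ranks": {"full_address": (7, 8), "postal_code": (5, 5), "city": (3, 3), "country": (2, 2)},
        "strictness": (7, 9),
    },
}


def starting_values(template: str) -> dict:
    """Pick the low end of every range as a first setting to tune from."""
    spec = TEMPLATES[template]
    return {
        "ranks": {field: low for field, (low, _high) in spec["ranks"].items()},
        "strictness": spec["strictness"][0],
    }


print(starting_values("person"))
```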
Why DeDuplica Is Powerful
- Advanced probabilistic matching
- Automatic threshold calibration
- Flexible per-field importance
- Safe defaults with expert-level control
All of this is delivered through simple sliders and dropdowns, so you can focus on your data, not algorithms.