Find Duplicates
DeDuplica is an advanced yet easy-to-use duplicate detection platform designed to help you identify and manage duplicate records in large datasets. Behind a simple interface, DeDuplica applies probabilistic matching techniques that balance precision and recall without requiring technical expertise.
This document explains how field rank, overall strictness, and supporting configuration options work together, and provides best practices for achieving high‑quality results.
Core Concepts
Fields Used for Comparison
Each deduplication job compares records using selected fields (for example: name, email, address, phone).
Each field contributes evidence toward deciding whether two records represent the same real‑world entity.
Not all fields are equal: some are highly distinctive, while others are only weak indicators.
Field Rank (1–10)
Field Rank controls how important a specific field is during comparison.
| Rank | Meaning |
|---|---|
| 1–3 | Weak supporting signal |
| 4–6 | Medium importance |
| 7–8 | Strong distinguishing signal |
| 9–10 | Critical identifier |
Examples
Good high‑rank fields (7–10):
- Email address
- Phone number
- National ID
- Full street address, including house number
Good medium‑rank fields (4–6):
- Company name
- City
- Date of birth
Low‑rank fields (1–3):
- Country
- Gender
- Job title
- Account status
⚠️ Never rank non‑distinctive fields highly. Many unrelated records share the same value, so these fields create large matching groups and reduce precision.
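As a quick illustration, the rank bands from the table can be written as a small lookup. This is a sketch only: `RANK_BANDS`, `describe_rank`, and the field names are hypothetical and are not part of DeDuplica's interface.

```python
# Rank bands copied from the table above; names here are illustrative only.
RANK_BANDS = [
    (range(1, 4), "Weak supporting signal"),
    (range(4, 7), "Medium importance"),
    (range(7, 9), "Strong distinguishing signal"),
    (range(9, 11), "Critical identifier"),
]


def describe_rank(rank: int) -> str:
    """Return the meaning of a field rank between 1 and 10."""
    for band, meaning in RANK_BANDS:
        if rank in band:
            return meaning
    raise ValueError(f"rank must be between 1 and 10, got {rank}")


# Hypothetical rank assignment for a contact dataset.
field_ranks = {"email": 9, "phone": 8, "company_name": 5, "city": 4, "country": 2}

for field, rank in field_ranks.items():
    print(f"{field:<14} rank {rank:>2}  {describe_rank(rank)}")
```

Writing the ranks down this way makes it easy to check that only distinctive fields sit in the 7–10 bands.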
Overall Strictness (1–10)
Strictness controls how conservative the entire deduplication job is.
| Strictness | Behavior |
|---|---|
| 1–3 | Lenient – finds more potential matches |
| 4–6 | Balanced – recommended default |
| 7–8 | Conservative – fewer false positives |
| 9–10 | Ultra‑strict – near‑exact matches only |
Strictness affects:
- How much combined evidence is required
- How tolerant the system is of partial or fuzzy matches
- Whether agreement on weak fields can compensate for a mismatch on a strong field
Important Rule
Strictness does not override poor field choices.
If weak fields are ranked too high, increasing strictness alone will not fix precision issues.
How Rank and Strictness Work Together
- Field Rank controls where evidence comes from
- Strictness controls how much evidence is enough
Example
| Configuration | Outcome |
|---|---|
| Address rank 8 + Strictness 9 | Only same‑street matches |
| Address rank 8 + Strictness 3 | Similar streets may match |
| Country rank 8 | ❌ Poor configuration |
| Country rank 2 + Address rank 8 | ✅ Correct configuration |
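The interaction in the table can be sketched with a toy model. This is not DeDuplica's actual algorithm, which this document describes only as probabilistic matching. The sketch assumes a rank‑weighted average of per‑field similarities and a linear mapping from strictness to a required score; `match_score`, `required_score`, and all numbers are invented for illustration.

```python
def match_score(similarities: dict[str, float], ranks: dict[str, int]) -> float:
    """Rank-weighted average of per-field similarities (each between 0.0 and 1.0)."""
    total_weight = sum(ranks[field] for field in similarities)
    return sum(similarities[f] * ranks[f] for f in similarities) / total_weight


def required_score(strictness: int) -> float:
    """Assumed linear mapping: higher strictness demands more combined evidence."""
    return 0.5 + 0.05 * strictness


def is_match(similarities: dict[str, float], ranks: dict[str, int], strictness: int) -> bool:
    return match_score(similarities, ranks) >= required_score(strictness)


good_ranks = {"address": 8, "country": 2}
poor_ranks = {"address": 8, "country": 8}

# Same country, streets that look similar but are not identical.
similar_street = {"address": 0.8, "country": 1.0}
# Same country, clearly different addresses.
different_address = {"address": 0.4, "country": 1.0}

print(is_match(similar_street, good_ranks, strictness=3))     # lenient job accepts it
print(is_match(similar_street, good_ranks, strictness=9))     # ultra-strict job rejects it
print(is_match(different_address, good_ranks, strictness=3))  # weak country signal is not enough
print(is_match(different_address, poor_ranks, strictness=3))  # over-ranked country forces a match
```

In this toy model, ranking country at 8 lets a shared country outweigh a clear address mismatch, which is the poor configuration flagged in the table.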
Dataset Design Best Practices
Limit the Number of Fields
The maximum number of fields depends on your plan.
- Recommended: 5–10 fields
- Adding more fields usually reduces precision
- Too many weak fields dilute strong signals
Avoid Non‑Distinctive Fields
Fields like:
- Country
- Gender
- Job title
- Marital status
…should be used only as low‑rank supporting fields.
They do not improve uniqueness and may cause excessive false matches.
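One rough way to spot non‑distinctive fields before configuring a job is to measure how many distinct values each field has in a sample of your data. The sketch below is a generic heuristic, not a DeDuplica feature; the `distinctiveness` helper and the sample records are hypothetical.

```python
def distinctiveness(records: list[dict], field: str) -> float:
    """Share of non-empty values that are distinct (1.0 means every value is unique)."""
    values = [record[field] for record in records if record.get(field)]
    return len(set(values)) / len(values) if values else 0.0


# Tiny made-up sample; on real data, run this over thousands of rows.
records = [
    {"email": "ann@example.com", "country": "DK", "gender": "F"},
    {"email": "bob@example.com", "country": "DK", "gender": "M"},
    {"email": "cat@example.com", "country": "DK", "gender": "F"},
    {"email": "dan@example.com", "country": "SE", "gender": "M"},
]

for field in ("email", "country", "gender"):
    print(f"{field:<8} {distinctiveness(records, field):.2f}")
```

Fields whose score stays low on a realistic sample (country and gender here) are candidates for rank 1–3, or for leaving out entirely.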
Scaling for Large Datasets
For large data volumes, splitting deduplication jobs is highly recommended.
Example Strategy
Instead of one large job:
- Run one job per country
- Or per region
- Or per business unit
Benefits
- Smaller query sizes
- Faster execution
- Higher precision
Duplicates rarely span countries or regions, so splitting improves both performance and quality.
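To see why splitting helps, compare the number of candidate record pairs in one large job with the total across per‑country jobs. The sketch assumes, for illustration only, that every record in a job is compared with every other record; this document does not state how DeDuplica selects candidate pairs. `split_by`, `pair_count`, and the record counts are made up.

```python
from collections import defaultdict


def split_by(records: list[dict], key: str) -> dict[str, list[dict]]:
    """Group records into one list per distinct value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


# Made-up dataset: 1,000 records spread over four countries.
sizes = {"US": 400, "DE": 300, "FR": 200, "DK": 100}
records = [
    {"id": f"{country}-{i}", "country": country}
    for country, n in sizes.items()
    for i in range(n)
]

single_job = pair_count(len(records))
jobs = split_by(records, "country")
split_jobs = sum(pair_count(len(group)) for group in jobs.values())

print(f"one job:      {single_job} pairs")
print(f"per country:  {split_jobs} pairs")
```

Under this assumption the per‑country jobs examine roughly 30% of the pairs of the single job, and every skipped pair is a cross‑country pair that could otherwise become a false match.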
Common Configuration Mistakes
❌ Ranking weak fields too high
❌ Using too many fields
❌ Expecting strictness to fix poor field choice
❌ Running one massive job for global data
❌ Setting ultra‑strict mode (9–10) when the selected fields only match fuzzily
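Several of these mistakes can be caught with a simple pre‑flight check before a job is run. The sketch below is hypothetical: `lint_job`, `WEAK_FIELDS`, and the `fuzzy_fields` parameter are illustrative names, not DeDuplica settings, and the thresholds are taken from the recommendations in this document.

```python
# Non-distinctive fields named in this document (snake_case names are assumed).
WEAK_FIELDS = {"country", "gender", "job_title", "marital_status", "account_status"}


def lint_job(
    field_ranks: dict[str, int],
    strictness: int,
    fuzzy_fields: frozenset = frozenset(),
    max_fields: int = 10,
) -> list[str]:
    """Return a warning for each common configuration mistake that is detected."""
    warnings = []
    for field, rank in field_ranks.items():
        if field in WEAK_FIELDS and rank > 3:
            warnings.append(f"'{field}' is non-distinctive but ranked {rank}; use 1-3")
    if len(field_ranks) > max_fields:
        warnings.append(f"{len(field_ranks)} fields selected; recommended maximum is {max_fields}")
    if strictness >= 9 and any(field in field_ranks for field in fuzzy_fields):
        warnings.append("ultra-strict mode combined with fuzzy fields; expect few or no matches")
    return warnings


risky = lint_job(
    {"country": 8, "full_name": 6},
    strictness=10,
    fuzzy_fields=frozenset({"full_name"}),
)
for warning in risky:
    print("WARNING:", warning)

print(lint_job({"email": 9, "city": 4, "country": 2}, strictness=6))
```

The check covers over‑ranked weak fields, too many fields, and ultra‑strict mode with fuzzy fields. Whether a job is too large to run as a single job depends on record counts, so it is left out here.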
Recommended Starter Setup
| Setting | Recommendation |
|---|---|
| Number of fields | 5–7 |
| High‑rank fields (7–10) | 1–2 fields |
| Low‑rank fields (1–3) | 2–3 fields |
| Strictness | 6–8 |
Summary
DeDuplica is designed to be:
- Advanced under the hood
- Simple and safe to configure
- Scalable across data sizes
Focus on good field selection, reasonable ranks, and appropriate strictness, and the system will deliver reliable, explainable results.
For large datasets, splitting jobs is the key to both accuracy and performance.