Find Duplicates

DeDuplica is an advanced yet easy-to-use duplicate detection platform designed to help you identify and manage duplicate records in large datasets. Behind a simple interface, DeDuplica applies probabilistic matching techniques that balance precision and recall without requiring technical expertise.

This document explains how field rank, overall strictness, and supporting configuration options work together, and provides best practices for achieving high‑quality results.

Core Concepts

Fields Used for Comparison

Each deduplication job compares records using selected fields (for example: name, email, address, phone).
Each field contributes evidence toward deciding whether two records represent the same real‑world entity.

Not all fields are equal — some are highly distinctive, others are only weak indicators.

Field Rank (1–10)

Field Rank controls how important a specific field is during comparison.

Rank    Meaning
1–3     Weak supporting signal
4–6     Medium importance
7–8     Strong distinguishing signal
9–10    Critical identifier
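One way to picture field rank is as a weight on each field's match evidence: high-rank fields dominate the combined score, low-rank fields barely move it. The sketch below is a minimal illustration of that idea; the linear weighting, the field names, and the 0.0–1.0 match scores are assumptions for this example, not DeDuplica's actual internals.

```python
def weighted_evidence(field_matches, field_ranks):
    """Combine per-field match scores (0.0-1.0) using rank as weight.

    field_matches: {field: similarity score between the two records}
    field_ranks:   {field: rank 1-10}
    Returns a normalized evidence score between 0.0 and 1.0.
    """
    total_weight = sum(field_ranks.values())
    score = sum(field_matches[f] * field_ranks[f] for f in field_ranks)
    return score / total_weight

# Two records agree on email (rank 9) but differ on job title (rank 2):
matches = {"email": 1.0, "job_title": 0.0}
ranks = {"email": 9, "job_title": 2}
print(round(weighted_evidence(matches, ranks), 3))  # → 0.818
```

Note how the high-rank email match keeps the score high even though the low-rank field disagrees; with the ranks reversed, the same data would score low.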

Examples

Good high‑rank fields (7–10):

  • Email address
  • Phone number
  • National ID
  • Full street address + number

Good medium‑rank fields (4–6):

  • Company name
  • City
  • Date of birth

Low‑rank fields (1–3):

  • Country
  • Gender
  • Job title
  • Account status

⚠️ Non‑distinctive fields should never be ranked highly — they create large matching groups and reduce precision.

Overall Strictness (1–10)

Strictness controls how conservative the entire deduplication job is.

Strictness    Behavior
1–3           Lenient – finds more potential matches
4–6           Balanced – recommended default
7–8           Conservative – fewer false positives
9–10          Ultra‑strict – near‑exact matches only

Strictness affects:

  • How much combined evidence is required
  • How tolerant the system is of partial or fuzzy matches
  • Whether weak fields can compensate for strong ones
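Conceptually, strictness raises or lowers the bar that the combined evidence must clear. The mapping below is a hypothetical linear one, used only to make the idea concrete; DeDuplica's real thresholding is not documented here.

```python
def match_threshold(strictness):
    """Map strictness (1-10) to a minimum evidence score (0.0-1.0).

    The linear mapping from 0.50 (lenient) to 0.95 (ultra-strict)
    is an assumption for illustration only.
    """
    if not 1 <= strictness <= 10:
        raise ValueError("strictness must be between 1 and 10")
    return 0.5 + 0.05 * (strictness - 1)

print(match_threshold(2))   # lenient: → 0.55
print(match_threshold(10))  # ultra-strict: → 0.95
```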

Important Rule

Strictness does not override poor field choices.
If weak fields are ranked too high, increasing strictness alone will not fix precision issues.

How Rank and Strictness Work Together

  • Field Rank controls where evidence comes from
  • Strictness controls how much evidence is enough

Example

Configuration                         Outcome
Address rank 8 + Strictness 9         Only same‑street matches
Address rank 8 + Strictness 3         Similar streets may match
Country rank 8                        ❌ Poor configuration
Country rank 2 + Address rank 8       ✅ Correct configuration
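The poor and correct configurations above can be played out in a small sketch. The weighted-evidence combination and the linear strictness-to-threshold mapping are illustrative assumptions, but they show why a high-rank country field creates false positives that strictness alone cannot fix.

```python
def evidence(field_matches, field_ranks):
    """Rank-weighted evidence score, normalized to 0.0-1.0 (assumed model)."""
    total = sum(field_ranks.values())
    return sum(field_matches[f] * field_ranks[f] for f in field_ranks) / total

def is_duplicate(field_matches, field_ranks, strictness):
    """Flag a pair as duplicate when evidence clears the strictness bar."""
    threshold = 0.5 + 0.05 * (strictness - 1)  # assumed linear mapping
    return evidence(field_matches, field_ranks) >= threshold

# Two unrelated people in the same country, different addresses:
same_country_only = {"country": 1.0, "address": 0.0}

# Poor configuration: country ranked 8, address ranked 2.
print(is_duplicate(same_country_only, {"country": 8, "address": 2}, 3))   # → True (false positive)

# Correct configuration: country rank 2, address rank 8.
print(is_duplicate(same_country_only, {"country": 2, "address": 8}, 3))   # → False
```

With the poor ranking, a mere shared country contributes 80% of the possible evidence, so even moderate strictness flags the pair; swapping the ranks makes the same pair fall well below the bar.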

Dataset Design Best Practices

Limit the Number of Fields

The maximum number of fields depends on your plan.

  • Recommended: 5–10 fields
  • Adding more fields usually reduces precision
  • Too many weak fields dilute strong signals

Avoid Non‑Distinctive Fields

Fields like:

  • Country
  • Gender
  • Job title
  • Marital status

…should be used only as low‑rank supporting fields.

They do not improve uniqueness and may cause excessive false matches.

Scaling for Large Datasets

For large data volumes, splitting deduplication jobs is highly recommended.

Example Strategy

Instead of one large job:

  • Run one job per country
  • Or per region
  • Or per business unit

Benefits

  • Smaller query sizes
  • Faster execution
  • Higher precision

Most duplicates do not exist across countries or regions — splitting improves both performance and quality.
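The splitting strategy above amounts to partitioning records by a field before running each job. A minimal sketch of that partitioning step (the record shape and field names are illustrative):

```python
from collections import defaultdict

def split_by(records, key):
    """Partition records into separate job batches by a field value."""
    jobs = defaultdict(list)
    for record in records:
        jobs[record.get(key, "unknown")].append(record)
    return dict(jobs)

records = [
    {"name": "Ana",  "country": "PT"},
    {"name": "Anna", "country": "PT"},
    {"name": "Hans", "country": "DE"},
]
jobs = split_by(records, "country")
print(sorted(jobs))     # → ['DE', 'PT']
print(len(jobs["PT"]))  # → 2
```

Each resulting batch can then be submitted as its own deduplication job, keeping comparisons within a group where duplicates are actually plausible.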


Common Configuration Mistakes

❌ Ranking weak fields too high
❌ Using too many fields
❌ Expecting strictness to fix poor field choice
❌ Running one massive job for global data
❌ Setting ultra‑strict mode with fuzzy fields

Recommended Starter Setup

Setting         Recommendation
Fields          5–7
Strong ranks    1–2 fields
Weak ranks      2–3 fields
Strictness      6–8
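Expressed as data, the starter setup above might look like the sketch below. The field names, the dictionary shape, and the keys are all hypothetical and exist only to make the recommendation concrete; they are not DeDuplica's configuration format.

```python
# Hypothetical starter configuration matching the recommendations:
# 6 fields total, 2 strong identifiers, 2 weak supporting fields,
# strictness in the conservative-but-balanced 6-8 band.
starter_config = {
    "fields": {
        "email":   {"rank": 9},  # critical identifier
        "phone":   {"rank": 8},  # strong distinguishing signal
        "name":    {"rank": 5},  # medium importance
        "city":    {"rank": 4},  # medium importance
        "country": {"rank": 2},  # weak supporting field
        "gender":  {"rank": 1},  # weak supporting field
    },
    "strictness": 7,
}

print(len(starter_config["fields"]))  # → 6
```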

Summary

DeDuplica is designed to be:

  • Advanced under the hood
  • Simple and safe to configure
  • Scalable across data sizes

Focus on good field selection, reasonable ranks, and appropriate strictness — and the system will deliver reliable, explainable results.

For large datasets, splitting jobs is the key to both accuracy and performance.


DeDuplica — Advanced duplicate detection, simplified.