Beyond Exact Matching: How Fuzzy Matching Finds Hidden Duplicates
The most common deduplication query looks something like this:
```sql
SELECT email, COUNT(*)
FROM contacts
GROUP BY email
HAVING COUNT(*) > 1;
```

It is a reasonable starting point. But it will catch perhaps 20% of your actual duplicates: the obvious ones where someone entered the same email address twice. The other 80% are hiding just out of reach.
Why Exact Matching Falls Short
Real-world data has noise. Names get abbreviated. Addresses are formatted differently by different systems. Companies change names. People move. Consider:
| Record A | Record B | Exact match? | Actually duplicate? |
|---|---|---|---|
| john.smith@acme.com | j.smith@acme.com | No | Probably |
| Acme Corporation Ltd | ACME Corp. Ltd | No | Almost certainly |
| 142 High Street, London | 142 High St London | No | Yes |
| María García | Maria Garcia | No | Likely |
An exact match on any of these fields returns zero results. A fuzzy comparison surfaces all of them with high similarity scores.
How Fuzzy Matching Works
Fuzzy matching algorithms compute a similarity score — a number between 0 and 1 indicating how similar two strings are. Different algorithms are better suited to different types of data. DeDuplica exposes the following comparators:
Levenshtein Comparator — measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into the other. “Smith” → “Smyth” = 1 edit = high similarity. Best for general text comparison, especially where typos and small differences are the main noise source.
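To make this concrete, here is a minimal pure-Python sketch of the edit-distance computation. The normalisation to a 0–1 score in `similarity` is one common convention (distance divided by the longer string's length), shown for illustration rather than as DeDuplica's internal formula:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (two-row DP)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalise edit distance to a 0..1 score (illustrative convention)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

With this, "Smith" vs "Smyth" is one edit apart and scores 0.8, matching the intuition above.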
Weighted Levenshtein Comparator — like Levenshtein but assigns different costs to different types of edits. Some changes are more significant than others — for instance, transposing two adjacent characters is a common typo that should cost less than inserting an entirely new word. Ideal for address comparison.
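The transposition idea can be sketched with the optimal-string-alignment variant of edit distance, where swapping two adjacent characters costs less than other edits. The specific costs below are illustrative defaults, not DeDuplica's:

```python
def weighted_edit_distance(a: str, b: str,
                           insert_cost: float = 1.0,
                           delete_cost: float = 1.0,
                           substitute_cost: float = 1.0,
                           transpose_cost: float = 0.5) -> float:
    """Edit distance where an adjacent-character swap (a common typo)
    is cheaper than other edits (optimal string alignment variant)."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * delete_cost
    for j in range(1, n + 1):
        d[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else substitute_cost
            d[i][j] = min(d[i - 1][j] + delete_cost,
                          d[i][j - 1] + insert_cost,
                          d[i - 1][j - 1] + cost)
            # adjacent transposition, e.g. "ie" vs "ei"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + transpose_cost)
    return d[m][n]
```

Plain Levenshtein treats "recieve" vs "receive" as two substitutions (distance 2); the weighted version recognises a single transposition and scores it 0.5.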
Jaro-Winkler Comparator — designed for short strings, giving more favourable scores to strings that match from the beginning. Well-suited for personal names, where first letters are more reliable than last (typos concentrate toward the end of words). “Jon” and “John” score very high.
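The algorithm itself is compact enough to sketch; this is a standard pure-Python rendering of Jaro-Winkler, shown for illustration (DeDuplica's implementation details are not assumed):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matching characters within a sliding window,
    penalised by transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among matched characters
    t, k = 0, 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score by a common-prefix bonus, capped at 4 chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

"Jon" vs "John" scores roughly 0.93, comfortably above what most strictness thresholds would require for a name field.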
QGram Comparator — breaks strings into overlapping substrings of length q and compares the resulting sets. Because it works on fragments rather than sequence, it handles cases where word order differs. “John Michael Smith” and “Smith, John M.” share most of their q-grams and score high even though the tokens are in a different order.
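A simple way to compare two q-gram sets is the Dice coefficient. This sketch assumes q = 2, lowercasing, and basic punctuation stripping; the exact cleaning DeDuplica applies depends on your configured cleaners:

```python
def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping length-q substrings of a lightly cleaned string."""
    s = "".join(ch for ch in s.lower() if ch.isalnum() or ch == " ")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over the two q-gram sets: 2|A∩B| / (|A|+|B|)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 1.0 if ga == gb else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Because the comparison is over fragments, "John Michael Smith" and "Smith, John M." still share most bigrams and score around 0.67 here, while an exact or Levenshtein comparison would rate them as very different.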
Person Name Comparator — a specialised comparator built specifically for personal names, accounting for common variations, nicknames, and misspellings (e.g., “Sam” matching “Samuel”). More accurate than general text comparators for name fields.
Metaphone Comparator — compares words based on their pronunciation rather than their spelling. “Thompson” and “Thomson” produce the same phonetic code and match even though they differ by one character. Useful for matching names or words that sound alike but are spelled differently.
Exact Comparator — checks whether two values are character-for-character identical (after any applied cleaners). Use for fields where only exact matches are valid — database IDs, product codes, unique identifiers.
Geoposition Comparator — compares latitude and longitude values to determine proximity. Use for location-based deduplication where two records should match if their coordinates are within a defined distance.
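Proximity comparisons of this kind are typically based on the haversine great-circle distance. A sketch, with an arbitrary 0.5 km match threshold chosen purely for illustration:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lon1: float,
                 lat2: float, lon2: float) -> float:
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def geo_match(p1: tuple, p2: tuple, max_km: float = 0.5) -> bool:
    """Treat two records as co-located if within max_km of each other."""
    return haversine_km(*p1, *p2) <= max_km
```

Two records geocoded to slightly different rooftop points of the same building fall well inside such a threshold, while records a few streets apart do not.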
No single comparator is best for all fields. A personal name field benefits from the Person Name or Jaro-Winkler Comparator. An address field benefits from Weighted Levenshtein. A field where word order may vary suits the QGram Comparator. An ID field should use the Exact Comparator only.
Combining Multiple Fields
The real power comes from combining signals across multiple fields. In DeDuplica you assign each field a Rank from 1 to 10 — how strongly that field should influence the duplicate decision:
| Field | Comparator | Rank |
|---|---|---|
| First name | Person Name Comparator | 6 |
| Last name | Person Name Comparator | 7 |
| Email | Exact Comparator | 9 |
| Phone | Levenshtein + PhoneNumberCleaner | 8 |
Fields with a high rank have a decisive influence; low-rank fields contribute supporting evidence but don’t drive the decision alone. An overall Strictness setting (1–10) then controls how confident the system must be before flagging a pair as a duplicate — high strictness means only near-certain matches are surfaced, low strictness surfaces more candidates for human review.
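DeDuplica's internal scoring is not documented here, so the following is only one plausible sketch of how per-field scores, Ranks, and Strictness could combine: a rank-weighted average of field similarities, compared against a threshold derived from Strictness. The threshold mapping is an assumption for illustration:

```python
def weighted_score(field_scores: dict, ranks: dict) -> float:
    """Rank-weighted average of per-field similarity scores (each 0..1)."""
    total = sum(ranks.values())
    return sum(field_scores[f] * ranks[f] for f in ranks) / total

def is_duplicate(field_scores: dict, ranks: dict, strictness: int) -> bool:
    """Hypothetical decision rule: map Strictness 1..10 onto a
    0.50..0.95 confidence threshold and compare."""
    threshold = 0.5 + (strictness - 1) * 0.05
    return weighted_score(field_scores, ranks) >= threshold

# Field scores for a candidate pair, using the ranks from the table above
scores = {"first": 0.9, "last": 1.0, "email": 1.0, "phone": 0.8}
ranks = {"first": 6, "last": 7, "email": 9, "phone": 8}
```

Under this sketch the pair scores about 0.93: flagged at a balanced Strictness of 6, but filtered out at a near-certain-only Strictness of 10, which is exactly the trade-off the setting is meant to expose.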
This approach handles the noisiest real-world data while keeping false positives manageable.
Configuring Matching Rules in DeDuplica
In DeDuplica, matching rules are configured per Source Definition within a job. For each field you choose the appropriate comparator — Levenshtein for general text, Weighted Levenshtein for addresses, Jaro-Winkler for short strings and names, QGram for order-independent text, Person Name for personal names, Metaphone for pronunciation-based matching, or Exact for strict equality — and assign it a Rank from 1 to 10. The job-level Strictness setting controls how conservative the overall decision is. The system processes your table and returns identified duplicate groups.
The Testing a Job feature is the right way to tune these settings. Run against a sample of your data, inspect the identified groups, and adjust: lower Strictness or raise the Rank on your most reliable fields if too few duplicates are appearing; raise Strictness or lower the Rank on noisy fields if too many false positives appear.
Practical Advice on Rank and Strictness
Start with Strictness 6 (balanced) and assign ranks based on how genuinely identifying each field is — email or national ID at 9–10, city or country at 2–3. Review the test results. If you’re getting too many false positives, raise Strictness by 1–2 or reduce the rank of fields that are generating noise. If known duplicates aren’t appearing, lower Strictness or raise the rank on your strongest fields.
For automated processing without human review, use Strictness 8–10 so only near-certain matches are acted on. For review workflows where a human approves each group, Strictness 4–6 surfaces a broader set of candidates without taking any automatic action.
Configure your first fuzzy matching rule in minutes. Start for free with no credit card required.