How to Configure Matching Rules That Actually Work

January 27, 2026

The quality of a deduplication run is determined almost entirely by the quality of the matching rules. A well-configured rule set finds real duplicates with a low false-positive rate. A poorly configured one either misses most duplicates (too strict) or flags thousands of non-duplicates for review (too loose).

This is not a theoretically complex problem, but it requires iteration against real data. Here’s a practical framework.

Start With the Identity Signal

Every entity has one or more fields that are the strongest signal of identity. For contacts: email address. For companies: company name + registration number. For products: SKU or barcode. For patients: national identifier or date-of-birth + surname combination.

Begin your matching rule with the strongest identity signal. If you’re deduplicating contacts, start with email address on exact match. Run the job. Review the results. These are your highest-confidence duplicates — records that share an identical email are almost certainly duplicates unless your database has multi-person accounts sharing email addresses (a known edge case for some B2C datasets).
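As a minimal sketch of this first pass (plain Python, independent of DeDuplica's own implementation), an exact-match rule on email reduces to grouping records by a lightly normalised key:

```python
from collections import defaultdict

def email_key(value):
    """Exact-match key: lowercase and trim, nothing fuzzier."""
    return value.strip().lower() if value else None

def exact_email_groups(records):
    """Group records that share an identical (normalised) email."""
    groups = defaultdict(list)
    for rec in records:
        key = email_key(rec.get("email"))
        if key:  # records with no email can't match on this signal
            groups[key].append(rec["id"])
    # only keys shared by 2+ records are duplicate candidates
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

contacts = [
    {"id": 1, "email": "Ada@Example.com "},
    {"id": 2, "email": "ada@example.com"},
    {"id": 3, "email": "bob@example.com"},
]
print(exact_email_groups(contacts))  # {'ada@example.com': [1, 2]}
```

Reviewing the output of exactly this one rule, before adding anything fuzzier, gives you a baseline of high-confidence duplicates to calibrate against.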

Layer Secondary Signals

After your primary identifier, add secondary signals that confirm or qualify the match. For contacts deduplicating on email, add first name and last name as supporting fields with a low rank (2–4). Any pair of records sharing the same email but with completely different names is suspicious — these might be shared mailboxes, not duplicates. The secondary fields nudge the overall match score down in those edge cases.

For records where the primary identifier is noisy or missing — company name is a good example — start with that as your highest-ranked field and use address, phone number, or website domain as lower-ranked supporting signals.

A typical company-matching rule set might be:

Field          | Comparator                       | Rank
Company Name   | Jaro-Winkler (short text)        | 7
City           | Exact Comparator                 | 3
Phone          | Levenshtein + PhoneNumberCleaner | 8
Website Domain | Exact Comparator                 | 6

DeDuplica uses the per-field ranks to estimate how likely two records represent the same entity. The Strictness setting then determines how confident the system must be before flagging a pair as a duplicate.
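DeDuplica's internal scoring formula isn't spelled out here, but a useful mental model — sketched below with an illustrative threshold mapping, not the product's actual maths — is a rank-weighted average of per-field similarities compared against a threshold derived from Strictness:

```python
def rank_weighted_score(similarities, ranks):
    """Weighted average of per-field similarity scores (0..1);
    higher-ranked fields pull the result harder."""
    total = sum(ranks.values())
    return sum(similarities[f] * r for f, r in ranks.items()) / total

def is_duplicate(similarities, ranks, strictness):
    # Illustrative mapping: strictness 1..10 -> threshold 0.50..0.95
    threshold = 0.5 + (strictness - 1) * 0.05
    return rank_weighted_score(similarities, ranks) >= threshold

# The company-matching rule set from the table above
ranks = {"company_name": 7, "city": 3, "phone": 8, "domain": 6}
sims  = {"company_name": 0.92, "city": 1.0, "phone": 1.0, "domain": 1.0}
print(is_duplicate(sims, ranks, strictness=6))  # True
```

The point of the sketch: a near-miss on a rank-7 field barely dents a pair that agrees exactly on the rank-8 and rank-6 fields, which is exactly the behaviour you want from supporting signals.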

The False Positive Problem

False positives — records flagged as duplicates that aren’t — are the single biggest operational problem in deduplication. They consume review time, erode trust in the system, and in the worst case lead to merged records that should have remained separate.

The main causes:

Common names in small tables — “John Smith” in a table of 500 contacts will match many other “John Smith” entries even if they are different people. Add a stronger secondary signal (email, phone, address) before including name as a primary field.

Normalisation mismatch — “Ltd” and “Limited” are the same. “Street” and “St.” are the same. If normalisation isn’t applied before comparison, the comparator has to bridge the gap itself, which can produce inconsistent scores. Use a cleaner (RegexpCleaner, LowerCaseNormalizeCleaner) to normalise values before comparison wherever the normalisation is deterministic.
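A minimal sketch of this kind of deterministic pre-comparison normalisation — the substitution list here is illustrative; build yours from the variants that actually occur in your data:

```python
import re

# Deterministic substitutions applied before any comparator runs.
SUBSTITUTIONS = [
    (re.compile(r"\blimited\b"), "ltd"),
    (re.compile(r"\bstreet\b"), "st"),
    (re.compile(r"\bst\b\.?"), "st"),  # also drops the trailing dot in "St."
]

def normalise(value):
    text = value.strip().lower()
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s+", " ", text)  # collapse internal whitespace

print(normalise("Acme LIMITED"))     # acme ltd
print(normalise("12 Baker Street"))  # 12 baker st
print(normalise("12 Baker St."))     # 12 baker st
```

After this pass, "12 Baker Street" and "12 Baker St." are byte-identical, so even an Exact Comparator handles them; the fuzzy comparator's score budget is saved for genuine typos.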

Shared values — some fields are duplicated by design. Multiple employees from the same company share a company name and address. Don’t deduplicate contacts using only company name + city, or you will propose merging everyone at the same firm.

Use Testing to Calibrate Rank and Strictness

DeDuplica’s Testing a Job feature runs your matching rules against a defined sample set and returns the identified groups without modifying any records. This is the right way to tune a configuration before letting it loose on production data.

The two controls to tune are:

  • Field Rank (1–10) — how strongly each field influences the decision. Email or national ID warrants a rank of 9–10. City alone warrants 2–3.
  • Strictness (1–10) — how conservative the algorithm is overall. High strictness (8–10) means only near-certain matches are flagged; low strictness (3–5) surfaces more candidates for human review.

The workflow:

  1. Assign ranks to your fields based on how identifying each one genuinely is. Start with a Strictness of 6 (balanced).
  2. Run a test job against a sample of your data.
  3. Inspect the identified groups. What proportion are genuine duplicates?
  4. Too many false positives? Raise Strictness by 1–2, or lower the rank on fields that are generating noise.
  5. Too few matches (known duplicates not appearing)? Lower Strictness, or raise the rank on your most reliable fields.

Three to five iterations typically yield a configuration that performs acceptably in production.

Field Preparation Matters

The quality of matching depends heavily on how fields are prepared before comparison. DeDuplica provides field-level cleaners that transform values before any comparator runs:

  • LowerCaseNormalizeCleaner — lowercases all letters, trims whitespace, normalises internal whitespace, and removes accents (é → e). The most widely used cleaner; eliminates case and accent variation before any comparator runs.
  • TrimCleaner — trims leading and trailing whitespace only. Useful when you want to preserve internal casing but fix padding issues.
  • PhoneNumberCleaner — normalises phone numbers to a standard format (+CC number). Apply this before a Levenshtein comparison on phone fields so the comparator works with digits in a consistent format, not raw strings.
  • RegexpCleaner — applies a regular expression to extract or discard a specific part of a field value. Useful for stripping company suffixes (“Ltd”, “Inc”, “LLC”) before a Jaro-Winkler comparison on company names, so the comparator focuses on the distinctive part of the name.
  • DigitsOnlyCleaner — strips everything that is not a digit. A simpler alternative to PhoneNumberCleaner when international normalisation isn’t needed — useful for zip codes, account numbers, or any field where only the digits matter.
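Several of the cleaners above can be approximated in a few lines of plain Python. These sketches mirror the behaviour described, not DeDuplica's actual implementations:

```python
import re
import unicodedata

def lower_case_normalize(value):
    """Sketch of LowerCaseNormalizeCleaner: lowercase, trim,
    collapse internal whitespace, strip accents (é -> e)."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text.strip().lower())

def digits_only(value):
    """Sketch of DigitsOnlyCleaner: keep digits, drop everything else."""
    return re.sub(r"\D", "", value)

def strip_company_suffix(value):
    """Sketch of one RegexpCleaner use: drop a trailing legal suffix."""
    return re.sub(r"\s+(ltd|inc|llc)\.?$", "", value, flags=re.IGNORECASE)

print(lower_case_normalize("  Café   MÜLLER "))  # cafe muller
print(digits_only("+44 (0)20 7946-0018"))        # 4402079460018
print(strip_company_suffix("Acme Widgets Ltd"))  # Acme Widgets
```

Note how each cleaner collapses a whole class of surface variation into one canonical form, so the comparator only has to handle the variation that actually signals a different entity.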

Spending an hour on field preparation often has more impact on result quality than an hour adjusting Rank and Strictness settings.

Document Your Rules

Once you have a working rule set, document what you configured and why. Matching rules are shared team knowledge. The person who configured the rule in January may not be the person who debugs a false positive in October. A comment block in configuration documentation explaining “we exclude records with is_test = true” or “we normalise phone numbers to E.164 before comparison” saves significant diagnostic time.


Ready to build your first matching rule? The source definition guide covers all configuration options. Start your free trial.