Webhook-Driven Deduplication: Automating Data Cleanup Across Any System
Scheduled deduplication jobs are a significant improvement over manual one-off cleanups. But the next level of data quality operations is automated pipeline integration — where deduplication results automatically trigger downstream actions in your infrastructure without human intervention.
DeDuplica supports this through webhook notifications. For every duplicate pair found during a job run, DeDuplica sends a separate HTTP POST to a URL you configure — carrying the full record data for that specific match. Rather than a single “job finished” summary, your downstream systems receive one event per duplicate as the job processes them.
What the Webhook Carries
The webhook payload includes:
- Job and duplicate identifiers
- Timestamps (e.g. when the duplicate was found)
- Match probability
- Status
- Source record JSONs
- Merge output JSON
This gives downstream systems everything they need to act on each individual match — merge records in a CRM the moment a duplicate is found, write audit entries in real time, or trigger a downstream action scoped precisely to those two records.
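To make the payload concrete, here is a hedged sketch of what a delivery might look like, built from the fields listed above. The exact field names and structure in DeDuplica's payload may differ; everything below (names, values, record shapes) is illustrative.

```python
import json

# Hypothetical payload shape based on the documented fields; the real
# field names in DeDuplica's webhook body may differ.
payload = json.loads("""
{
  "job_id": "job-2024-0117",
  "duplicate_id": "dup-000451",
  "found_at": "2024-01-17T09:32:05Z",
  "match_probability": 0.94,
  "status": "pending_review",
  "source_records": [
    {"id": "A-1001", "name": "Jane Smyth", "email": "j.smyth@example.com"},
    {"id": "A-1002", "name": "Jane Smith", "email": "j.smyth@example.com"}
  ],
  "merge_output": {"id": "A-1001", "name": "Jane Smith", "email": "j.smyth@example.com"}
}
""")

# A consumer can act on each match individually, scoped to just these records.
print(payload["duplicate_id"], payload["match_probability"])
```

Because each delivery carries both source records and the merge output, the consumer never needs to call back into DeDuplica to decide what to do.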
Example Integration Patterns
Real-time CRM merge: As DeDuplica processes records, each identified duplicate pair fires a webhook. Your CRM integration receives the payload — including both source record JSONs and the merge output — and immediately merges or flags the records in Salesforce, HubSpot, or Dynamics 365. By the time the job finishes, the CRM is already clean.
Audit log per duplicate: Each webhook writes a row to a data quality audit table: which records were matched, the match probability, when it was found, and the job that found it. Over time this builds a complete history of every duplicate ever detected, useful for compliance reporting and data lineage.
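A minimal sketch of that audit consumer, using an in-memory SQLite table; the table name, columns, and payload field names are illustrative, not DeDuplica's schema.

```python
import sqlite3

# Illustrative audit table; in production this would be a durable database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dedup_audit (
        duplicate_id      TEXT,
        job_id            TEXT,
        match_probability REAL,
        found_at          TEXT
    )
""")

def record_audit(payload: dict) -> None:
    """Write one audit row per duplicate-found webhook delivery."""
    conn.execute(
        "INSERT INTO dedup_audit VALUES (?, ?, ?, ?)",
        (payload["duplicate_id"], payload["job_id"],
         payload["match_probability"], payload["found_at"]),
    )
    conn.commit()

record_audit({"duplicate_id": "dup-000451", "job_id": "job-2024-0117",
              "match_probability": 0.94, "found_at": "2024-01-17T09:32:05Z"})
```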
High-probability alerting: Your webhook consumer inspects the match probability field. Pairs above a certain confidence level immediately trigger a Slack or Teams message to the data engineering channel — so the team can review critical merges before they propagate downstream. Lower-confidence matches queue for batch review.
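The routing decision itself is a few lines. In this sketch the 0.95 threshold, the field name, and the alert/queue destinations are all assumptions; the actual Slack or Teams call would go where the comment indicates.

```python
# Illustrative confidence threshold; tune for your own data.
ALERT_THRESHOLD = 0.95

def route_match(payload: dict) -> str:
    """Send high-confidence pairs to chat, queue the rest for batch review."""
    if payload["match_probability"] >= ALERT_THRESHOLD:
        # e.g. POST a summary to a Slack/Teams incoming-webhook URL here
        return "alerted"
    # e.g. enqueue for the next batch-review pass here
    return "queued"

print(route_match({"match_probability": 0.97}))
print(route_match({"match_probability": 0.80}))
```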
Downstream record sync: When a duplicate is found in a staging database, the webhook triggers an update in the corresponding downstream system — correcting the same duplicate in the production copy without waiting for the next ETL cycle. The merge output JSON in the payload provides the already-resolved record to push.
Setting Up Webhooks in DeDuplica
Webhook configuration is per-subscription. In the system settings, you configure a webhook endpoint that individual jobs can then use. Because webhooks fire once per duplicate found, a job that surfaces 500 duplicate pairs will deliver 500 POST requests to your endpoint — design your consumer accordingly.
The endpoint needs to accept an HTTP POST with a JSON body. Any internet-accessible URL works — a Lambda function, an Azure Function, a Make scenario, a webhook relay like Hookdeck, or a simple endpoint in your own application.
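Whatever the hosting, the consumer logic is the same: parse the JSON body, acknowledge quickly, and do the heavy work asynchronously. A minimal sketch of that logic, independent of any particular platform (the field names are illustrative):

```python
import json

def handle_webhook(raw_body: bytes) -> int:
    """Parse one duplicate-found delivery and return an HTTP status code."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        # Malformed body: reject it so the sender's retry logic can kick in.
        return 400
    # ...hand payload off to merge / audit / alert work asynchronously...
    return 200  # acknowledge fast; don't block the delivery on slow work

print(handle_webhook(b'{"duplicate_id": "dup-1"}'))  # 200
```

Returning quickly matters because, as noted above, a single job can deliver hundreds of POSTs in a short window.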
For environments where DeDuplica’s cloud scheduler cannot reach an internal endpoint, the webhook can be configured to fire to an internet-facing relay that then forwards to your internal network.
Combining Webhooks With the Local Agent
For organisations processing sensitive data that must not leave their network, DeDuplica offers a local agent that runs on-premises. The agent handles all data processing inside the network perimeter; the DeDuplica cloud service handles scheduling and result management only.
Webhooks work with the local agent. The cloud service fires a webhook for each duplicate the agent finds, regardless of where the processing occurred. This means the same per-duplicate integration patterns (real-time CRM merges, audit log entries, alerting) apply whether processing happens in the cloud or on-premises.
A Note on Reliability
Webhooks are fire-and-forget in most implementations. DeDuplica includes retry logic for failed webhook deliveries (network timeout, non-2xx response) with exponential backoff. The webhook documentation covers the retry behaviour and how to inspect delivery logs.
Because a busy job delivers many webhooks in a short window, ensure your endpoint can handle the throughput without throttling. For critical downstream pipelines, build idempotent consumers — keyed on the duplicate identifier — so that a retry delivering the same payload twice doesn’t write a duplicate audit row or trigger a duplicate merge action.
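One way to sketch that idempotency, keyed on the duplicate identifier as suggested above. The in-memory set stands in for what should be a durable store (e.g. a database unique constraint) in production; the payload field name is an assumption.

```python
# In production, replace this set with a durable store such as a
# database table with a unique key on the duplicate identifier.
processed: set[str] = set()

def handle_once(payload: dict) -> bool:
    """Process a delivery exactly once; return False for a repeat delivery."""
    dup_id = payload["duplicate_id"]
    if dup_id in processed:
        return False  # retry of an already-handled payload: skip it
    processed.add(dup_id)
    # ...perform the merge / audit write exactly once here...
    return True

print(handle_once({"duplicate_id": "dup-000451"}))  # first delivery
print(handle_once({"duplicate_id": "dup-000451"}))  # retried delivery, skipped
```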
Webhook integration is available on the Standard plan and above. See the full feature comparison or start your free trial.