Set It and Forget It: Scheduling Deduplication to Keep Data Clean Continuously
Ask a data team how they handle duplicate records and you will often hear a version of the same answer: “We do a cleanup periodically.” When pressed on the frequency: “Maybe twice a year. We should do it more often.”
This is not a discipline problem. It is a tooling problem. When running a deduplication job requires navigating a complex tool, preparing a manual export, or persuading a database administrator to run a custom script, twice a year begins to sound reasonable. When it requires clicking a button in a web interface, once a week becomes trivially achievable.
Deduplication should be scheduled so that it doesn’t require human initiation. Here’s how to think about scheduling frequency and configuration.
How Fast Does Data Degrade?
The rate of data quality degradation depends on how much new data enters the system. A rough rule: in organisations with active CRM usage, regular contact imports, and live system integrations, the duplicate rate climbs by roughly 1–3 percentage points per quarter if no deduplication is running.
A contact database with 200,000 records will accrue roughly 2,000–6,000 new duplicates per quarter under normal usage. That is 8,000–24,000 new duplicates per year, compounding.
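The arithmetic above can be checked with a quick sketch. The 1–3% quarterly rate is the rule of thumb from this article, and the 200,000-record database is the running example:

```python
def quarterly_duplicates(total_records, rate_low=0.01, rate_high=0.03):
    """Return the (low, high) range of new duplicates accrued per quarter,
    using the article's rule of thumb of 1-3 percentage points per quarter."""
    return (int(total_records * rate_low), int(total_records * rate_high))

records = 200_000
low, high = quarterly_duplicates(records)
print(f"Per quarter: {low:,}-{high:,}")      # 2,000-6,000
print(f"Per year:    {low * 4:,}-{high * 4:,}")  # 8,000-24,000
```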
For an organisation that runs a cleanup once a year, the data spends most of its life carrying up to a year's worth of accumulated degradation on top of its last clean state. For an organisation running weekly deduplication, each run processes only a small fraction of those records, and data quality stays near its post-run state continuously.
Choosing a Schedule
Weekly is a sensible default for most CRM and operational databases. A weekly run catches new duplicates before they become embedded in workflows, before they appear in quarterly reports, and before they proliferate through integrations.
Daily is appropriate for high-volume ingestion environments — data warehouses receiving daily batch loads, marketing databases receiving daily campaign imports, or any context where a data quality incident causes immediate downstream effects (real-time dashboards, automated marketing, customer-facing views).
Monthly is a reasonable starting point when the data source is relatively stable — reference data tables, product catalogues, supplier databases that change slowly. It is also appropriate when the processing budget (rows per job, duplicate resolutions per month) is a binding constraint.
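The guidance above can be condensed into a rough decision helper. The volume threshold below is an illustrative assumption for this sketch, not a product default:

```python
def suggest_schedule(daily_new_rows: int,
                     realtime_downstream: bool = False,
                     stable_reference_data: bool = False) -> str:
    """Map the schedule guidance to a suggestion.

    The 10,000-rows/day cutoff is illustrative, not a product default.
    """
    if realtime_downstream or daily_new_rows >= 10_000:
        return "daily"    # high-volume ingestion or immediate downstream impact
    if stable_reference_data:
        return "monthly"  # slow-changing reference/catalogue data
    return "weekly"       # sensible default for CRM and operational databases

print(suggest_schedule(500))                               # weekly
print(suggest_schedule(50_000))                            # daily
print(suggest_schedule(100, stable_reference_data=True))   # monthly
```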
Setting Up Schedules in DeDuplica
Job scheduling in DeDuplica works at two levels:
- Job schedule — each job can have a configured schedule: hourly, daily, weekly, or monthly, at a specified time.
- Agent scheduling — if you’re using a local agent, the agent’s availability window and processing budget affect what jobs can run and how often.
The job scheduling documentation walks through schedule configuration in detail. Available scheduling frequency depends on your plan: the Free plan supports once-per-day runs only, while Standard and above unlock more frequent scheduled execution.
Monitoring Scheduled Runs
Automated jobs need monitoring. If a scheduled job fails — database connection error, query timeout, authentication expiry — you need to know. DeDuplica provides:
- Run history with status, duration, and record counts
- Error logging with diagnostic detail when a job fails
- Webhook notifications that can trigger alerts in your existing monitoring tools (PagerDuty, Slack, Teams, or whatever your team uses)
The system settings guide covers how to configure webhook alerts for run failures.
Scheduling is available from the Standard plan. See all plans or start your free trial.