Safe DLQ replay checklist
A practical runbook for replaying dead-letter messages without corrupting data or melting dependencies - with SQS/SNS and Kafka appendices.
A DLQ is not a retry button. It’s a collection of messages your system already proved it can’t safely process under current conditions.
Replaying without guardrails is how you turn an incident into:
- data corruption
- duplicate side effects
- dependency meltdown
- a second incident you caused yourself
This is a practical runbook you can copy into ops docs.
TL;DR one-page checklist
0) Preconditions
- Fix the root cause (or replay will deterministically fail again)
- Write a replay plan (scope, rate, validation, rollback)
- Assign one driver + one approver + a comms channel
1) Stop the bleeding
- Pause the failing consumer or gate it into “quarantine mode”
- Confirm DLQ growth has stopped (or is understood and bounded)
- Confirm dependency health + capacity (DB, APIs, downstream services)
2) Classify what’s in the DLQ
- Sample 20–50 messages across the time window
- Identify dominant failure class(es): transient, throttling, permanent, poison pill, contract drift
- Decide whether ordering matters (per key / per partition)
3) Choose replay strategy
- Default to windowed replay (small batch you can validate)
- Use selective replay if only some event types/keys are safe
- Quarantine poison pills automatically (don’t let one stall the batch)
4) Prove safety (non-negotiable)
- Ensure idempotency/dedup exists and survives restarts
- Ensure side effects are idempotent or gated during replay (email/billing/webhooks)
- Confirm schema/version compatibility
- Label replay in logs/metrics/traces (replay_run_id)
5) Execute with guardrails
- Start low (1–5% of normal throughput), ramp with checkpoints (10/25/50/100%)
- Enforce caps: concurrency + rate + time budgets
- Monitor: error rate by class, dependency saturation, DLQ re-entry rate
6) Validate and close
- Run domain correctness checks (invariants, reconciliations)
- Classify residual DLQ (irrecoverable vs manual remediation vs later fix)
- File post-incident improvements (tests, alerts, runbooks, tooling)
The default replay pattern I recommend
Windowed replay
Pick a small time window from the DLQ (or a fixed count), replay it slowly, validate correctness, then expand.
Why:
- lets you detect “we’re duplicating side effects” before you duplicate a lot
- protects dependencies from a sudden wall of replay traffic
- gives you clean rollback boundaries
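A minimal sketch of that loop, assuming hypothetical fetch_batch, replay_one, and validate_window hooks you would wire up to your broker and your domain checks:

```python
import time
import uuid

def windowed_replay(fetch_batch, replay_one, validate_window,
                    batch_size=50, pause_seconds=60):
    """Replay the DLQ in small, validated windows.

    fetch_batch(n)    -> up to n DLQ messages (hypothetical broker interface)
    replay_one(msg)   -> re-submits one message to the source; raises on failure
    validate_window() -> True if correctness checks pass for the window just replayed
    """
    replay_run_id = str(uuid.uuid4())           # label everything with this id
    while True:
        window = fetch_batch(batch_size)
        if not window:
            break                               # DLQ drained
        failed = []
        for msg in window:
            try:
                replay_one(msg)
            except Exception as exc:            # collect for quarantine; don't stall the window
                failed.append((msg, exc))
        if not validate_window():
            raise RuntimeError(
                f"replay_run_id={replay_run_id}: validation failed, stop and investigate")
        print(f"replay_run_id={replay_run_id}: {len(window) - len(failed)} replayed, "
              f"{len(failed)} sent to quarantine")
        time.sleep(pause_seconds)               # hold before the next window
```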
Detailed runbook
Phase 1 - Triage and containment
- Pause consumer / disable redrive loops
- Capture samples with error reasons + metadata (event type, key/tenant, timestamp)
- Confirm dependencies are stable and ready for replay load
- Define stop conditions (error thresholds, saturation thresholds)
Phase 2 - Failure-mode classification
Common buckets:
- Transient dependency failures (timeouts, resets)
- Throttling/overload (429s, queue full, rate-limits)
- Permanent errors (validation, invariants)
- Poison pills (bad payloads, unhandled edge cases)
- Contract drift (schema/version mismatch)
Key questions:
- Can you deterministically reproduce the failure for a single message?
- Is the fix backward compatible with old messages?
- Is ordering required for correctness?
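Before deciding anything, it helps to turn raw error metadata into these buckets programmatically. A rough heuristic sketch, assuming your DLQ records carry an error string and (optionally) an HTTP status:

```python
import re

def classify_failure(error_text, status_code=None):
    """Map a DLQ error record onto the failure buckets above (heuristics only)."""
    text = (error_text or "").lower()
    if status_code == 429 or "throttl" in text or "rate limit" in text:
        return "throttling"
    if (status_code and 500 <= status_code < 600) or "timeout" in text or "connection reset" in text:
        return "transient"
    if "schema" in text or "unknown field" in text or "unsupported version" in text:
        return "contract_drift"
    if re.search(r"validation|invariant|constraint", text):
        return "permanent"
    return "poison_pill"   # unexplained failures get quarantined, not blindly retried
```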
Phase 3 - Safety rails (idempotency + side effects)
Idempotency
- Identify the idempotency key (message id / business id / (tenant, entity, version))
- Ensure processing is “exactly once for that key” via:
- unique constraints + upsert semantics, or
- idempotency store, or
- dedup cache with persistence
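A minimal sketch of the unique-constraint approach, using SQLite purely for illustration; the table and key shape are assumptions you would replace with your real store:

```python
import sqlite3

conn = sqlite3.connect("processed.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS processed_messages (
        idempotency_key TEXT PRIMARY KEY,   -- e.g. "tenant:entity:version"
        processed_at    TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def process_once(idempotency_key, handler):
    """Run handler() at most once per key; return False if it was a duplicate.

    In production, insert the key and apply the business write in the same
    transaction so a crash cannot mark work as done that never happened.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_messages (idempotency_key) VALUES (?)",
        (idempotency_key,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return False            # already processed: safe to skip on replay
    handler()
    return True
```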
Side effects
- Enumerate side effects per message type (emails, billing, webhooks, external writes)
- Ensure they are idempotent, suppressed during replay, or routed to a sandbox
- Add a replay mode flag if needed (“don’t notify on replay”)
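One way to gate side effects is a replay flag checked at the dispatch point. A sketch, assuming a hypothetical per-message-type registry of (name, callable) effects:

```python
import os

# Set by the replay tooling, never in normal production config.
REPLAY_MODE = os.environ.get("REPLAY_MODE", "false").lower() == "true"
SUPPRESS_ON_REPLAY = {"email", "webhook"}      # user-visible effects to gate

def dispatch_side_effects(message, side_effects):
    """side_effects: list of (name, callable) pairs registered for this message type."""
    for name, effect in side_effects:
        if REPLAY_MODE and name in SUPPRESS_ON_REPLAY:
            # Suppressed during replay; log for audit instead of notifying users twice.
            print(f"[replay] suppressed {name} for message {message.get('id')}")
            continue
        effect(message)
```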
Phase 4 - Observability + correctness checks
Minimum observability:
- replay_run_id label for all logs/metrics/traces
- attempt_count + failure_class metrics
- dashboards for: throughput, error rate (by class), dependency latency/errors, DLQ drain/re-entry
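A small sketch of the labeling using the standard logging module; a metrics client would carry the same replay_run_id tag:

```python
import logging
import uuid

REPLAY_RUN_ID = str(uuid.uuid4())

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s replay_run_id=%(replay_run_id)s %(message)s"))

logger = logging.getLogger("replay")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every record emitted through this adapter carries the run id, so replay
# traffic can be filtered apart from live production traffic.
log = logging.LoggerAdapter(logger, {"replay_run_id": REPLAY_RUN_ID})
log.info("replayed message attempt_count=%d failure_class=%s", 2, "transient")
```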
Correctness checks (choose a few):
- record-count invariants
- uniqueness invariants (no duplicates by business key)
- reconciliation checks (balances, totals, state transitions)
- audit trail checks
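A sketch of a uniqueness check, assuming a relational store and a hypothetical orders table keyed by order_id; substitute whatever invariant your domain actually guarantees:

```python
import sqlite3

def assert_no_duplicates(conn, table="orders", business_key="order_id"):
    """Fail loudly if replay introduced duplicate rows for the same business key."""
    (dup_count,) = conn.execute(
        f"SELECT COUNT(*) FROM "
        f"(SELECT {business_key} FROM {table} GROUP BY {business_key} HAVING COUNT(*) > 1)"
    ).fetchone()
    if dup_count:
        raise AssertionError(f"{dup_count} business keys have duplicate rows after replay")

assert_no_duplicates(sqlite3.connect("app.db"))
```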
Phase 5 - Execution plan (rate control + ramp)
- Pick a rate-control mechanism (token bucket, max concurrency, msgs/sec)
- Start 1–5% throughput; hold 10–15 min
- Ramp 10% → hold → 25% → hold → 50% → hold → 100%
- Poison pill handling: N failures → quarantine + continue (don’t stall)
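A sketch of the ramp, assuming hypothetical replay_one and quarantine hooks and messages that carry an id; the stage fractions and hold times mirror the checkpoints above:

```python
import time
from collections import defaultdict

# (fraction of normal throughput, hold in seconds); final stage runs until drained
RAMP = [(0.05, 900), (0.10, 900), (0.25, 900), (0.50, 900), (1.00, None)]
NORMAL_RATE = 200       # msgs/sec at steady state (illustrative)
MAX_ATTEMPTS = 3        # quarantine after N failures instead of stalling the run

def ramped_replay(messages, replay_one, quarantine):
    attempts = defaultdict(int)
    queue = list(messages)
    for fraction, hold in RAMP:
        rate = max(1, int(NORMAL_RATE * fraction))
        stage_end = time.monotonic() + hold if hold else None
        while queue and (stage_end is None or time.monotonic() < stage_end):
            msg = queue.pop(0)
            try:
                replay_one(msg)
            except Exception:
                attempts[msg["id"]] += 1        # assumes each message carries an id
                if attempts[msg["id"]] >= MAX_ATTEMPTS:
                    quarantine(msg)             # poison pill: park it, keep going
                else:
                    queue.append(msg)           # retry later in this run
            time.sleep(1.0 / rate)              # crude rate limit; a token bucket is nicer
        if not queue:
            break
```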
Phase 6 - Rollback plan
You must be able to stop quickly and undo if needed:
- How do we stop replay immediately?
- Where do “failed again” messages go (new DLQ vs quarantine store)?
- If replay caused bad writes, what is the compensating action?
- Who approves rollback?
Phase 7 - Closeout
- Residual DLQ items are classified and documented
- Improvements filed:
- classifier improvements / better error typing
- tests for the failure mode
- alerts tuned to the real signal
- replay tooling hardened
Appendix A - SQS/SNS replay mechanics
Identify what you actually have
- Is this an SQS DLQ for an SQS source queue?
- Or an SNS subscription DLQ (delivery failures routed to an SQS queue)?
- Is the source queue Standard or FIFO?
FIFO changes correctness constraints:
- ordering guarantees (per message group)
- deduplication behavior
Guardrails that matter in SQS
- Visibility timeout: too low → duplicates; too high → slow recovery
- maxReceiveCount: defines when messages dead-letter (poison pill detector)
- batch size + concurrency (especially with Lambda consumers): this is your real replay throttle
Key checks:
- Confirm visibility timeout matches worst-case processing time (plus buffer)
- Confirm DLQ redrive won’t flood the source queue beyond what consumers can handle
- If using Lambda event source mapping:
- set reserved concurrency (or a max concurrency) for controlled ramp
- avoid large batch sizes until safe
- ensure partial batch failures don’t redrive the whole batch unnecessarily
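If the consumer is Lambda behind an SQS event source mapping, the batch size, concurrency cap, and partial-batch behaviour all live on the mapping itself. A sketch with boto3 (the mapping UUID and the numbers are placeholders, and it assumes your function already returns batch item failures):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap throughput at the event source mapping: small batches, a hard concurrency
# ceiling, and partial batch responses so one bad record doesn't redrive the batch.
lambda_client.update_event_source_mapping(
    UUID="REPLACE-WITH-EVENT-SOURCE-MAPPING-UUID",
    BatchSize=10,
    ScalingConfig={"MaximumConcurrency": 5},
    FunctionResponseTypes=["ReportBatchItemFailures"],
)
```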
Safer replay patterns for SQS
Preferred:
- Controlled replay worker that reads the DLQ and re-enqueues to the source queue at a bounded rate
(labels replay traffic, quarantines poison pills, and can be stopped instantly) - see the sketch after this list
Avoid (or use only with strong throttles):
- bulk redrive of the entire DLQ back to the source queue without a ramp plan
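A sketch of that controlled replay worker with boto3; queue URLs, the rate, and the replay label are placeholders, and quarantine/stop logic is left out for brevity:

```python
import time
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"   # placeholder
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"    # placeholder
RATE_PER_SEC = 5
REPLAY_RUN_ID = "2024-06-01-run-1"

while True:
    resp = sqs.receive_message(
        QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5,
    )
    messages = resp.get("Messages", [])
    if not messages:
        break                                    # DLQ drained (for now)
    for msg in messages:
        sqs.send_message(
            QueueUrl=SOURCE_URL,
            MessageBody=msg["Body"],
            # Label replay traffic; copy the original message attributes too if
            # your consumers depend on them.
            MessageAttributes={
                "replay_run_id": {"DataType": "String", "StringValue": REPLAY_RUN_ID},
            },
        )
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(1.0 / RATE_PER_SEC)           # bounded re-enqueue rate
```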
SNS subscription DLQs
SNS delivery failures often end up in an SQS DLQ associated with the subscription. Replay generally means:
- re-publish the message (or push it to the intended destination) using a controlled tool
- don’t assume “redrive” preserves ordering, dedup, or throttling semantics
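Re-publishing usually ends up as a small script around sns.publish, reusing the same rate limits and labels as the SQS worker above; the topic ARN is a placeholder, and note that subscription DLQ bodies often carry the SNS envelope rather than the raw message:

```python
import boto3

sns = boto3.client("sns")

def republish(dlq_body, replay_run_id):
    """Re-publish one failed delivery to the original topic with a replay label.

    If the DLQ stores the raw SNS envelope (JSON with a nested "Message" field),
    extract the inner message first instead of double-wrapping it.
    """
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:orders-events",   # placeholder
        Message=dlq_body,
        MessageAttributes={
            "replay_run_id": {"DataType": "String", "StringValue": replay_run_id},
        },
    )
```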
Appendix B - Kafka replay mechanics
Kafka “DLQ” is usually implemented as a dead-letter topic. Replay typically happens via one of two approaches:
Approach 1 (recommended): replay consumer reads the DLQ topic
- Consume from DLQ topic with a dedicated replay consumer group
- Re-publish to the original topic (or a recovery topic) at a bounded rate
- Preserve the message key if ordering/correctness relies on partitioning
Key checks:
- Use a dedicated replay consumer group (never reuse the prod group)
- Preserve key (partitioning correctness) unless you have a reason not to
- Rate limit using:
- max.poll.records
- pause/resume partitions
- bounded concurrency in workers
- Commit offsets only after side effects are complete (avoid “processed but not committed” ambiguity)
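A sketch of that replay consumer using confluent-kafka; brokers, topics, group id, and rate are placeholders:

```python
import time
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",           # placeholder
    "group.id": "orders-dlq-replay-2024-06-01",   # dedicated replay group, never the prod group
    "enable.auto.commit": False,                  # commit only after the re-publish succeeds
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["orders.dlq"])

RATE_PER_SEC = 20

try:
    while True:
        msg = consumer.poll(timeout=5.0)
        if msg is None:
            break                                 # DLQ topic drained for this run
        if msg.error():
            raise RuntimeError(msg.error())
        producer.produce(
            topic="orders",                       # original topic (or a recovery topic)
            key=msg.key(),                        # preserve the key for partitioning correctness
            value=msg.value(),
            headers=(msg.headers() or []) + [("replay_run_id", b"2024-06-01-run-1")],
        )
        producer.flush()                          # make the re-publish durable...
        consumer.commit(message=msg, asynchronous=False)   # ...before committing the DLQ offset
        time.sleep(1.0 / RATE_PER_SEC)            # bounded replay rate
finally:
    consumer.close()
```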
Approach 2 (dangerous): offset reset
Offset resets are powerful and easy to misuse.
Rules:
- Never reset offsets for the production consumer group unless you fully understand what will be reprocessed (and have proven it is idempotent)
- Prefer creating a new consumer group for replay
- Validate ordering constraints (Kafka ordering is per partition)
Key checks:
- Confirm the exact topic/partition/offset range
- Confirm the replay group id
- Confirm commit strategy (manual vs auto-commit)
- Confirm stop conditions and rollback plan
Poison pill handling in Kafka
- Route failed messages back to DLQ with headers:
- original topic, partition, offset
- failure class / reason
- replay_run_id
- Quarantine after N attempts; do not spin on the same message forever
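A sketch of the quarantine path using confluent-kafka, attaching those headers when a message is routed back to the dead-letter topic (broker and topic names are placeholders):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})   # placeholder

def quarantine_to_dlq(msg, failure_class, reason, replay_run_id):
    """Route a failed message back to the dead-letter topic with provenance headers."""
    producer.produce(
        topic="orders.dlq",                                 # placeholder DLQ topic
        key=msg.key(),
        value=msg.value(),
        headers=[
            ("original_topic", msg.topic().encode()),
            ("original_partition", str(msg.partition()).encode()),
            ("original_offset", str(msg.offset()).encode()),
            ("failure_class", failure_class.encode()),
            ("failure_reason", reason.encode()),
            ("replay_run_id", replay_run_id.encode()),
        ],
    )
    producer.flush()
```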
Validation and safety rails
Kafka replay is still at-least-once unless you’ve built stronger semantics. So you still need:
- idempotency keys
- dedup semantics
- side-effect gating
- replay labeling
Common mistakes
- Replaying before fixing root cause
- Full-speed replay that melts dependencies
- No idempotency proof
- No quarantine path for poison pills
- No correctness validation (“green dashboards, wrong data”)
- No replay labeling (you can’t separate replay issues from new production issues)