Safe DLQ replay checklist
A practical runbook for replaying dead-letter messages without corrupting data or melting dependencies - with SQS/SNS and Kafka appendices.
A DLQ is not a retry button. It’s a collection of messages your system already proved it can’t safely process under current conditions.
Replaying without guardrails is how you turn an incident into:
- data corruption
- duplicate side effects
- dependency meltdown
- a second incident you caused yourself
This is a practical runbook you can copy into ops docs.
TL;DR one-page checklist
0) Preconditions
- Fix the root cause (or replay will deterministically fail again)
- Write a replay plan (scope, rate, validation, rollback)
- Assign one driver + one approver + a comms channel
1) Stop the bleeding
- Pause the failing consumer or gate it into “quarantine mode”
- Confirm DLQ growth has stopped (or is understood and bounded)
- Confirm dependency health + capacity (DB, APIs, downstream services)
2) Classify what’s in the DLQ
- Sample 20–50 messages across the time window
- Identify dominant failure class(es): transient, throttling, permanent, poison pill, contract drift
- Decide whether ordering matters (per key / per partition)
3) Choose replay strategy
- Default to windowed replay (small batch you can validate)
- Use selective replay if only some event types/keys are safe
- Quarantine poison pills automatically (don’t let one stall the batch)
4) Prove safety (non-negotiable)
- Ensure idempotency/dedup exists and survives restarts
- Ensure side effects are idempotent or gated during replay (email/billing/webhooks)
- Confirm schema/version compatibility
- Label replay in logs/metrics/traces (replay_run_id)
5) Execute with guardrails
- Start low (1–5% of normal throughput), ramp with checkpoints (10/25/50/100%)
- Enforce caps: concurrency + rate + time budgets
- Monitor: error rate by class, dependency saturation, DLQ re-entry rate
6) Validate and close
- Run domain correctness checks (invariants, reconciliations)
- Classify residual DLQ (irrecoverable vs manual remediation vs later fix)
- File post-incident improvements (tests, alerts, runbooks, tooling)
The default replay pattern I recommend
Windowed replay
Pick a small time window from the DLQ (or a fixed count), replay it slowly, validate correctness, then expand.
Why:
- lets you detect “we’re duplicating side effects” before you duplicate a lot
- protects dependencies from a sudden wall of replay traffic
- gives you clean rollback boundaries
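A minimal sketch of that loop, assuming hypothetical fetch_batch, replay_one, and validate_window hooks you would wire up to your broker and your domain checks:

```python
import time
import uuid

def windowed_replay(fetch_batch, replay_one, validate_window,
                    batch_size=50, pause_seconds=60):
    """Replay the DLQ in small, validated windows.

    fetch_batch(n)    -> up to n DLQ messages (hypothetical broker interface)
    replay_one(msg)   -> re-submits one message to the source; raises on failure
    validate_window() -> True if correctness checks pass for the window just replayed
    """
    replay_run_id = str(uuid.uuid4())           # label everything with this id
    while True:
        window = fetch_batch(batch_size)
        if not window:
            break                               # DLQ drained
        failed = []
        for msg in window:
            try:
                replay_one(msg)
            except Exception as exc:            # collect for quarantine; don't stall the window
                failed.append((msg, exc))
        if not validate_window():
            raise RuntimeError(
                f"replay_run_id={replay_run_id}: validation failed, stop and investigate")
        print(f"replay_run_id={replay_run_id}: {len(window) - len(failed)} replayed, "
              f"{len(failed)} sent to quarantine")
        time.sleep(pause_seconds)               # hold before the next window
```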
Detailed runbook
Phase 1 - Triage and containment
- Pause consumer / disable redrive loops
- Capture samples with error reasons + metadata (event type, key/tenant, timestamp)
- Confirm dependencies are stable and ready for replay load
- Define stop conditions (error thresholds, saturation thresholds)
Phase 2 - Failure-mode classification
Common buckets:
- Transient dependency failures (timeouts, resets)
- Throttling/overload (429s, queue full, rate-limits)
- Permanent errors (validation, invariants)
- Poison pills (bad payloads, unhandled edge cases)
- Contract drift (schema/version mismatch)
Key questions:
- Can you deterministically reproduce the failure for a single message?
- Is the fix backward compatible with old messages?
- Is ordering required for correctness?
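Before deciding anything, it helps to turn raw error metadata into these buckets programmatically. A rough heuristic sketch, assuming your DLQ records carry an error string and (optionally) an HTTP status:

```python
import re

def classify_failure(error_text, status_code=None):
    """Map a DLQ error record onto the failure buckets above (heuristics only)."""
    text = (error_text or "").lower()
    if status_code == 429 or "throttl" in text or "rate limit" in text:
        return "throttling"
    if (status_code and 500 <= status_code < 600) or "timeout" in text or "connection reset" in text:
        return "transient"
    if "schema" in text or "unknown field" in text or "unsupported version" in text:
        return "contract_drift"
    if re.search(r"validation|invariant|constraint", text):
        return "permanent"
    return "poison_pill"   # unexplained failures get quarantined, not blindly retried
```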
Phase 3 - Safety rails (idempotency + side effects)
Idempotency
- Identify the idempotency key (message id / business id / (tenant, entity, version))
- Ensure processing is “exactly once for that key” via:
- unique constraints + upsert semantics, or
- idempotency store, or
- dedup cache with persistence
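A minimal sketch of the unique-constraint approach, using SQLite purely for illustration; the table and key shape are assumptions you would replace with your real store:

```python
import sqlite3

conn = sqlite3.connect("processed.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS processed_messages (
        idempotency_key TEXT PRIMARY KEY,   -- e.g. "tenant:entity:version"
        processed_at    TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def process_once(idempotency_key, handler):
    """Run handler() at most once per key; return False if it was a duplicate.

    In production, insert the key and apply the business write in the same
    transaction so a crash cannot mark work as done that never happened.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_messages (idempotency_key) VALUES (?)",
        (idempotency_key,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return False            # already processed: safe to skip on replay
    handler()
    return True
```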
Side effects
- Enumerate side effects per message type (emails, billing, webhooks, external writes)
- Ensure they are idempotent, suppressed during replay, or routed to a sandbox
- Add a replay mode flag if needed (“don’t notify on replay”)
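One way to gate side effects is a replay flag checked at the dispatch point. A sketch, assuming a hypothetical per-message-type registry of (name, callable) effects:

```python
import os

# Set by the replay tooling, never in normal production config.
REPLAY_MODE = os.environ.get("REPLAY_MODE", "false").lower() == "true"
SUPPRESS_ON_REPLAY = {"email", "webhook"}      # user-visible effects to gate

def dispatch_side_effects(message, side_effects):
    """side_effects: list of (name, callable) pairs registered for this message type."""
    for name, effect in side_effects:
        if REPLAY_MODE and name in SUPPRESS_ON_REPLAY:
            # Suppressed during replay; log for audit instead of notifying users twice.
            print(f"[replay] suppressed {name} for message {message.get('id')}")
            continue
        effect(message)
```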
Phase 4 - Observability + correctness checks
Minimum observability:
- replay_run_id label for all logs/metrics/traces
- attempt_count + failure_class metrics
- dashboards for: throughput, error rate (by class), dependency latency/errors, DLQ drain/re-entry
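A small sketch of the labeling using the standard logging module; a metrics client would carry the same replay_run_id tag:

```python
import logging
import uuid

REPLAY_RUN_ID = str(uuid.uuid4())

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s replay_run_id=%(replay_run_id)s %(message)s"))

logger = logging.getLogger("replay")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every record emitted through this adapter carries the run id, so replay
# traffic can be filtered apart from live production traffic.
log = logging.LoggerAdapter(logger, {"replay_run_id": REPLAY_RUN_ID})
log.info("replayed message attempt_count=%d failure_class=%s", 2, "transient")
```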
Correctness checks (choose a few):
- record-count invariants
- uniqueness invariants (no duplicates by business key)
- reconciliation checks (balances, totals, state transitions)
- audit trail checks
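A sketch of a uniqueness check, assuming a relational store and a hypothetical orders table keyed by order_id; substitute whatever invariant your domain actually guarantees:

```python
import sqlite3

def assert_no_duplicates(conn, table="orders", business_key="order_id"):
    """Fail loudly if replay introduced duplicate rows for the same business key."""
    (dup_count,) = conn.execute(
        f"SELECT COUNT(*) FROM "
        f"(SELECT {business_key} FROM {table} GROUP BY {business_key} HAVING COUNT(*) > 1)"
    ).fetchone()
    if dup_count:
        raise AssertionError(f"{dup_count} business keys have duplicate rows after replay")

assert_no_duplicates(sqlite3.connect("app.db"))
```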
Phase 5 - Execution plan (rate control + ramp)
- Pick a rate-control mechanism (token bucket, max concurrency, msgs/sec)
- Start 1–5% throughput; hold 10–15 min
- Ramp 10% → hold → 25% → hold → 50% → hold → 100%
- Poison pill handling: N failures → quarantine + continue (don’t stall)
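A sketch of the ramp, assuming hypothetical replay_one and quarantine hooks and messages that carry an id; the stage fractions and hold times mirror the checkpoints above:

```python
import time
from collections import defaultdict

# (fraction of normal throughput, hold in seconds); final stage runs until drained
RAMP = [(0.05, 900), (0.10, 900), (0.25, 900), (0.50, 900), (1.00, None)]
NORMAL_RATE = 200       # msgs/sec at steady state (illustrative)
MAX_ATTEMPTS = 3        # quarantine after N failures instead of stalling the run

def ramped_replay(messages, replay_one, quarantine):
    attempts = defaultdict(int)
    queue = list(messages)
    for fraction, hold in RAMP:
        rate = max(1, int(NORMAL_RATE * fraction))
        stage_end = time.monotonic() + hold if hold else None
        while queue and (stage_end is None or time.monotonic() < stage_end):
            msg = queue.pop(0)
            try:
                replay_one(msg)
            except Exception:
                attempts[msg["id"]] += 1        # assumes each message carries an id
                if attempts[msg["id"]] >= MAX_ATTEMPTS:
                    quarantine(msg)             # poison pill: park it, keep going
                else:
                    queue.append(msg)           # retry later in this run
            time.sleep(1.0 / rate)              # crude rate limit; a token bucket is nicer
        if not queue:
            break
```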
Phase 6 - Rollback plan
You must be able to stop quickly and undo if needed:
- How do we stop replay immediately?
- Where do “failed again” messages go (new DLQ vs quarantine store)?
- If replay caused bad writes, what is the compensating action?
- Who approves rollback?
Phase 7 - Closeout
- Residual DLQ items are classified and documented
- Improvements filed:
- classifier improvements / better error typing
- tests for the failure mode
- alerts tuned to the real signal
- replay tooling hardened
Appendix A - SQS/SNS replay mechanics
Identify what you actually have
- Is this an SQS DLQ for an SQS source queue?
- Or an SNS subscription DLQ (delivery failures routed to an SQS queue)?
- Is the source queue Standard or FIFO?
FIFO changes correctness constraints:
- ordering guarantees (per message group)
- deduplication behavior
Guardrails that matter in SQS
- Visibility timeout: too low → duplicates; too high → slow recovery
- maxReceiveCount: defines when messages dead-letter (poison pill detector)
- batch size + concurrency (especially with Lambda consumers): this is your real replay throttle
Key checks:
- Confirm visibility timeout matches worst-case processing time (plus buffer)
- Confirm DLQ redrive won’t flood the source queue beyond what consumers can handle
- If using Lambda event source mapping:
- set reserved concurrency (or a max concurrency) for controlled ramp
- avoid large batch sizes until safe
- ensure partial batch failures don’t redrive the whole batch unnecessarily
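If the consumer is Lambda behind an SQS event source mapping, the batch size, concurrency cap, and partial-batch behaviour all live on the mapping itself. A sketch with boto3 (the mapping UUID and the numbers are placeholders, and it assumes your function already returns batch item failures):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap throughput at the event source mapping: small batches, a hard concurrency
# ceiling, and partial batch responses so one bad record doesn't redrive the batch.
lambda_client.update_event_source_mapping(
    UUID="REPLACE-WITH-EVENT-SOURCE-MAPPING-UUID",
    BatchSize=10,
    ScalingConfig={"MaximumConcurrency": 5},
    FunctionResponseTypes=["ReportBatchItemFailures"],
)
```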
Safer replay patterns for SQS
Preferred:
- Controlled replay worker that reads the DLQ and re-enqueues to the source queue at a bounded rate
(labels replay traffic, quarantines poison pills, and can be stopped instantly) - see the sketch after this list
Avoid (or use only with strong throttles):
- bulk redrive of the entire DLQ back to the source queue without a ramp plan
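A sketch of that controlled replay worker with boto3; queue URLs, the rate, and the replay label are placeholders, and quarantine/stop logic is left out for brevity:

```python
import time
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"   # placeholder
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"    # placeholder
RATE_PER_SEC = 5
REPLAY_RUN_ID = "2024-06-01-run-1"

while True:
    resp = sqs.receive_message(
        QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5,
    )
    messages = resp.get("Messages", [])
    if not messages:
        break                                    # DLQ drained (for now)
    for msg in messages:
        sqs.send_message(
            QueueUrl=SOURCE_URL,
            MessageBody=msg["Body"],
            # Label replay traffic; copy the original message attributes too if
            # your consumers depend on them.
            MessageAttributes={
                "replay_run_id": {"DataType": "String", "StringValue": REPLAY_RUN_ID},
            },
        )
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(1.0 / RATE_PER_SEC)           # bounded re-enqueue rate
```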
SNS subscription DLQs
SNS delivery failures often end up in an SQS DLQ associated with the subscription. Replay generally means:
- re-publish the message (or push it to the intended destination) using a controlled tool
- don’t assume “redrive” preserves ordering, dedup, or throttling semantics
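Re-publishing usually ends up as a small script around sns.publish, reusing the same rate limits and labels as the SQS worker above; the topic ARN is a placeholder, and note that subscription DLQ bodies often carry the SNS envelope rather than the raw message:

```python
import boto3

sns = boto3.client("sns")

def republish(dlq_body, replay_run_id):
    """Re-publish one failed delivery to the original topic with a replay label.

    If the DLQ stores the raw SNS envelope (JSON with a nested "Message" field),
    extract the inner message first instead of double-wrapping it.
    """
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:orders-events",   # placeholder
        Message=dlq_body,
        MessageAttributes={
            "replay_run_id": {"DataType": "String", "StringValue": replay_run_id},
        },
    )
```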
Appendix B - Kafka replay mechanics
Kafka “DLQ” is usually implemented as a dead-letter topic. Replay typically happens via one of two approaches:
Approach 1 (recommended): replay consumer reads the DLQ topic
- Consume from DLQ topic with a dedicated replay consumer group
- Re-publish to the original topic (or a recovery topic) at a bounded rate
- Preserve the message key if ordering/correctness relies on partitioning
Key checks:
- Use a dedicated replay consumer group (never reuse the prod group)
- Preserve key (partitioning correctness) unless you have a reason not to
- Rate limit using:
- max.poll.records
- pause/resume partitions
- bounded concurrency in workers
- Commit offsets only after side effects are complete (avoid “processed but not committed” ambiguity)
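A sketch of that replay consumer using confluent-kafka; brokers, topics, group id, and rate are placeholders:

```python
import time
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",           # placeholder
    "group.id": "orders-dlq-replay-2024-06-01",   # dedicated replay group, never the prod group
    "enable.auto.commit": False,                  # commit only after the re-publish succeeds
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["orders.dlq"])

RATE_PER_SEC = 20

try:
    while True:
        msg = consumer.poll(timeout=5.0)
        if msg is None:
            break                                 # DLQ topic drained for this run
        if msg.error():
            raise RuntimeError(msg.error())
        producer.produce(
            topic="orders",                       # original topic (or a recovery topic)
            key=msg.key(),                        # preserve the key for partitioning correctness
            value=msg.value(),
            headers=(msg.headers() or []) + [("replay_run_id", b"2024-06-01-run-1")],
        )
        producer.flush()                          # make the re-publish durable...
        consumer.commit(message=msg, asynchronous=False)   # ...before committing the DLQ offset
        time.sleep(1.0 / RATE_PER_SEC)            # bounded replay rate
finally:
    consumer.close()
```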
Approach 2 (dangerous): offset reset
Offset resets are powerful and easy to misuse.
Rules:
- Never reset offsets for the production consumer group unless you fully understand what will be reprocessed (and have proven it is idempotent)
- Prefer creating a new consumer group for replay
- Validate ordering constraints (Kafka ordering is per partition)
Key checks:
- Confirm the exact topic/partition/offset range
- Confirm the replay group id
- Confirm commit strategy (manual vs auto-commit)
- Confirm stop conditions and rollback plan
Poison pill handling in Kafka
- Route failed messages back to DLQ with headers:
- original topic, partition, offset
- failure class / reason
- replay_run_id
- Quarantine after N attempts; do not spin on the same message forever
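A sketch of the quarantine path using confluent-kafka, attaching those headers when a message is routed back to the dead-letter topic (broker and topic names are placeholders):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})   # placeholder

def quarantine_to_dlq(msg, failure_class, reason, replay_run_id):
    """Route a failed message back to the dead-letter topic with provenance headers."""
    producer.produce(
        topic="orders.dlq",                                 # placeholder DLQ topic
        key=msg.key(),
        value=msg.value(),
        headers=[
            ("original_topic", msg.topic().encode()),
            ("original_partition", str(msg.partition()).encode()),
            ("original_offset", str(msg.offset()).encode()),
            ("failure_class", failure_class.encode()),
            ("failure_reason", reason.encode()),
            ("replay_run_id", replay_run_id.encode()),
        ],
    )
    producer.flush()
```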
Validation and safety rails
Kafka replay is still at-least-once unless you’ve built stronger semantics. So you still need:
- idempotency keys
- dedup semantics
- side-effect gating
- replay labeling
Common mistakes
- Replaying before fixing root cause
- Full-speed replay that melts dependencies
- No idempotency proof
- No quarantine path for poison pills
- No correctness validation (“green dashboards, wrong data”)
- No replay labeling (you can’t separate replay issues from new production issues)