
Safe DLQ replay checklist

7 min read · Operability · Safe recovery · Runbooks · Event-driven · SQS · Kafka

A practical runbook for replaying dead-letter messages without corrupting data or melting dependencies - with SQS/SNS and Kafka appendices.

A DLQ is not a retry button. It’s a collection of messages your system already proved it can’t safely process under current conditions.

Replaying without guardrails is how you turn an incident into:

  • data corruption
  • duplicate side effects
  • dependency meltdown
  • a second incident you caused yourself

This is a practical runbook you can copy into ops docs.

TL;DR one-page checklist

0) Preconditions

  • Fix the root cause (or replay will deterministically fail again)
  • Write a replay plan (scope, rate, validation, rollback)
  • Assign one driver + one approver + a comms channel

1) Stop the bleeding

  • Pause the failing consumer or gate it into “quarantine mode”
  • Confirm DLQ growth has stopped (or is understood and bounded)
  • Confirm dependency health + capacity (DB, APIs, downstream services)

2) Classify what’s in the DLQ

  • Sample 20–50 messages across the time window
  • Identify dominant failure class(es): transient, throttling, permanent, poison pill, contract drift
  • Decide whether ordering matters (per key / per partition)

3) Choose replay strategy

  • Default to windowed replay (small batch you can validate)
  • Use selective replay if only some event types/keys are safe
  • Quarantine poison pills automatically (don’t let one stall the batch)

4) Prove safety (non-negotiable)

  • Ensure idempotency/dedup exists and survives restarts
  • Ensure side effects are idempotent or gated during replay (email/billing/webhooks)
  • Confirm schema/version compatibility
  • Label replay in logs/metrics/traces (replay_run_id)

5) Execute with guardrails

  • Start low (1–5% of normal), ramp with checkpoints (10/25/50/100)
  • Enforce caps: concurrency + rate + time budgets
  • Monitor: error rate by class, dependency saturation, DLQ re-entry rate

6) Validate and close

  • Run domain correctness checks (invariants, reconciliations)
  • Classify residual DLQ (irrecoverable vs manual remediation vs later fix)
  • File post-incident improvements (tests, alerts, runbooks, tooling)

The default replay pattern I recommend

Windowed replay

Pick a small time window from the DLQ (or a fixed count), replay it slowly, validate correctness, then expand.

Why:

  • lets you detect “we’re duplicating side effects” before you duplicate a lot
  • protects dependencies from a sudden traffic cliff
  • gives you clean rollback boundaries
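
A minimal sketch of that loop in Python, assuming hypothetical fetch_window / replay_one / validate_window hooks wired to whatever holds your dead-lettered messages:

    import time

    def windowed_replay(fetch_window, replay_one, validate_window,
                        window_size=100, delay_s=0.2):
        """Replay the DLQ one small window at a time; stop on the first
        window that fails validation. All three callables are hypothetical
        hooks you wire to your own queue or store."""
        while True:
            window = fetch_window(window_size)      # oldest N dead-lettered messages
            if not window:
                break                               # DLQ drained
            for msg in window:
                replay_one(msg)                     # re-enqueue / re-process one message
                time.sleep(delay_s)                 # crude rate limit between messages
            if not validate_window(window):         # correctness checks for this window
                raise RuntimeError("validation failed; investigate before expanding")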

Detailed runbook

Phase 1 - Triage and containment

  • Pause consumer / disable redrive loops
  • Capture samples with error reasons + metadata (event type, key/tenant, timestamp)
  • Confirm dependencies are stable and ready for replay load
  • Define stop conditions (error thresholds, saturation thresholds)

Phase 2 - Failure-mode classification

Common buckets:

  • Transient dependency failures (timeouts, resets)
  • Throttling/overload (429s, queue full, rate-limits)
  • Permanent errors (validation, invariants)
  • Poison pills (bad payloads, unhandled edge cases)
  • Contract drift (schema/version mismatch)

Key questions:

  • Can you reproduce failure for a single message deterministically?
  • Is the fix backward compatible with old messages?
  • Is ordering required for correctness?

Phase 3 - Safety rails (idempotency + side effects)

Idempotency

  • Identify the idempotency key (message id / business id / (tenant, entity, version))
  • Ensure processing is “exactly once for that key” via:
    • unique constraints + upsert semantics, or
    • idempotency store, or
    • dedup cache with persistence
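
For the unique-constraint route, a minimal sketch with SQLite standing in for your real database (table and key names are illustrative):

    import sqlite3

    db = sqlite3.connect("idempotency.db")
    db.execute("""CREATE TABLE IF NOT EXISTS processed_messages (
                      idempotency_key TEXT PRIMARY KEY,
                      processed_at    TEXT DEFAULT CURRENT_TIMESTAMP)""")

    def process_once(idempotency_key, handler, message):
        # INSERT OR IGNORE is a no-op if the key already exists, so the claim
        # survives restarts and concurrent replays against the same database.
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_messages (idempotency_key) VALUES (?)",
            (idempotency_key,))
        db.commit()
        if cur.rowcount == 0:
            return "skipped: already processed"
        # In production, the claim and the side effects should share a transaction,
        # or you need a release-on-failure path if handler() raises here.
        handler(message)
        return "processed"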

Side effects

  • Enumerate side effects per message type (emails, billing, webhooks, external writes)
  • Ensure they are idempotent, suppressed during replay, or routed to a sandbox
  • Add a replay mode flag if needed (“don’t notify on replay”)
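
A sketch of that replay-mode gate, assuming the replay tooling sets a REPLAY_RUN_ID environment variable and that mailer is whatever notification client you use (both hypothetical):

    import os

    REPLAY_RUN_ID = os.environ.get("REPLAY_RUN_ID")   # set only by the replay tooling

    def send_order_confirmation(order, mailer):
        if REPLAY_RUN_ID:
            # Customer-visible side effects are suppressed (or routed to a sandbox)
            # during replay; log the suppression so it is auditable.
            print(f"replay {REPLAY_RUN_ID}: suppressed confirmation email for order {order['id']}")
            return
        mailer.send(to=order["email"], template="order_confirmation", order=order)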

Phase 4 - Observability + correctness checks

Minimum observability:

  • replay_run_id label for all logs/metrics/traces
  • attempt_count + failure_class metrics
  • dashboards for: throughput, error rate (by class), dependency latency/errors, DLQ drain/re-entry

Correctness checks (choose a few):

  • record-count invariants
  • uniqueness invariants (no duplicates by business key)
  • reconciliation checks (balances, totals, state transitions)
  • audit trail checks
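
Two of these expressed as queries, assuming hypothetical orders / payments / ledger_entries tables; adapt the SQL to your own schema:

    import sqlite3

    def check_invariants(db: sqlite3.Connection) -> None:
        # Uniqueness invariant: no duplicate payments per business key.
        dupes = db.execute("""
            SELECT order_id, COUNT(*) AS n
            FROM payments
            GROUP BY order_id
            HAVING COUNT(*) > 1
        """).fetchall()
        assert not dupes, f"duplicate payments for orders: {dupes[:10]}"

        # Reconciliation: every completed order has exactly one ledger entry
        # and the amounts agree.
        mismatched = db.execute("""
            SELECT o.id
            FROM orders o
            LEFT JOIN ledger_entries l ON l.order_id = o.id
            WHERE o.status = 'completed'
              AND (l.order_id IS NULL OR l.amount != o.amount)
        """).fetchall()
        assert not mismatched, f"orders that do not reconcile: {mismatched[:10]}"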

Phase 5 - Execution plan (rate control + ramp)

  • Pick a rate-control mechanism (token bucket, max concurrency, msgs/sec)
  • Start 1–5% throughput; hold 10–15 min
  • Ramp 10% → hold → 25% → hold → 50% → hold → 100%
  • Poison pill handling: N failures → quarantine + continue (don’t stall)
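
The ramp can be as simple as a staged schedule with a hold-and-check loop at each step. A sketch where replay_batch and healthy are hypothetical hooks into your replay worker and your dashboards/alerts:

    import time

    RAMP = [(0.05, 15 * 60), (0.10, 15 * 60), (0.25, 15 * 60),
            (0.50, 15 * 60), (1.00, None)]       # (fraction of normal rate, hold seconds)
    NORMAL_MSGS_PER_SEC = 50                      # illustrative baseline

    def ramped_replay(replay_batch, healthy):
        for fraction, hold_s in RAMP:
            rate = NORMAL_MSGS_PER_SEC * fraction
            deadline = None if hold_s is None else time.monotonic() + hold_s
            while deadline is None or time.monotonic() < deadline:
                if not healthy():                 # error rate / saturation checkpoint
                    raise RuntimeError(f"stop condition hit at {fraction:.0%} of normal rate")
                n = replay_batch(max_messages=int(max(rate, 1)))  # replay up to `rate` msgs
                if n == 0:
                    return                        # DLQ drained
                time.sleep(1.0)                   # one-second pacing window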

Phase 6 - Rollback plan

You must be able to stop quickly and undo if needed:

  • How do we stop replay immediately?
  • Where do “failed again” messages go (new DLQ vs quarantine store)?
  • If replay caused bad writes, what is the compensating action?
  • Who approves rollback?

Phase 7 - Closeout

  • Residual DLQ items are classified and documented
  • Improvements filed:
    • classifier improvements / better error typing
    • tests for the failure mode
    • alerts tuned to the real signal
    • replay tooling hardened

Appendix A - SQS/SNS replay mechanics

Identify what you actually have

  • Is this an SQS DLQ for an SQS source queue?
  • Or an SNS subscription DLQ (delivery failures routed to an SQS queue)?
  • Is the source queue Standard or FIFO?

FIFO changes correctness constraints:

  • ordering guarantees (per message group)
  • deduplication behavior

Guardrails that matter in SQS

  • Visibility timeout: too low → duplicates; too high → slow recovery
  • maxReceiveCount: defines when messages dead-letter (poison pill detector)
  • batch size + concurrency (especially with Lambda consumers): this is your real replay throttle

Key checks:

  • Confirm visibility timeout matches worst-case processing time (plus buffer)
  • Confirm DLQ redrive won’t flood the source queue beyond what consumers can handle
  • If using Lambda event source mapping:
    • set reserved concurrency (or a max concurrency) for controlled ramp
    • avoid large batch sizes until safe
    • ensure partial batch failures don’t retry the whole batch unnecessarily (enable partial batch responses / ReportBatchItemFailures)
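
A quick pre-flight with boto3 that reads these settings off the source queue and the DLQ backlog (queue URLs are placeholders):

    import json
    import boto3

    sqs = boto3.client("sqs")
    SOURCE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"       # placeholder
    DLQ_URL    = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders-dlq"   # placeholder

    attrs = sqs.get_queue_attributes(
        QueueUrl=SOURCE_URL,
        AttributeNames=["VisibilityTimeout", "RedrivePolicy"])["Attributes"]

    visibility = int(attrs["VisibilityTimeout"])
    redrive = json.loads(attrs.get("RedrivePolicy", "{}"))
    backlog = sqs.get_queue_attributes(
        QueueUrl=DLQ_URL,
        AttributeNames=["ApproximateNumberOfMessages"])["Attributes"]["ApproximateNumberOfMessages"]

    print(f"visibility timeout: {visibility}s (must exceed worst-case processing time + buffer)")
    print(f"maxReceiveCount:    {redrive.get('maxReceiveCount')}")
    print(f"DLQ backlog:        {backlog} messages to plan the ramp around")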

Safer replay patterns for SQS

Preferred:

  • Controlled replay worker that reads the DLQ and re-enqueues to the source queue at a bounded rate (labels replay traffic, quarantines poison pills, and can be stopped instantly)
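
A minimal sketch of such a worker with boto3; queue URLs and the rate are placeholders, and the poison-pill quarantine path is omitted for brevity:

    import time
    import uuid
    import boto3

    sqs = boto3.client("sqs")
    DLQ_URL    = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders-dlq"  # placeholder
    SOURCE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"      # placeholder
    MSGS_PER_SEC = 5                                 # bounded replay rate
    REPLAY_RUN_ID = str(uuid.uuid4())                # label for logs/metrics/traces

    def replay_dlq():
        while True:
            resp = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10,
                                       WaitTimeSeconds=2, MessageAttributeNames=["All"])
            messages = resp.get("Messages", [])
            if not messages:
                break                                # DLQ drained
            for msg in messages:
                # Carry original attributes forward and label the traffic as replay.
                attrs = {name: {k: v for k, v in a.items()
                                if k in ("DataType", "StringValue", "BinaryValue")}
                         for name, a in msg.get("MessageAttributes", {}).items()}
                attrs["replay_run_id"] = {"DataType": "String", "StringValue": REPLAY_RUN_ID}
                sqs.send_message(QueueUrl=SOURCE_URL, MessageBody=msg["Body"],
                                 MessageAttributes=attrs)
                # Remove from the DLQ only after the re-enqueue succeeded.
                sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
                time.sleep(1.0 / MSGS_PER_SEC)       # bounded rate; Ctrl-C stops it instantly

    if __name__ == "__main__":
        replay_dlq()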

Avoid (or use only with strong throttles):

  • bulk redrive of the entire DLQ back to the source queue without a ramp plan

SNS subscription DLQs

SNS delivery failures often end up in an SQS DLQ associated with the subscription. Replay generally means:

  • re-publish the message (or push it to the intended destination) using a controlled tool
  • don’t assume “redrive” preserves ordering, dedup, or throttling semantics
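
A re-publish sketch with boto3 (queue URL and topic ARN are placeholders). Note that re-publishing to the topic fans out to every subscriber, so prefer pushing directly to the one destination that failed when you can:

    import json
    import time
    import boto3

    sqs, sns = boto3.client("sqs"), boto3.client("sns")
    SUB_DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders-sub-dlq"  # placeholder
    TOPIC_ARN   = "arn:aws:sns:eu-west-1:123456789012:orders"                        # placeholder

    while True:
        resp = sqs.receive_message(QueueUrl=SUB_DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            # Without raw message delivery, the DLQ body is the SNS envelope,
            # so unwrap the original payload before re-publishing.
            envelope = json.loads(msg["Body"])
            payload = envelope.get("Message", msg["Body"])
            sns.publish(TopicArn=TOPIC_ARN, Message=payload)
            sqs.delete_message(QueueUrl=SUB_DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
            time.sleep(0.2)                          # bounded rate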

Appendix B - Kafka replay mechanics

Kafka “DLQ” is usually implemented as a dead-letter topic. Replay typically happens via one of two approaches.

Approach 1 (preferred): consume the dead-letter topic

  • Consume from the DLQ topic with a dedicated replay consumer group
  • Re-publish to the original topic (or a recovery topic) at a bounded rate
  • Preserve the message key if ordering/correctness relies on partitioning

Key checks:

  • Use a dedicated replay consumer group (never reuse the prod group)
  • Preserve key (partitioning correctness) unless you have a reason not to
  • Rate limit using:
    • max.poll.records
    • pause/resume partitions
    • bounded concurrency in workers
  • Commit offsets only after side effects are complete (avoid “processed but not committed” ambiguity)
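
A sketch of that loop with kafka-python; topic names, the group id, and the replay_run_id value are illustrative:

    import time
    from kafka import KafkaConsumer, KafkaProducer

    BOOTSTRAP = "localhost:9092"                       # placeholder
    consumer = KafkaConsumer(
        "orders.dlq",
        bootstrap_servers=BOOTSTRAP,
        group_id="orders-dlq-replay-2024-06-01",       # dedicated replay group, never the prod group
        enable_auto_commit=False,                      # commit manually, after side effects
        auto_offset_reset="earliest",
        max_poll_records=50)                           # small batches act as the rate ceiling
    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)

    while True:
        batch = consumer.poll(timeout_ms=2000)
        if not batch:
            break                                      # DLQ topic drained (for now)
        for tp, records in batch.items():
            for rec in records:
                # Preserve the key so the message lands on the same partition of
                # the original topic and per-key ordering still holds.
                producer.send("orders", key=rec.key, value=rec.value,
                              headers=[("replay_run_id", b"run-2024-06-01")])
            producer.flush()                           # wait for broker acks
        consumer.commit()                              # only after the re-publish is durable
        time.sleep(1.0)                                # crude bound on replay throughput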

Approach 2 (dangerous): offset reset

Offset resets are powerful and easy to misuse.

Rules:

  • Never reset offsets for the production consumer group unless you are certain every message in the range is safe to reprocess
  • Prefer creating a new consumer group for replay
  • Validate ordering constraints (Kafka ordering is per partition)

Key checks:

  • Confirm the exact topic/partition/offset range
  • Confirm the replay group id
  • Confirm commit strategy (manual vs auto-commit)
  • Confirm stop conditions and rollback plan
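
A sketch of the safer variant with kafka-python: a brand-new group, explicit assignment, and a seek to a timestamp, leaving the production group’s offsets untouched (names and the timestamp are placeholders):

    from kafka import KafkaConsumer, TopicPartition

    TOPIC = "orders"                                   # placeholder
    REPLAY_FROM_MS = 1717200000000                     # epoch millis to replay from (placeholder)

    consumer = KafkaConsumer(
        bootstrap_servers="localhost:9092",
        group_id="orders-replay-2024-06-01",           # new group; prod group offsets untouched
        enable_auto_commit=False)

    partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
    consumer.assign(partitions)                        # explicit assignment, no rebalance surprises

    # Translate the timestamp into an offset per partition and seek there.
    offsets = consumer.offsets_for_times({tp: REPLAY_FROM_MS for tp in partitions})
    for tp, ot in offsets.items():
        if ot is not None:
            consumer.seek(tp, ot.offset)

    # From here, poll and reprocess with the same rate limits and manual
    # commits as the dead-letter-topic replay above.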

Poison pill handling in Kafka

  • Route failed messages back to DLQ with headers:
    • original topic, partition, offset
    • failure class / reason
    • replay_run_id
  • Quarantine after N attempts; do not spin on the same message forever
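
A sketch of routing a repeatedly failing record back to the dead-letter topic with those headers attached (kafka-python; header names mirror the list above):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")   # placeholder broker

    def quarantine(rec, failure_class, replay_run_id, dlq_topic="orders.dlq"):
        """Send a record that keeps failing back to the DLQ topic, annotated so
        the next triage pass knows where it came from and why it failed."""
        headers = [
            ("original_topic",     rec.topic.encode()),
            ("original_partition", str(rec.partition).encode()),
            ("original_offset",    str(rec.offset).encode()),
            ("failure_class",      failure_class.encode()),
            ("replay_run_id",      replay_run_id.encode()),
        ]
        producer.send(dlq_topic, key=rec.key, value=rec.value, headers=headers)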

Validation and safety rails

Kafka replay is still at-least-once unless you’ve built stronger semantics. So you still need:

  • idempotency keys
  • dedup semantics
  • side-effect gating
  • replay labeling

Common mistakes

  • Replaying before fixing root cause
  • Full-speed replay that melts dependencies
  • No idempotency proof
  • No quarantine path for poison pills
  • No correctness validation (“green dashboards, wrong data”)
  • No replay labeling (you can’t separate replay issues from new production issues)