Why redress
My retry philosophy: classification-first, bounded unknowns, capped exponential backoff, and observability hooks.
This is the short version; the full deep dive covers design tradeoffs, examples, and implementation notes.
Most retry logic in the wild is either too naive ("just retry") or too timid ("never retry"). In real systems, partial failure is normal, dependencies are flaky, and tail latency is where user experience goes to die.
My bias is simple: retries should improve outcomes without amplifying load or hiding real failures. Redress exists to make that behavior explicit, predictable, and observable.
The problem with naive retries
Naive retries fail in predictable ways:
- They amplify load. When a dependency is unhealthy, retries increase traffic exactly when the dependency can least handle it.
- They shift work into the tail. You might "succeed" more often, but your p95/p99 explodes.
- They synchronize clients. Exponential backoff without jitter creates thundering herds.
- They hide incidents. The system looks "fine" until it isn't - and then you have no clear signal about what actually happened.
- They treat all failures the same. Timeouts, 429s, validation errors, and logic bugs should not share a strategy.
The result: retries become a silent control plane that nobody owns.
The philosophy: policy, not hope
Redress is built around a small set of rules.
1) Classification-first
Retry is not a boolean. You need classes of failure with different behavior.
Examples:
- Transient (timeouts, connect resets) -> retry cautiously
- Capacity / throttling (429s, queue full) -> retry with stronger backoff + lower concurrency
- Permanent (validation, schema mismatch) -> don't retry; quarantine or drop
- Unknown -> bounded retries with tight budgets (fail fast)
2) Bounded unknowns
If you can't explain why you're retrying, you don't get infinite attempts.
Defaults should be safe:
- small attempt caps
- strict time budgets
- aggressive observability
3) Budgeted retries
Retries need a budget - time and/or attempts - so they can't silently dominate end-to-end latency.
A practical rule:
- "If we can't succeed quickly, stop and surface."
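A minimal sketch of that rule, combining an attempt cap with a wall-clock budget (parameter names are illustrative, not redress's):

```python
import time

def retry_with_budget(op, max_attempts=3, time_budget_s=2.0, base_delay_s=0.05):
    """Retry op() under both an attempt cap and a wall-clock budget."""
    deadline = time.monotonic() + time_budget_s
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:
            last_exc = exc
            delay = base_delay_s * (2 ** (attempt - 1))
            # If the next wait would blow the budget: stop and surface.
            if time.monotonic() + delay > deadline:
                break
            time.sleep(delay)
    raise last_exc
```

Either limit can fire first; whichever does, the caller gets the real error rather than an open-ended wait.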
4) Capped exponential backoff with jitter
Backoff is a control system, not a vibe.
- exponential backoff smooths contention
- jitter prevents synchronization
- caps prevent retries from turning into multi-minute tail inflation
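One well-known way to combine all three properties is "full jitter": pick a uniformly random delay between zero and the capped exponential value. A sketch (constants are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Capped exponential backoff with full jitter:
    delay = uniform(0, min(cap, base * 2**attempt))."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The cap bounds the worst case, the exponent spaces out sustained contention, and the randomness ensures two clients that failed at the same instant do not retry at the same instant.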
5) Load-aware behavior (don't melt your dependencies)
Retries should cooperate with backpressure.
That means:
- concurrency caps
- token buckets / rate limits where appropriate
- making "retry storms" mechanically difficult
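One mechanism that makes retry storms mechanically difficult is a token bucket in front of the retry path: each retry spends a token, tokens refill slowly, and a storm drains the bucket so retries simply stop. A sketch (this is an illustration of the idea, not redress's actual mechanism):

```python
import time

class RetryTokenBucket:
    """Gate retries behind a token bucket so bursts are bounded."""

    def __init__(self, capacity: float = 10.0, refill_per_s: float = 1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_s = refill_per_s
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, clamped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

When `try_acquire` returns `False`, the correct move is to fail fast, not to queue the retry.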
6) Observability is part of the contract
Every retry strategy should emit structured signals:
- retry class
- attempt count
- chosen delay
- final outcome
- correlation to dependency errors/latency
If you can't see retries, you can't manage them.
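The signals above can be emitted through a hook that receives one structured event per attempt. A sketch (the event shape is hypothetical, and the class/correlation fields are elided for brevity; wire the hook to your logger or metrics pipeline):

```python
import time

def retry_observed(op, max_attempts=3, hook=lambda event: None):
    """Retry loop that emits a structured event for every attempt."""
    for attempt in range(1, max_attempts + 1):
        delay_s = 0.01 * (2 ** (attempt - 1))  # placeholder backoff
        try:
            result = op()
            hook({"attempt": attempt, "outcome": "success"})
            return result
        except Exception as exc:
            hook({"attempt": attempt, "outcome": "failure",
                  "error": type(exc).__name__, "chosen_delay_s": delay_s})
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)
```

Because every attempt (including the last, failing one) produces an event, dashboards can distinguish "succeeded on first try" from "succeeded after three retries" - which look identical if you only log final outcomes.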
What redress implements (at a high level)
Redress models retry behavior as a policy:
- a classifier maps failures into retry classes
- each class has its own strategy (delay function, caps, budgets)
- the policy emits structured hooks so retries are measurable and debuggable
This turns retry behavior from scattered ad-hoc loops into a single owned mechanism with clear defaults.
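Structurally, that composition looks something like the sketch below: a classifier output selects a per-class strategy, and one loop owns the mechanics. The strategy fields and defaults here are hypothetical, not redress's real ones:

```python
import time
from dataclasses import dataclass

@dataclass
class ClassStrategy:
    max_attempts: int
    base_delay_s: float

# Hypothetical per-class defaults for illustration.
POLICY = {
    "transient": ClassStrategy(max_attempts=4, base_delay_s=0.05),
    "capacity":  ClassStrategy(max_attempts=3, base_delay_s=0.5),
    "permanent": ClassStrategy(max_attempts=1, base_delay_s=0.0),
    "unknown":   ClassStrategy(max_attempts=2, base_delay_s=0.1),
}

def run_with_policy(op, classify, policy=POLICY):
    """One owned retry loop: the class of each failure picks the strategy."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except Exception as exc:
            strategy = policy[classify(exc)]
            if attempt >= strategy.max_attempts:
                raise
            time.sleep(strategy.base_delay_s * (2 ** (attempt - 1)))
```

Note that "permanent" is just a class whose strategy allows one attempt - "don't retry" falls out of the same mechanism instead of being a special case.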
When you should not retry
A good retry library makes "don't retry" easy.
Do not retry:
- validation / schema errors
- business logic invariants
- known poison pills
- non-idempotent side effects without protection
In those cases, correct handling is usually:
- quarantine
- dead-letter
- operator workflow + safe replay
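The quarantine / dead-letter / replay path can be sketched minimally. This uses an in-memory list purely for illustration; a real system would persist entries to a durable dead-letter queue and gate replay behind an operator action:

```python
dead_letters = []

def quarantine(message: dict, error: Exception) -> None:
    """Route a non-retryable failure to a dead-letter store instead of retrying."""
    dead_letters.append({
        "message": message,
        "error": type(error).__name__,
        "detail": str(error),
    })

def replay(handler) -> None:
    """Operator-driven safe replay: re-run quarantined messages after a fix."""
    while dead_letters:
        entry = dead_letters.pop(0)
        handler(entry["message"])
```

The asymmetry is deliberate: quarantining is automatic and cheap, while replay is explicit, so a poison pill can never re-enter the hot path on its own.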
If you want the detailed design tradeoffs, examples, and implementation notes: