Why redress

3 min read · Retries · Operability

My retry philosophy: classification-first, bounded unknowns, capped exponential backoff, and observability hooks.

Short version. The full deep dive includes design tradeoffs, examples, and implementation notes.

Most retry logic in the wild is either too naive ("just retry") or too timid ("never retry"). In real systems, partial failure is normal, dependencies are flaky, and tail latency is where user experience goes to die.

My bias is simple: retries should improve outcomes without amplifying load or hiding real failures. Redress exists to make that behavior explicit, predictable, and observable.

The problem with naive retries

Naive retries fail in predictable ways:

  • They amplify load. When a dependency is unhealthy, retries increase traffic exactly when the dependency can least handle it.
  • They shift work into the tail. You might "succeed" more often, but your p95/p99 explodes.
  • They synchronize clients. Exponential backoff without jitter creates thundering herds.
  • They hide incidents. The system looks "fine" until it isn't - and then you have no clear signal about what actually happened.
  • They treat all failures the same. Timeouts, 429s, validation errors, and logic bugs should not share a strategy.

The result: retries become a silent control plane that nobody owns.

The philosophy: policy, not hope

Redress is built around a small set of rules.

1) Classification-first

Retry is not a boolean. You need classes of failure with different behavior.

Examples:

  • Transient (timeouts, connection resets) -> retry cautiously
  • Capacity / throttling (429s, queue full) -> retry with stronger backoff + lower concurrency
  • Permanent (validation, schema mismatch) -> don't retry; quarantine or drop
  • Unknown -> bounded retries with tight budgets (fail fast)
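
Classification-first means the first question is "what kind of failure is this?", not "should I retry?". A minimal sketch of that idea in Python; the `RetryClass` names and `classify` mapping are illustrative assumptions, not redress's actual API:

```python
# Sketch of classification-first retries. The class names and the classify()
# mapping are illustrative, not redress's public API.
from enum import Enum, auto
from typing import Optional


class RetryClass(Enum):
    TRANSIENT = auto()   # timeouts, connection resets -> retry cautiously
    CAPACITY = auto()    # 429s, queue full -> stronger backoff, lower concurrency
    PERMANENT = auto()   # validation, schema mismatch -> quarantine or drop
    UNKNOWN = auto()     # unexplained -> bounded retries with tight budgets


def classify(error: Exception, status: Optional[int] = None) -> RetryClass:
    """Map a failure to a retry class instead of a retry/no-retry boolean."""
    if status == 429:
        return RetryClass.CAPACITY
    if status is not None and 400 <= status < 500:
        # Simplification for the sketch: treat other 4xx as caller error.
        return RetryClass.PERMANENT
    if isinstance(error, (TimeoutError, ConnectionResetError)):
        return RetryClass.TRANSIENT
    return RetryClass.UNKNOWN
```

The payoff is that each class can carry its own strategy downstream, instead of one loop guessing for all of them.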

2) Bounded unknowns

If you can't explain why you're retrying, you don't get infinite attempts.

Defaults should be safe:

  • small attempt caps
  • strict time budgets
  • aggressive observability

3) Budgeted retries

Retries need a budget - time and/or attempts - so they can't silently dominate end-to-end latency.

A practical rule:

  • "If we can't succeed quickly, stop and surface."
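
That rule can be made mechanical with a combined attempt-and-time budget. A sketch (function and parameter names are assumptions, and backoff between attempts is omitted for brevity):

```python
# Illustrative time-and-attempt budget; not redress's actual interface.
import time


def run_with_budget(op, max_attempts=3, time_budget_s=2.0):
    """Retry op() until it succeeds, attempts run out, or the budget is spent."""
    deadline = time.monotonic() + time_budget_s
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        if time.monotonic() >= deadline:
            break  # budget exhausted: stop and surface
        try:
            return op()
        except Exception as exc:
            last_exc = exc  # backoff/sleep would go here; omitted for brevity
    raise last_exc  # surface the failure instead of hiding it
```

Either cap firing ends the loop, so retries can never silently dominate end-to-end latency.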

4) Capped exponential backoff with jitter

Backoff is a control system, not a vibe.

  • exponential backoff smooths contention
  • jitter prevents synchronization
  • caps prevent retries from turning into multi-minute tail inflation
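
All three properties fit in a few lines. This sketch uses the "full jitter" variant (delay drawn uniformly from zero up to the capped exponential); the base and cap values are illustrative defaults, not redress's:

```python
# Capped exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)].
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Jittered delay for the given attempt number (0-indexed), in seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The cap bounds the worst case, the exponential spreads contention over time, and the jitter keeps independent clients from retrying in lockstep.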

5) Load-aware behavior (don't melt your dependencies)

Retries should cooperate with backpressure.

That means:

  • concurrency caps
  • token buckets / rate limits where appropriate
  • making "retry storms" mechanically difficult
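
One way to make retry storms mechanically difficult is to gate retry attempts behind a token bucket, so retries compete for a bounded allowance instead of multiplying freely. A minimal sketch, not redress's implementation:

```python
# Minimal token bucket gating retry attempts; a sketch, not redress's code.
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend a token if available; a caller that fails must not retry now."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

When the bucket is empty, the honest move is to surface the failure, not to queue more retries against a dependency that is already struggling.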

6) Observability is part of the contract

Every retry strategy should emit structured signals:

  • retry class
  • attempt count
  • chosen delay
  • final outcome
  • correlation to dependency errors/latency

If you can't see retries, you can't manage them.

What redress implements (at a high level)

Redress models retry behavior as a policy:

  • a classifier maps failures into retry classes
  • each class has its own strategy (delay function, caps, budgets)
  • the policy emits structured hooks so retries are measurable and debuggable

This turns retry behavior from scattered ad-hoc loops into a single owned mechanism with clear defaults.
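
Tying the pieces together, a policy can be as small as a per-class strategy table plus a hook that emits a structured event on every decision. Everything here (the table, `next_delay`, the event fields) is an illustrative sketch, not redress's public API:

```python
# Policy sketch: retry class -> strategy -> jittered delay + structured event.
# All names and numbers are illustrative, not redress's public API.
import random

STRATEGIES = {
    "transient": {"max_attempts": 4, "base": 0.1, "cap": 5.0},
    "capacity":  {"max_attempts": 3, "base": 0.5, "cap": 30.0},
    "permanent": {"max_attempts": 1, "base": 0.0, "cap": 0.0},  # never retry
    "unknown":   {"max_attempts": 2, "base": 0.2, "cap": 2.0},  # bounded
}


def next_delay(retry_class, attempt, on_event=print):
    """Return the jittered delay before the next attempt, or None to stop.

    on_event receives a structured record: class, attempt, delay, outcome.
    """
    s = STRATEGIES[retry_class]
    if attempt >= s["max_attempts"]:
        on_event({"class": retry_class, "attempt": attempt, "outcome": "give_up"})
        return None
    delay = random.uniform(0.0, min(s["cap"], s["base"] * (2 ** attempt)))
    on_event({"class": retry_class, "attempt": attempt, "delay": delay,
              "outcome": "retry"})
    return delay
```

Because every decision flows through one function with one hook, the signals listed above (class, attempt, delay, outcome) fall out for free rather than being bolted on per call site.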

When you should not retry

A good retry library makes "don't retry" easy.

Do not retry:

  • validation / schema errors
  • business logic invariants
  • known poison pills
  • non-idempotent side effects without protection

In those cases, correct handling is usually:

  • quarantine
  • dead-letter
  • operator workflow + safe replay

If you want the detailed design tradeoffs, examples, and implementation notes, see the full deep dive.