Why redress

3 min read · Retries · Operability

My retry philosophy: classification-first, bounded unknowns, capped exponential backoff, and observability hooks.

Short version. The full deep dive includes design tradeoffs, examples, and implementation notes.

Most retry logic in the wild is either too naive ("just retry") or too timid ("never retry"). In real systems, partial failure is normal, dependencies are flaky, and tail latency is where user experience goes to die.

My bias is simple: retries should improve outcomes without amplifying load or hiding real failures. Redress exists to make that behavior explicit, predictable, and observable.

The problem with naive retries

Naive retries fail in predictable ways:

  • They amplify load. When a dependency is unhealthy, retries increase traffic exactly when the dependency can least handle it.
  • They shift work into the tail. You might "succeed" more often, but your p95/p99 explodes.
  • They synchronize clients. Exponential backoff without jitter creates thundering herds.
  • They hide incidents. The system looks "fine" until it isn't - and then you have no clear signal about what actually happened.
  • They treat all failures the same. Timeouts, 429s, validation errors, and logic bugs should not share a strategy.

The result: retries become a silent control plane that nobody owns.

The philosophy: policy, not hope

Redress is built around a small set of rules.

1) Classification-first

Retry is not a boolean. You need classes of failure with different behavior.

Examples:

  • Transient (timeouts, connection resets) -> retry cautiously
  • Capacity / throttling (429s, queue full) -> retry with stronger backoff + lower concurrency
  • Permanent (validation, schema mismatch) -> don't retry; quarantine or drop
  • Unknown -> bounded retries with tight budgets (fail fast)
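
Classification-first means the first question is "what kind of failure is this?", not "should I retry?". A minimal sketch of that idea in Python; the `RetryClass` names and `classify` mapping are illustrative assumptions, not redress's actual API:

```python
# Sketch of classification-first retries. The class names and the classify()
# mapping are illustrative, not redress's public API.
from enum import Enum, auto
from typing import Optional


class RetryClass(Enum):
    TRANSIENT = auto()   # timeouts, connection resets -> retry cautiously
    CAPACITY = auto()    # 429s, queue full -> stronger backoff, lower concurrency
    PERMANENT = auto()   # validation, schema mismatch -> quarantine or drop
    UNKNOWN = auto()     # unexplained -> bounded retries with tight budgets


def classify(error: Exception, status: Optional[int] = None) -> RetryClass:
    """Map a failure to a retry class instead of a retry/no-retry boolean."""
    if status == 429:
        return RetryClass.CAPACITY
    if status is not None and 400 <= status < 500:
        # Simplification for the sketch: treat other 4xx as caller error.
        return RetryClass.PERMANENT
    if isinstance(error, (TimeoutError, ConnectionResetError)):
        return RetryClass.TRANSIENT
    return RetryClass.UNKNOWN
```

The payoff is that each class can carry its own strategy downstream, instead of one loop guessing for all of them.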

2) Bounded unknowns

If you can't explain why you're retrying, you don't get infinite attempts.

Defaults should be safe:

  • small attempt caps
  • strict time budgets
  • aggressive observability

3) Budgeted retries

Retries need a budget - time and/or attempts - so they can't silently dominate end-to-end latency.

A practical rule:

  • "If we can't succeed quickly, stop and surface."
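
That rule can be made mechanical with a combined attempt-and-time budget. A sketch (function and parameter names are assumptions, and backoff between attempts is omitted for brevity):

```python
# Illustrative time-and-attempt budget; not redress's actual interface.
import time


def run_with_budget(op, max_attempts=3, time_budget_s=2.0):
    """Retry op() until it succeeds, attempts run out, or the budget is spent."""
    deadline = time.monotonic() + time_budget_s
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        if time.monotonic() >= deadline:
            break  # budget exhausted: stop and surface
        try:
            return op()
        except Exception as exc:
            last_exc = exc  # backoff/sleep would go here; omitted for brevity
    raise last_exc  # surface the failure instead of hiding it
```

Either cap firing ends the loop, so retries can never silently dominate end-to-end latency.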

4) Capped exponential backoff with jitter

Backoff is a control system, not a vibe.

  • exponential backoff smooths contention
  • jitter prevents synchronization
  • caps prevent retries from turning into multi-minute tail inflation
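
All three properties fit in a few lines. This sketch uses the "full jitter" variant (delay drawn uniformly from zero up to the capped exponential); the base and cap values are illustrative defaults, not redress's:

```python
# Capped exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)].
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Jittered delay for the given attempt number (0-indexed), in seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The cap bounds the worst case, the exponential spreads contention over time, and the jitter keeps independent clients from retrying in lockstep.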

5) Load-aware behavior (don't melt your dependencies)

Retries should cooperate with backpressure.

That means:

  • concurrency caps
  • token buckets / rate limits where appropriate
  • making "retry storms" mechanically difficult
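
One way to make retry storms mechanically difficult is to gate retry attempts behind a token bucket, so retries compete for a bounded allowance instead of multiplying freely. A minimal sketch, not redress's implementation:

```python
# Minimal token bucket gating retry attempts; a sketch, not redress's code.
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend a token if available; a caller that fails must not retry now."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

When the bucket is empty, the honest move is to surface the failure, not to queue more retries against a dependency that is already struggling.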

6) Observability is part of the contract

Every retry strategy should emit structured signals:

  • retry class
  • attempt count
  • chosen delay
  • final outcome
  • correlation to dependency errors/latency

If you can't see retries, you can't manage them.

What redress implements (at a high level)

Redress models retry behavior as a policy:

  • a classifier maps failures into retry classes
  • each class has its own strategy (delay function, caps, budgets)
  • the policy emits structured hooks so retries are measurable and debuggable

This turns retry behavior from scattered ad-hoc loops into a single owned mechanism with clear defaults.
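
Tying the pieces together, a policy can be as small as a per-class strategy table plus a hook that emits a structured event on every decision. Everything here (the table, `next_delay`, the event fields) is an illustrative sketch, not redress's public API:

```python
# Policy sketch: retry class -> strategy -> jittered delay + structured event.
# All names and numbers are illustrative, not redress's public API.
import random

STRATEGIES = {
    "transient": {"max_attempts": 4, "base": 0.1, "cap": 5.0},
    "capacity":  {"max_attempts": 3, "base": 0.5, "cap": 30.0},
    "permanent": {"max_attempts": 1, "base": 0.0, "cap": 0.0},  # never retry
    "unknown":   {"max_attempts": 2, "base": 0.2, "cap": 2.0},  # bounded
}


def next_delay(retry_class, attempt, on_event=print):
    """Return the jittered delay before the next attempt, or None to stop.

    on_event receives a structured record: class, attempt, delay, outcome.
    """
    s = STRATEGIES[retry_class]
    if attempt >= s["max_attempts"]:
        on_event({"class": retry_class, "attempt": attempt, "outcome": "give_up"})
        return None
    delay = random.uniform(0.0, min(s["cap"], s["base"] * (2 ** attempt)))
    on_event({"class": retry_class, "attempt": attempt, "delay": delay,
              "outcome": "retry"})
    return delay
```

Because every decision flows through one function with one hook, the signals listed above (class, attempt, delay, outcome) fall out for free rather than being bolted on per call site.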

When you should not retry

A good retry library makes "don't retry" easy.

Do not retry:

  • validation / schema errors
  • business logic invariants
  • known poison pills
  • non-idempotent side effects without protection

In those cases, correct handling is usually:

  • quarantine
  • dead-letter
  • operator workflow + safe replay

If you want the detailed design tradeoffs, examples, and implementation notes, see the full deep dive.