
Why recourse

6 min read · Resilience · Operability · Go · Retries

Policy-driven resilience for Go services: consistent retries, explicit backpressure budgets, hedging, and circuit breaking - with explainable observability.

Short version. The full deep dive includes design tradeoffs, examples, and implementation notes.

Retry logic is one of those things that always feels easy - until it isn’t.

Most of the time it gets added under pressure:

  • “Just retry a couple times.”
  • “Add exponential backoff.”
  • “Try it again if it fails.”

It works in dev. It mostly works in staging. Then production hits: a dependency gets flaky, latency spikes, or a downstream starts rate-limiting… and suddenly “simple retries” turn a small blip into a real incident.

recourse is my attempt to treat resilience like an operational guarantee, not a call-site habit: consistent, bounded, and observable - with explicit backpressure so retries don’t become self-inflicted outages.


The core problem: retry loops aren’t the hard part

A retry loop is trivial. The hard part is everything you eventually need around it:

  • Is the failure retryable (500 vs 404 vs 429)?
  • Are you respecting per-attempt vs overall timeouts?
  • Do you have explicit backpressure so an outage doesn’t become a retry storm?
  • When it’s 3am, can you answer what happened on each attempt - and why?

If those questions don’t have crisp answers, your “retry logic” is really just latent incident fuel.
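
For contrast, here’s roughly the shape that “just retry a couple times” takes in practice - not recourse code, just the naive baseline the rest of this post argues against:

```go
package naive

import (
	"context"
	"time"
)

// A typical "simple retry": it retries every error, has no overall deadline,
// and applies no backpressure to a dependency that is already struggling.
func callWithRetry(ctx context.Context, do func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = do(ctx); err == nil {
			return nil
		}
		// Exponential backoff - but a 404 is retried just as eagerly as a 503,
		// and nothing bounds the total time spent in this loop.
		time.Sleep(time.Duration(100<<attempt) * time.Millisecond)
	}
	return err
}
```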


Design bet #1: move resilience out of the call site (policy-driven envelopes)

Most resilience tooling is mechanism-first: you configure attempts and delays right where you call the dependency.

recourse flips that around: the call site supplies a policy key, and the resilience envelope is defined centrally by policy.

That policy can control:

  • max attempts
  • backoff + jitter
  • per-attempt and overall timeouts
  • how errors/results get interpreted (classification)
  • backpressure budgets
  • hedging behavior
  • circuit breaking behavior

The goal is simple: call sites should be boring. Resilience behavior should be consistent, tunable, and auditable without re-implementing “retry loops” everywhere.
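
I won’t reproduce the real API here, but the division of labor looks roughly like this - the type and function names below are illustrative, not recourse’s actual surface:

```go
package example

import (
	"context"
	"time"
)

// Illustrative policy shape - the real envelope lives in recourse's own types.
type Policy struct {
	MaxAttempts int
	BaseBackoff time.Duration
	Jitter      bool
	PerAttempt  time.Duration // timeout for each attempt
	Overall     time.Duration // deadline for the whole call
	Budget      string        // which backpressure budget gates attempts
	HedgeAfter  time.Duration // zero = no hedging
}

// One central registry: the place you audit and tune behavior.
var policies = map[string]Policy{
	"payments.charge": {
		MaxAttempts: 3,
		BaseBackoff: 50 * time.Millisecond,
		Jitter:      true,
		PerAttempt:  500 * time.Millisecond,
		Overall:     2 * time.Second,
		Budget:      "payments",
	},
}

// The call site stays boring: it names a policy key and supplies the work.
func charge(ctx context.Context) error {
	return execute(ctx, "payments.charge", func(ctx context.Context) error {
		// ... call the payments dependency ...
		return nil
	})
}

// execute looks up the policy for the key and runs the attempt loop around fn.
// The loop itself is elided here - that's the library's job, not the call site's.
func execute(ctx context.Context, key string, fn func(context.Context) error) error {
	_ = policies[key]
	return fn(ctx)
}
```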


Design bet #2: keys must be low-cardinality (or you’ll regret it)

Once the key is the unit of control, it also becomes the unit of:

  • policy selection
  • observability dimensions
  • caches and trackers (including latency tracking for hedging)

So the key cannot include things like user IDs, request IDs, or per-entity routing details.

If your keys are high-cardinality, you get:

  • unbounded memory growth
  • unusable metrics
  • confusing operational data

This is not pedantry. It’s survival.
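
A rule of thumb, in my phrasing: the key names the dependency and the operation, never the entity.

```go
package example

import "fmt"

// Good: a bounded set of keys, one per dependency + operation.
const getSKUKey = "inventory.get_sku"

// Bad: cardinality grows with traffic - and with it, policy caches,
// hedging latency trackers, and metric label sets.
func badKey(skuID string) string {
	return fmt.Sprintf("inventory.get_sku.%s", skuID)
}
```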


Guardrails aren’t optional: bounded envelopes + policy normalization

One reason retries go off the rails is configuration drift:

  • too many attempts
  • tiny timeouts that busy-loop
  • backoff values that don’t make sense
  • invalid enum values

recourse treats unsafe configuration as a first-class failure mode. Policies are normalized and clamped into documented safe ranges, with metadata captured so you can see what got changed and why.

And when policy resolution fails, behavior is explicit and conservative by default. You don’t want “silent fallback” to become your production personality.
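
To make “normalized and clamped” concrete, here’s the spirit of it - the fields and safe ranges below are invented for illustration; the real ones are the library’s:

```go
package example

import (
	"fmt"
	"time"
)

// Invented safe ranges - the real ones are documented by the library.
const (
	maxAttemptsCap = 10
	minPerAttempt  = 10 * time.Millisecond
	maxOverall     = 30 * time.Second
)

type retryPolicy struct {
	MaxAttempts int
	PerAttempt  time.Duration
	Overall     time.Duration
}

// An adjustment records what was changed, so clamping never happens silently.
type adjustment struct {
	Field    string
	From, To string
}

func normalize(p retryPolicy) (retryPolicy, []adjustment) {
	var adj []adjustment
	if p.MaxAttempts > maxAttemptsCap {
		adj = append(adj, adjustment{"MaxAttempts", fmt.Sprint(p.MaxAttempts), fmt.Sprint(maxAttemptsCap)})
		p.MaxAttempts = maxAttemptsCap
	}
	if p.PerAttempt > 0 && p.PerAttempt < minPerAttempt {
		adj = append(adj, adjustment{"PerAttempt", p.PerAttempt.String(), minPerAttempt.String()})
		p.PerAttempt = minPerAttempt // a 1ms attempt timeout is a busy-loop, not a timeout
	}
	if p.Overall > maxOverall {
		adj = append(adj, adjustment{"Overall", p.Overall.String(), maxOverall.String()})
		p.Overall = maxOverall
	}
	return p, adj
}
```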


Not every failure is retryable: classifiers, not heuristics

Naive retries treat all failures the same. Real systems don’t.

recourse makes this explicit via classification: map (value, error) into an outcome, and let the executor decide whether to retry, stop, or abort.

This matters most at the sharp edges:

  • HTTP semantics (5xx vs 404 vs 429, `Retry-After`)
  • gRPC status codes
  • type mismatches between what a classifier expects and what it receives

A key safety rule: if something doesn’t match the expected shape, fail loudly and safely - don’t degrade into “retry blindly.”
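
The HTTP case is the clearest way to see why classification earns its ceremony. A hand-rolled classifier in that spirit (outcome names and shapes here are illustrative, not recourse’s types):

```go
package example

import (
	"net/http"
	"strconv"
	"time"
)

// Illustrative outcome type - recourse defines its own classification types.
type Outcome int

const (
	Success Outcome = iota
	Retry           // transient: another attempt may help
	Stop            // permanent: retrying cannot help (e.g. 404)
	Abort           // the call itself is broken; fail loudly, don't retry blindly
)

// classifyHTTP maps a response/error pair to an outcome, plus an optional
// server-suggested delay parsed from Retry-After.
func classifyHTTP(resp *http.Response, err error) (Outcome, time.Duration) {
	if err != nil {
		return Retry, 0 // transport-level failure: usually transient
	}
	switch {
	case resp.StatusCode == http.StatusTooManyRequests: // 429: back off as asked
		if d, ok := parseRetryAfter(resp.Header.Get("Retry-After")); ok {
			return Retry, d
		}
		return Retry, 0
	case resp.StatusCode >= 500: // 5xx: the server is unhappy, not the request
		return Retry, 0
	case resp.StatusCode == http.StatusNotFound: // 404: retrying won't create the resource
		return Stop, 0
	case resp.StatusCode >= 400: // other 4xx: the request itself is wrong
		return Stop, 0
	default:
		return Success, 0
	}
}

// Retry-After can be either delay-seconds or an HTTP date.
func parseRetryAfter(v string) (time.Duration, bool) {
	if v == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second, true
	}
	if t, err := http.ParseTime(v); err == nil {
		return time.Until(t), true
	}
	return 0, false
}
```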


Observability-first: retries are only “safe” when they’re explainable

Retries hide behavior unless you surface it deliberately. recourse supports two complementary paths:

1) A per-call timeline (for debugging)

When you need to understand a single call, you can capture a structured timeline: attempts, outcomes, delays, errors, budget decisions, hedges, and call-level metadata (policy source, normalization/clamping).

This is the fastest way to answer: “What happened on each attempt, and why?”
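
Roughly what such a record carries - field names here are mine, not the library’s:

```go
package example

import "time"

// Illustrative shape of a per-call timeline; the real structure is recourse's.
type AttemptRecord struct {
	Number   int
	Outcome  string        // e.g. "success", "retryable_error", "budget_denied"
	Err      string        // error text, if any
	Backoff  time.Duration // delay applied before this attempt
	Hedged   bool          // was this a hedge attempt?
	Duration time.Duration
}

type CallTimeline struct {
	PolicyKey    string
	PolicySource string   // static, remote, fallback, ...
	Adjustments  []string // normalization/clamping applied to the policy
	Attempts     []AttemptRecord
	Total        time.Duration
	Success      bool
}
```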

2) Streaming observer hooks (for metrics/logs/traces)

For normal operations, an observer gets lifecycle events: start/success/failure, attempt decisions, hedge events, budget decisions, circuit breaker decisions - with standardized reasons.
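
In spirit, the hook surface is an interface over lifecycle events - something shaped like this (names invented for illustration, not recourse’s):

```go
package example

import "time"

// Illustrative observer interface - the real hook surface is recourse's.
// The useful property: every decision arrives as an event with a standardized
// reason string, so dashboards don't have to be reverse-engineered from logs.
type Observer interface {
	CallStarted(key string)
	AttemptFinished(key string, attempt int, outcome string, took time.Duration)
	RetryScheduled(key string, attempt int, delay time.Duration, reason string)
	HedgeSpawned(key string, attempt int, reason string)
	BudgetDecision(key string, allowed bool, reason string)
	BreakerDecision(key string, state string, reason string)
	CallFinished(key string, success bool, attempts int, took time.Duration)
}
```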

Point being: recourse doesn’t just retry. It tells you exactly what it did.


Backpressure: budgets make retries outage-safe

Retries and hedges multiply load. If a dependency is struggling, naive retries can turn “slow” into “down.”

recourse treats backpressure as part of the retry contract - every attempt can be gated by a budget, which can:

  • allow
  • deny (with a recorded reason)
  • optionally reserve/release capacity

Budgets can be simple (unlimited) or real (token buckets with refill rates). The key idea: you should be able to say, mechanically, “this attempt is not allowed right now” - and have that show up in telemetry as an explicit decision, not a mystery.
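
For intuition, the classic shape of a “real” budget is a token bucket refilled as a fraction of primary traffic. A minimal sketch, not recourse’s implementation:

```go
package example

import (
	"math"
	"sync"
)

// A minimal token-bucket retry budget: each primary request earns a fraction
// of a token; each retry or hedge must spend a whole one. When the bucket is
// empty, the extra attempt is denied - explicitly, with a recorded reason.
type retryBudget struct {
	mu     sync.Mutex
	tokens float64
	max    float64
	ratio  float64 // tokens earned per primary request; 0.1 ~= "10% retry budget"
}

func newRetryBudget(max, ratio float64) *retryBudget {
	return &retryBudget{tokens: max, max: max, ratio: ratio}
}

// OnPrimary is called once per first attempt and refills the budget.
func (b *retryBudget) OnPrimary() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens = math.Min(b.tokens+b.ratio, b.max)
}

// Allow gates an extra attempt (retry or hedge).
func (b *retryBudget) Allow() (ok bool, reason string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < 1 {
		return false, "retry_budget_exhausted"
	}
	b.tokens--
	return true, ""
}
```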


Tail latency is also reliability: hedging

Even if averages are fine, p99/p999 spikes can dominate user experience and upstream timeouts.

recourse supports hedging: start another attempt while the first is still in flight, race them, and let first success win - but only when it’s configured and budgeted.

There are two common hedging triggers:

  • fixed delay (spawn after N ms)
  • latency-aware (spawn based on recent tail latency tracking)

Hedging is powerful and dangerous. It works only when paired with budgets, deterministic cancellation behavior, and first-class observability.
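
A minimal fixed-delay hedge, to show the mechanics - this sketch deliberately leaves out the budget check and event emission that make hedging safe to run for real:

```go
package example

import (
	"context"
	"time"
)

// hedged runs fn; if it hasn't returned after hedgeAfter, a second racing
// attempt is started. The first success wins, and cancelling the shared
// context on return tells whichever attempt loses to stop.
func hedged[T any](ctx context.Context, hedgeAfter time.Duration, fn func(context.Context) (T, error)) (T, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		val T
		err error
	}
	results := make(chan result, 2) // buffered so the losing goroutine never leaks

	attempt := func() {
		go func() {
			v, err := fn(ctx)
			results <- result{v, err}
		}()
	}

	attempt() // primary
	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	launched, finished := 1, 0
	for {
		select {
		case <-timer.C:
			attempt() // primary is still in flight: spawn one hedge
			launched = 2
		case r := <-results:
			finished++
			if r.err == nil || finished == launched {
				// First success wins, or everything we launched has failed.
				// (A primary that fails before the hedge fires is the retry
				// layer's problem, not hedging's.)
				return r.val, r.err
			}
		case <-ctx.Done():
			var zero T
			return zero, ctx.Err()
		}
	}
}
```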


When it’s failing, stop sending traffic: circuit breaking

Retries help when failures are transient. When a downstream is persistently failing, retries just add fuel.

recourse includes a standard circuit breaker model (closed/open/half-open), integrated with the rest of the execution loop so you don’t bolt “breaker behavior” on as an afterthought.
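
For completeness, the state machine in sketch form - consecutive-failure counting here for brevity, where a production breaker would more likely use rolling error rates:

```go
package example

import (
	"sync"
	"time"
)

type breakerState int

const (
	closed   breakerState = iota // normal: requests flow, failures are counted
	open                         // tripped: requests are rejected immediately
	halfOpen                     // probing: let a trial request through
)

type breaker struct {
	mu          sync.Mutex
	state       breakerState
	failures    int
	maxFailures int           // consecutive failures before tripping
	cooldown    time.Duration // how long to stay open before probing
	openedAt    time.Time
}

// Allow reports whether a request should be attempted right now.
func (b *breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == open {
		if time.Since(b.openedAt) < b.cooldown {
			return false
		}
		b.state = halfOpen // cooldown elapsed: permit a probe
	}
	return true
}

// Record feeds the outcome of an attempt back into the state machine.
func (b *breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.state, b.failures = closed, 0 // success closes the breaker
		return
	}
	b.failures++
	if b.state == halfOpen || b.failures >= b.maxFailures {
		b.state, b.openedAt = open, time.Now() // trip (or re-open after a failed probe)
	}
}
```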


Operate it like a system: remote configuration

If you buy the policy-driven model, the next step is obvious: update policies without redeploying.

recourse supports a remote policy provider with:

  • TTL caching for fetched policies
  • negative caching for missing policies (to prevent hot-spotting on nonexistent keys)
  • explicit fallback behavior when the source is unavailable

This is the operational through-line: centralize control, avoid hammering your control plane, and make fallback behavior explicit.
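
The caching behavior is the part worth spelling out. A sketch of TTL plus negative caching - illustrative, not the provider’s actual code:

```go
package example

import (
	"context"
	"errors"
	"sync"
	"time"
)

var errPolicyNotFound = errors.New("policy not found")

type remotePolicy struct{ MaxAttempts int } // stand-in for the real policy type

type cacheEntry struct {
	policy    *remotePolicy // nil marks a negative entry (key known to be missing)
	expiresAt time.Time
}

// remoteProvider fetches policies from a control plane and caches the results,
// including misses, so a hot nonexistent key can't hammer the source.
// (A real provider would also single-flight concurrent fetches for the same key.)
type remoteProvider struct {
	mu          sync.Mutex
	cache       map[string]cacheEntry
	fetch       func(ctx context.Context, key string) (*remotePolicy, error)
	ttl         time.Duration // how long a fetched policy is trusted
	negativeTTL time.Duration // how long "not found" is remembered
}

func (p *remoteProvider) Get(ctx context.Context, key string) (*remotePolicy, error) {
	p.mu.Lock()
	if e, ok := p.cache[key]; ok && time.Now().Before(e.expiresAt) {
		p.mu.Unlock()
		if e.policy == nil {
			return nil, errPolicyNotFound // negative cache hit
		}
		return e.policy, nil
	}
	p.mu.Unlock()

	pol, err := p.fetch(ctx, key)
	switch {
	case errors.Is(err, errPolicyNotFound):
		p.store(key, nil, p.negativeTTL)
		return nil, err
	case err != nil:
		// Source unavailable: the caller falls back explicitly (for example to a
		// conservative default policy) rather than silently doing nothing.
		return nil, err
	default:
		p.store(key, pol, p.ttl)
		return pol, nil
	}
}

func (p *remoteProvider) store(key string, pol *remotePolicy, ttl time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cache == nil {
		p.cache = make(map[string]cacheEntry)
	}
	p.cache[key] = cacheEntry{policy: pol, expiresAt: time.Now().Add(ttl)}
}
```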


Integration philosophy: standard library first, dependencies opt-in

In Go, resilience tooling often turns into “a framework.”

recourse tries hard not to:

  • integrations target standard interfaces (e.g., `net/http`, gRPC interceptors)
  • heavy deps live in separate modules
  • helpers handle correctness edge cases you don’t want every call site re-learning (like HTTP response-body handling on retries)
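
That last point deserves one concrete illustration, because it’s the kind of edge case that silently costs connection reuse: before retrying an HTTP call, the previous attempt’s response body has to be drained and closed. Hand-rolled here, not recourse’s helper:

```go
package example

import (
	"io"
	"net/http"
)

// drainAndClose discards any unread response body and closes it, so the
// underlying connection can go back into the pool instead of being torn down.
// Skip this on a retry path and each retry quietly pays for a fresh connection.
func drainAndClose(resp *http.Response) {
	if resp == nil || resp.Body == nil {
		return
	}
	_, _ = io.Copy(io.Discard, io.LimitReader(resp.Body, 4<<10)) // cap how much we drain
	resp.Body.Close()
}
```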

Getting started: three depth levels

You should be able to adopt resilience incrementally:

1) Level 1: “Just do it” - minimal friction, best-effort defaults
2) Level 2: Make it explainable - capture timelines when debugging real behavior
3) Level 3: Own the behavior - explicit executors, registries, standardized defaults across a service fleet


Why this design matters

recourse isn’t trying to out-feature every retry library. It’s trying to make resilience behavior:

  • policy-driven (keys select behavior, not ad-hoc config)
  • semantic (classification rather than “retry on any error”)
  • bounded (timeouts, caps, normalization)
  • outage-safe (budgets/backpressure)
  • visible (timelines + observer hooks)
  • composable (hedging and circuit breaking in one coherent model)

If you want the detailed design tradeoffs, examples, and implementation notes: