
Retrying LLM API Calls in Production Python (Without Burning Your Budget)

12 min read · Python · LLMs · OpenAI · Anthropic · Retries · Resilience · Operability · Failure handling

A classification-first approach to retrying OpenAI and Anthropic calls: retry logic that respects what actually failed.

A 429 and a 400 are both exceptions. They are not the same problem.

That sounds obvious. Production retry code ignores it all the time.

If you wrap LLM calls in a flat retry decorator, one of three things eventually happens: you retry before a rate limit resets, you keep sleeping on a quota or billing failure that needs human action, or you loop on a request the provider will never accept. The request path looks like HTTP. The failure surface does not.

That is what makes LLM retry logic different from generic "just retry the call" logic. The hard part is not backoff math. The hard part is deciding what kind of failure you are looking at before you decide how to respond.

I build and maintain redress, a policy-driven failure-handling library for Python. It recently shipped redress[openai] and redress[anthropic], and the examples below use those integrations. But the argument here is broader than the library: for LLM APIs, you want classification before response.

Why LLM retries are different

Most retry tools start from a flat model:

  • retry these exception types
  • sleep according to this backoff function
  • stop after N attempts

That model is fine when the failures are operationally similar. It breaks down when one endpoint can fail in several ways that demand different behavior.

A single LLM call can fail because:

  • you are being rate-limited and should wait for the provider's reset window
  • you have exhausted quota or hit a spend limit and need human or billing intervention
  • the provider is overloaded and may recover with bounded retries
  • the network flinched and a fast retry is reasonable
  • the request itself is bad and no retry will ever help

Those failures do not want the same response.

A throttling event wants you to honor Retry-After or provider-specific reset hints. A transport timeout wants a short retry. An overload event may justify bounded retries and maybe a circuit breaker. A prompt that exceeds the model's context window wants an immediate stop. A quota-exhaustion event wants escalation rather than optimism.

You can force a general-purpose retry library to model some of this. But once you are doing it seriously, you are no longer configuring a retry decorator. You are building a classifier and then wiring policy decisions around it.

The failure modes that matter in practice

The names differ a bit by provider, but the operational categories are stable.

1. True rate limits

The provider is telling you to slow down. This is the ordinary 429 case: too many requests, too many tokens, or some other short-horizon throttling window.

This is retryable. But it is retryable on the provider's timeline, not yours. If the server gives you Retry-After or reset headers, those hints should win over your local exponential backoff.

2. Quota or usage exhaustion

This is the subtle one.

At the HTTP layer, quota exhaustion often looks a lot like throttling: another 429. Operationally, it is a different class of problem. Backoff does not create more quota. Sleeping does not fix a billing cap. This is where flat "retry every 429" logic quietly becomes nonsense.

If your application can distinguish short-horizon throttling from out-of-quota conditions, it should.

3. Provider overload

The upstream is unhealthy. Maybe it is a 5xx. Maybe it is a provider-specific overload shape. Either way, this is a server-side retryable failure, not a client bug.

This is where bounded retries help, and where a circuit breaker becomes valuable if the incident is sustained.

4. Transport and connection failures

Connection resets, handshake failures, timeouts, proxy weirdness. These are the classic transient failures. They usually want the fastest retry cadence of the bunch.

5. Permanent request failures

Bad request. Invalid auth. Permission denied. Unknown model. Malformed body. Prompt too large. Content-policy rejection.

These should fail fast.

The expensive case here is the context-window own-goal: a prompt or request payload the API is never going to accept. If your retry layer treats that like a transient, you are paying for a bug in your own request construction.
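One cheap mitigation is a pre-flight size guard before the request ever leaves your process. The sketch below uses a rough chars-per-token heuristic, which is only a sanity check for obviously oversized prompts, not a substitute for the provider's real tokenizer:

```python
def fits_context(prompt: str, context_limit_tokens: int,
                 chars_per_token: float = 4.0) -> bool:
    """Crude pre-flight guard against context-window own-goals.

    The 4-chars-per-token figure is a rough English-text heuristic.
    It catches a prompt that is 10x too large before you pay for the
    rejection; use a real tokenizer when you need accuracy.
    """
    return len(prompt) / chars_per_token <= context_limit_tokens
```

Even a check this coarse turns "retry loop on a request the API will never accept" into "immediate, attributable failure in your own code".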

Three anti-patterns

These show up constantly in production code, including otherwise competent codebases.

Anti-pattern 1: retrying every exception

This is the classic:

from tenacity import retry, retry_if_exception_type

@retry(retry=retry_if_exception_type(Exception))
def call_model(...):
    ...

It looks harmless. It is not.

It turns malformed requests, auth failures, quota exhaustion, content-policy rejections, and genuine transient failures into one undifferentiated bucket. If the call is billable, or triggers downstream work on partial success paths, that mistake can get expensive quickly.

To be fair to tools like Tenacity and backoff: you can build smarter behavior with them. But once you start adding custom predicates, exception inspection, special-case backoff, and provider-specific header parsing, you are re-creating classification logic by hand.

Anti-pattern 2: ignoring server retry hints

A 429 without Retry-After is one thing. A 429 with Retry-After is the server telling you exactly when retry becomes plausible again.

If your code ignores that and just sleeps on its own exponential schedule, you are retrying blind. At best, you waste attempts. At worst, many clients converge on the same naive schedule and you help turn a brief rate limit into a longer incident.

Anti-pattern 3: treating quota exhaustion as a retryable rate limit

This is the operational cousin of anti-pattern 1.

A service that is out of quota should fail loudly and trigger action. A service that quietly backs off and retries for minutes is not being resilient. It is hiding the real problem.

Classification, then response

The alternative is straightforward:

  1. classify the failure
  2. choose the response based on the class

That is the model redress uses.

It keeps the error taxonomy intentionally small:

  • RATE_LIMIT
  • SERVER_ERROR
  • TRANSIENT
  • PERMANENT
  • CONCURRENCY
  • UNKNOWN

That coarseness is deliberate. The goal is not perfect diagnosis. The goal is to separate failures that want different behavior.

For LLM APIs, that buys you exactly the distinctions you care about:

  • throttling backs off slowly and honors provider hints
  • transport failures retry quickly
  • server-side incidents may retry and trip a breaker
  • bad requests, auth failures, and policy rejections stop immediately
  • unknown failures stay tightly bounded

A minimal OpenAI setup looks like this:

import os

from openai import OpenAI
from redress import Policy, Retry
from redress.contrib.openai import openai_aware_backoff, openai_classifier

client = OpenAI(max_retries=0)  # let one layer own retries

policy = Policy(
    retry=Retry(
        classifier=openai_classifier,
        strategy=openai_aware_backoff(max_s=30.0),
        deadline_s=120,
        max_attempts=6,
    ),
)

response = policy.call(
    lambda: client.responses.create(
        model=os.environ["OPENAI_MODEL"],
        input="Say hi",
    ),
    operation="openai.responses.create",
)

A couple of things are doing real work there.

max_retries=0 matters. The OpenAI and Anthropic Python SDKs already retry some classes of failures by default. If you layer a second retry system on top without disabling the built-ins, your attempt counts become muddy, your deadlines stop meaning what you think they mean, and your metrics under-report what actually happened.

openai_classifier matters because it centralizes provider-specific distinctions instead of scattering them across decorators and except blocks.

openai_aware_backoff matters because rate limits and some server errors should listen to server-provided retry hints before falling back to generic jitter.

The Anthropic setup is the same shape:

import os

from anthropic import Anthropic
from redress import Policy, Retry
from redress.contrib.anthropic import anthropic_aware_backoff, anthropic_classifier

client = Anthropic(max_retries=0)

policy = Policy(
    retry=Retry(
        classifier=anthropic_classifier,
        strategy=anthropic_aware_backoff(max_s=30.0),
        deadline_s=120,
        max_attempts=6,
    ),
)

message = policy.call(
    lambda: client.messages.create(
        model=os.environ["ANTHROPIC_MODEL"],
        max_tokens=512,
        messages=[{"role": "user", "content": "Say hi"}],
    ),
    operation="anthropic.messages.create",
)

What this buys you in production

The first win is safer behavior by default.

A context-window error should not keep looping. A permission problem should not sleep and try again. A throttling event should not be treated like a dropped TCP packet.

The second win is observability on the right axis.

Once retries are classified, your metrics and logs can answer questions that matter operationally:

  • are we slow because of rate limits?
  • are we in a provider-side incident?
  • did prompt construction regress and spike permanent failures?
  • are we seeing network noise or genuine upstream trouble?

That is much more useful than a single undifferentiated "retries happened" counter.

The third win is bounded behavior.

max_attempts is useful. deadline_s is just as important. If you have a 30-second handler budget, you need the retry layer to respect that budget instead of drifting past it because the next sleep looked locally reasonable.
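A sketch of what deadline semantics mean mechanically, with injectable clock and sleep so the logic is testable. The names are illustrative, not redress's internals; the load-bearing line is the budget check before the sleep, not after it:

```python
import time

def call_with_deadline(fn, *, deadline_s: float, max_attempts: int, delay_for,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry fn, but never start a sleep that would overrun the budget."""
    start = clock()
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            delay = delay_for(attempt)
            # An attempt cap alone drifts past handler budgets: each sleep
            # looks locally reasonable. Check the wall clock first.
            if clock() - start + delay > deadline_s:
                break
            sleep(delay)
    raise last_exc
```

With a 10-second deadline and 4-second delays, this makes three attempts and stops, regardless of how high max_attempts is set.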

And once you are thinking in policy terms, a circuit breaker fits naturally beside the retry layer. When an upstream is persistently unhealthy, failing fast for a cooldown window is often better than making every request rediscover the outage from scratch.
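The mechanism is small enough to sketch. This is a deliberately minimal breaker (consecutive-failure threshold, cooldown, single half-open probe), not redress's implementation; real breakers usually add failure-rate windows and per-class triggering:

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker is failing fast instead of calling upstream."""

class Breaker:
    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpen("upstream marked unhealthy; failing fast")
            self.opened_at = None        # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()   # open: start the cooldown
            raise
        self.failures = 0                # success closes the circuit
        return result
```

During the cooldown every caller gets an immediate CircuitOpen instead of a slow rediscovery of the same outage.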

Multi-provider fallback without magic

Supporting both OpenAI and Anthropic in the same policy model is useful for a reason beyond feature parity: it makes fallback behavior easier to reason about.

The mistake here is to treat fallback as "catch any exception from provider A and try provider B." That is just the flat-retry mistake in a different costume.

Fallback should be driven by classification too.

If the primary fails with a retryable provider-side condition, fallback may make sense. If it fails with PERMANENT, you probably want to surface the failure instead of forwarding the same bad request to a second provider.

A simple version looks like this:

import os

from anthropic import Anthropic
from openai import OpenAI
from redress import ErrorClass, Policy, Retry
from redress.contrib.anthropic import anthropic_aware_backoff, anthropic_classifier
from redress.contrib.openai import openai_aware_backoff, openai_classifier

FALLBACK_WORTHY = {
    ErrorClass.RATE_LIMIT,
    ErrorClass.SERVER_ERROR,
    ErrorClass.TRANSIENT,
}

openai_client = OpenAI(max_retries=0)
anthropic_client = Anthropic(max_retries=0)

openai_policy = Policy(
    retry=Retry(
        classifier=openai_classifier,
        strategy=openai_aware_backoff(max_s=30.0),
        deadline_s=60.0,
        max_attempts=4,
    ),
)

anthropic_policy = Policy(
    retry=Retry(
        classifier=anthropic_classifier,
        strategy=anthropic_aware_backoff(max_s=30.0),
        deadline_s=60.0,
        max_attempts=4,
    ),
)


def call_with_fallback(prompt: str) -> str:
    try:
        response = openai_policy.call(
            lambda: openai_client.responses.create(
                model=os.environ["OPENAI_MODEL"],
                input=prompt,
            ),
            operation="openai.responses.create",
        )
        return response.output_text
    except Exception as exc:
        if openai_classifier(exc) not in FALLBACK_WORTHY:
            raise

    message = anthropic_policy.call(
        lambda: anthropic_client.messages.create(
            model=os.environ["ANTHROPIC_MODEL"],
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        ),
        operation="anthropic.messages.create",
    )
    return message.content[0].text

Whether RATE_LIMIT belongs in FALLBACK_WORTHY depends on your system: fallback helps when the primary is provider-limited and the secondary has independent headroom, but not when your own burstiness would immediately saturate the secondary too.

This is deliberately not a built-in failover primitive.

Fallback is application policy. Some teams will fall back on rate limits but not on transport errors. Some will fall back on overload but never on unknown failures. Some will avoid fallback for prompts that are already near a second provider's context ceiling. Those are application decisions.

The value of a shared policy layer is that both providers emit comparable signals and obey the same control surface while still preserving provider-specific classification.

A note on streaming

Streaming is the place where no retry abstraction can make the problem disappear.

Once tokens have started flowing, a dropped connection is not the same thing as a failed request initiation. Retrying the whole request may duplicate output, cost, and downstream side effects.

So the right question is not "can the library transparently retry streaming?" It usually cannot, at least not honestly.

The safe pattern is to make the retried unit explicit:

  • retry request initiation if that is useful
  • treat mid-stream failure as an application concern
  • be explicit about restart, resume, deduplication, or user-visible partial output
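A sketch of that separation, with a hypothetical open_stream callable standing in for whatever starts the provider stream. Initiation failures retry; once chunks have flowed, failure is reported to the application with the partial output instead of being silently re-run:

```python
def consume_stream(open_stream, *, max_init_attempts: int = 3):
    """Retry only stream *initiation*; mid-stream failure is the caller's call.

    open_stream is a hypothetical zero-argument callable that either raises
    (initiation failed) or returns an iterable of chunks.
    """
    for attempt in range(max_init_attempts):
        try:
            stream = open_stream()          # the retried unit: request initiation
            break
        except ConnectionError:
            if attempt == max_init_attempts - 1:
                raise
    chunks = []
    try:
        for chunk in stream:
            chunks.append(chunk)            # tokens flowing: cost and side effects begin
    except ConnectionError as exc:
        # Mid-stream drop: do NOT transparently re-run the whole request.
        # Surface the partial output and let the application decide.
        return {"complete": False, "partial": chunks, "error": exc}
    return {"complete": True, "partial": chunks, "error": None}
```

The caller then owns the restart-versus-resume-versus-surface decision, which is where that decision belongs.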

That is not a limitation of redress specifically. It is the nature of streaming APIs.

When the SDK built-ins are enough

If you are making one low-stakes call in a script, the SDK defaults are fine. The OpenAI and Anthropic Python SDKs both retry some connection, rate-limit, timeout, conflict, and 5xx failures automatically. You do not need a separate policy library for a throwaway script.

The threshold where a dedicated layer starts paying for itself is when one or more of these become true:

  • you care about the difference between throttling, overload, transport, and permanent failures
  • you want deadline semantics, not just attempt caps
  • you want a circuit breaker
  • you run more than one provider and want one failure model across them
  • you want your telemetry to explain why retries happened and why they stopped

That is the line. Below it, keep it simple. Above it, flat retries become hard to justify.

Where to go next

The point of the LLM integrations is not that LLMs are uniquely magical. It is the opposite.

They are another place where production systems need failure handling to be explicit, bounded, and observable. They just happen to make the limits of flat retry logic unusually obvious.

If any of the three anti-patterns above described your current retry layer, the integrations are designed to make those mistakes impossible by construction.


Install with pip install "redress[openai]" or pip install "redress[anthropic]" (the quotes keep the extras brackets safe from shell globbing).