Module 1: LLM API Mastery · Lesson 5 of 5 · 25 min

Errors, Rate Limits & Cost Control

An agent lives or dies on the unhappy path. Rate limits, timeouts, overloaded servers, refusals, context overflows — production behavior is defined by how you handle these.

The failure taxonomy

| Error | Meaning | Correct response |
| --- | --- | --- |
| 429 rate_limit | Too many requests/tokens per minute (RPM and TPM are separate buckets) | Exponential backoff with jitter; honor the retry-after header if present |
| 529 / 503 overloaded | Provider-side congestion | Same backoff; consider a fallback model |
| 400 invalid_request | Your bug: malformed messages, bad tool pairing, context overflow | Don't retry; fix the request. Retrying a 400 is an infinite loop. |
| 401 / 403 | Auth problem | Don't retry; alert loudly |
| Timeout / connection error | Network or a very long generation | Retry with backoff; set explicit client timeouts |
| Refusal / empty content | Model declined the task | Detect (check stop_reason/content), then rephrase or escalate to a human; don't loop blindly |
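
The last row of the taxonomy (refusals and empty content) never raises an exception, so the agent loop has to inspect the response itself. Here is a minimal sketch of that check. The function name `next_action` and its return labels are hypothetical, and the stop_reason strings are assumed to follow Anthropic's Messages API (`end_turn`, `max_tokens`, `tool_use`, `refusal`); check your provider's documentation for the exact values.

```python
def next_action(stop_reason: str, text: str) -> str:
    """Map one model response to a label the agent loop can branch on.

    Labels are illustrative: "escalate", "truncated", "run_tools",
    "empty", and "ok".
    """
    if stop_reason == "refusal":
        return "escalate"      # model declined: rephrase once or hand to a human
    if stop_reason == "max_tokens":
        return "truncated"     # output was cut off: don't treat it as complete
    if stop_reason == "tool_use":
        return "run_tools"     # normal agent step, text may legitimately be empty
    if not text.strip():
        return "empty"         # finished "successfully" but said nothing
    return "ok"


if __name__ == "__main__":
    for reason, text in [("end_turn", "Done."), ("end_turn", "  "),
                         ("max_tokens", "The answer is"), ("refusal", "")]:
        print(reason, repr(text), "->", next_action(reason, text))
```

Branching on a small set of labels keeps the "don't loop blindly" rule enforceable: the loop can cap how many times it rephrases before it escalates.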
backoff with jitter — the canonical implementation
import random, time
import anthropic

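def is_retryable(e: Exception) -> bool:
    # Connection errors and timeouts (APITimeoutError subclasses APIConnectionError).
    if isinstance(e, anthropic.APIConnectionError):
        return True
    # APIStatusError is the base class for every HTTP error status, so catching
    # it wholesale would also retry 401/403. Check the code instead:
    # retry 429 and 5xx (including 529), never 400/401/403.
    if isinstance(e, anthropic.APIStatusError):
        return e.status_code == 429 or e.status_code >= 500
    return False

def call_with_retries(fn, max_retries: int = 3, base: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except anthropic.APIError as e:
            if not is_retryable(e) or attempt == max_retries:
                raise                   # your bug, an auth problem, or out of attempts
            # exponential: 1s, 2s, 4s... scaled by a random factor in [0.5, 1.5)
            # so workers don't retry in lockstep (thundering herd)
            delay = base * (2 ** attempt) * (0.5 + random.random())
            print(f"retryable error ({type(e).__name__}), "
                  f"sleeping {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)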
Jitter matters: if 50 workers all fail at once and all retry after exactly 2 seconds, you've synchronized a second stampede. Randomizing the delay decorrelates them. The official SDKs have built-in retries — but agents need their own layer with logging, budgets, and per-tool policies.
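
The taxonomy also says to honor the retry-after header when the provider sends one. One way to fold that into the delay calculation is sketched below. `retry_delay` and its `cap` parameter are hypothetical names, the header is assumed to carry a number of seconds (the HTTP-date form falls through to plain backoff), and the fallback uses "full jitter", meaning a uniform draw between zero and the capped exponential delay.

```python
import math
import random


def retry_delay(attempt: int, headers=None, base: float = 1.0,
                cap: float = 30.0) -> float:
    """Seconds to sleep before the next try; `attempt` is 0-based."""
    raw = (headers or {}).get("retry-after")
    if raw is not None:
        try:
            seconds = float(raw)
        except ValueError:
            seconds = None          # HTTP-date or garbage: ignore the header
        if seconds is not None and math.isfinite(seconds):
            # Trust the server's hint, but never sleep longer than the cap.
            return min(max(seconds, 0.0), cap)
    # No usable hint: exponential backoff with full jitter.
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, backoff)


if __name__ == "__main__":
    print(retry_delay(0, {"retry-after": "7"}))   # server-directed wait
    print(retry_delay(3))                          # random value in [0, 8]
```

With the Anthropic SDK the headers would typically come from `e.response.headers` on a status error; treat that wiring as an assumption to verify. If you wrap a client that already retries internally, consider turning its retries off (the Anthropic Python client accepts a `max_retries` argument) so the two layers don't multiply attempts.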

Prompt caching — the agent cost lever

Agents resend a large, mostly-identical prefix every turn: system prompt, tool schemas, early conversation. Prompt caching lets the provider reuse the processed prefix — cached input tokens cost a fraction of fresh ones (Anthropic: cache reads are ~90% cheaper; writes cost a small premium) and process faster. For a 20-turn agent session whose prefix dominates, caching routinely cuts input cost by 70–90%.
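
To see where a figure in that range can come from, here is a back-of-the-envelope model. It is a sketch under simplifying assumptions: every turn resends the same stable prefix plus a fixed number of fresh tokens, the cache stays warm for the whole session, cache reads bill at 0.10x and the first write at 1.25x the base input price, and the $3 per MTok price is a placeholder rather than a quoted rate. The function name is hypothetical.

```python
def session_input_cost(turns: int, prefix_tokens: int, volatile_tokens: int,
                       price_per_mtok: float, cached: bool,
                       read_mult: float = 0.10, write_mult: float = 1.25) -> float:
    """Estimated input cost in dollars for one agent session."""
    if turns <= 0:
        return 0.0
    if not cached:
        billable = turns * (prefix_tokens + volatile_tokens)
    else:
        billable = (prefix_tokens * write_mult                   # turn 1 writes the cache
                    + (turns - 1) * prefix_tokens * read_mult    # later turns read it
                    + turns * volatile_tokens)                   # fresh tokens, full price
    return billable * price_per_mtok / 1_000_000


if __name__ == "__main__":
    plain = session_input_cost(20, 10_000, 500, 3.0, cached=False)
    warm = session_input_cost(20, 10_000, 500, 3.0, cached=True)
    print(f"uncached ${plain:.4f}, cached ${warm:.4f}, "
          f"saved {1 - warm / plain:.1%}")
```

Under these assumptions a 20-turn session with a 10,000-token prefix and 500 fresh tokens per turn saves roughly 80% of input cost. The saving shrinks as the volatile share of each request grows, which is why the prefix has to dominate for caching to pay off.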

cache breakpoint after the stable prefix (Anthropic)
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

# LONG_SYSTEM_PROMPT, TOOLS, and messages are defined elsewhere in your agent
resp = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,          # stable across turns
        "cache_control": {"type": "ephemeral"},   # <- cache up to here
    }],
    tools=TOOLS,                             # stable too; order matters
    messages=messages,
)
print(resp.usage.cache_read_input_tokens,    # cheap
      resp.usage.cache_creation_input_tokens)  # small premium, first call
Caching keys on an exact prefix match — reorder your tools or edit one system-prompt character and the cache misses. Structure requests as: stable stuff first (system, tools), volatile stuff last (messages). OpenAI applies prefix caching automatically on long prompts.
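
Because a single changed character or a reordered tool breaks the match, it can help to fingerprint the stable prefix locally and log it every turn. The sketch below is one way to do that. `prefix_fingerprint` is a hypothetical helper, and the hash is only a local proxy for "did my prefix change"; it is not the provider's actual cache key.

```python
import hashlib
import json


def prefix_fingerprint(system_prompt: str, tools: list) -> str:
    """Short hash of the stable prefix, sensitive to order and content."""
    # Deliberately no sort_keys: list order and dict insertion order are
    # kept as-is, so any reordering shows up as a different fingerprint.
    payload = json.dumps({"tools": tools, "system": system_prompt},
                         separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    tools = [{"name": "search"}, {"name": "calculator"}]
    first = prefix_fingerprint("You are a careful agent.", tools)
    same = prefix_fingerprint("You are a careful agent.", list(tools))
    swapped = prefix_fingerprint("You are a careful agent.", tools[::-1])
    print(first == same)      # identical prefix, identical fingerprint
    print(first == swapped)   # reordered tools, different fingerprint
```

If the fingerprint changes between two turns of the same session, a cache miss on the next call is the likely result, and the log line tells you which turn introduced the change.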

Cost discipline from day one

Every serious agent tracks: tokens in/out per call, cumulative per session, and estimated dollars (pull current per-MTok prices from your provider's pricing page — they change; don't hardcode from memory). Set a hard budget per session and stop the loop when it's exceeded. A bug that loops tool calls at 3 a.m. should exhaust a $2 budget, not your credit card.
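
A hard per-session budget can be as small as a counter that raises once the estimate crosses the limit. The following is a sketch, not a billing-accurate meter: `SessionBudget` and `BudgetExceeded` are hypothetical names, the per-MTok prices are placeholders to be replaced with current values from your provider's pricing page, and cached tokens (which bill at different rates) are ignored for simplicity.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a session's estimated spend passes its limit."""


@dataclass
class SessionBudget:
    limit_usd: float
    input_price_per_mtok: float     # placeholder: look up the current price
    output_price_per_mtok: float    # placeholder: look up the current price
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def spent_usd(self) -> float:
        return (self.input_tokens * self.input_price_per_mtok
                + self.output_tokens * self.output_price_per_mtok) / 1_000_000

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one call's usage; raise if the session is now over budget."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f}")


if __name__ == "__main__":
    budget = SessionBudget(limit_usd=2.0, input_price_per_mtok=3.0,
                           output_price_per_mtok=15.0)
    try:
        while True:                       # stand-in for a runaway agent loop
            budget.record(100_000, 10_000)
    except BudgetExceeded as e:
        print("stopping loop:", e)
```

In a real loop you would call `record` with the input and output token counts from each response's usage object, right after the call returns, so the check runs inside the loop rather than in a dashboard you read the next morning.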

Key takeaways

  • Retry 429/5xx/timeouts with exponential backoff + jitter; never retry 400s.
  • RPM and TPM are separate rate-limit buckets — big prompts can throttle you at low request rates.
  • Prompt caching: stable prefix first, cache breakpoint after it — up to ~90% cheaper input.
  • Log usage on every call; enforce a per-session dollar budget in the loop itself.
  • Detect refusals and truncation via stop_reason — silent failures poison everything downstream.