Module 1: LLM API Mastery · Lesson 5 of 5 · 25 min

Errors, Rate Limits & Cost Control

An agent lives or dies on the unhappy path. Rate limits, timeouts, overloaded servers, refusals, context overflows — production behavior is defined by how you handle these.

The failure taxonomy

Error	Meaning	Correct response
`429 rate_limit`	Too many requests/tokens per minute (RPM and TPM are separate buckets)	Exponential backoff with jitter; honor `retry-after` header if present
`529 / 503 overloaded`	Provider-side congestion	Same backoff; consider a fallback model
`400 invalid_request`	Your bug: malformed messages, bad tool pairing, context overflow	Don't retry — fix the request. Retrying a 400 is a infinite loop.
`401 / 403`	Auth problem	Don't retry; alert loudly
Timeout / connection error	Network or a very long generation	Retry with backoff; set explicit client timeouts
Refusal / empty content	Model declined the task	Detect (check `stop_reason`/content), rephrase or escalate to a human — don't loop blindly

backoff with jitter — the canonical implementation

import random, time
import anthropic

RETRYABLE = (anthropic.RateLimitError, anthropic.APIStatusError,
             anthropic.APIConnectionError, anthropic.APITimeoutError)

def call_with_retries(fn, max_retries: int = 3, base: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except anthropic.BadRequestError:
            raise                       # 400 = your bug. Never retry.
        except RETRYABLE as e:
            if attempt == max_retries:
                raise
            # exponential: 1s, 2s, 4s… + full jitter to avoid thundering herd
            delay = base * (2 ** attempt) * (0.5 + random.random())
            print(f"retryable error ({type(e).__name__}), "
                  f"sleeping {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

Jitter matters: if 50 workers all fail at once and all retry after exactly 2 seconds, you've synchronized a second stampede. Randomizing the delay decorrelates them. The official SDKs have built-in retries — but agents need their own layer with logging, budgets, and per-tool policies.

Prompt caching — the agent cost lever

Agents resend a large, mostly-identical prefix every turn: system prompt, tool schemas, early conversation. Prompt caching lets the provider reuse the processed prefix — cached input tokens cost a fraction of fresh ones (Anthropic: cache reads are ~90% cheaper; writes cost a small premium) and process faster. For a 20-turn agent session whose prefix dominates, caching routinely cuts input cost by 70–90%.

cache breakpoint after the stable prefix (Anthropic)

resp = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,          # stable across turns
        "cache_control": {"type": "ephemeral"},   # <- cache up to here
    }],
    tools=TOOLS,                             # stable too — order matters
    messages=messages,
)
print(resp.usage.cache_read_input_tokens,    # cheap
      resp.usage.cache_creation_input_tokens)  # small premium, first call

Caching keys on an exact prefix match — reorder your tools or edit one system-prompt character and the cache misses. Structure requests as: stable stuff first (system, tools), volatile stuff last (messages). OpenAI applies prefix caching automatically on long prompts.

⚠ Cost discipline from day one

Every serious agent tracks: tokens in/out per call, cumulative per session, and estimated dollars (pull current per-MTok prices from your provider's pricing page — they change; don't hardcode from memory). Set a hard budget per session and stop the loop when it's exceeded. A bug that loops tool calls at 3 a.m. should exhaust a $2 budget, not your credit card.

Key takeaways

▸Retry 429/5xx/timeouts with exponential backoff + jitter; never retry 400s.
▸RPM and TPM are separate rate-limit buckets — big prompts can throttle you at low request rates.
▸Prompt caching: stable prefix first, cache breakpoint after it — up to ~90% cheaper input.
▸Log usage on every call; enforce a per-session dollar budget in the loop itself.
▸Detect refusals and truncation via stop_reason — silent failures poison everything downstream.