Module 1: LLM API Mastery · Lesson 5 of 5 · 25 min
Errors, Rate Limits & Cost Control
An agent lives or dies on the unhappy path. Rate limits, timeouts, overloaded servers, refusals, context overflows — production behavior is defined by how you handle these.
The failure taxonomy
| Error | Meaning | Correct response |
|---|---|---|
| 429 rate_limit | Too many requests/tokens per minute (RPM and TPM are separate buckets) | Exponential backoff with jitter; honor the retry-after header if present |
| 529 / 503 overloaded | Provider-side congestion | Same backoff; consider a fallback model |
| 400 invalid_request | Your bug: malformed messages, bad tool pairing, context overflow | Don't retry; fix the request. Retrying a 400 is an infinite loop. |
| 401 / 403 | Auth problem | Don't retry; alert loudly |
| Timeout / connection error | Network or a very long generation | Retry with backoff; set explicit client timeouts |
| Refusal / empty content | Model declined the task | Detect (check stop_reason/content), rephrase or escalate to a human — don't loop blindly |
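The last row of the table is the easiest to miss, because a refusal or a truncated answer arrives as a successful HTTP response. A minimal sketch of the check, assuming the response exposes a `stop_reason` string and that the values `"refusal"` and `"max_tokens"` mark a declined task and a cut-off generation respectively (verify the exact values against your provider's documentation). `classify_response` is a hypothetical helper, not part of any SDK.

```python
def classify_response(stop_reason: str, text: str) -> str:
    """Label a completed call so the agent loop can react instead of looping blindly.

    stop_reason values are assumptions; confirm them for your provider and model.
    """
    if stop_reason == "refusal":
        return "refused"      # rephrase or escalate to a human
    if stop_reason == "max_tokens":
        return "truncated"    # output was cut off; raise the limit or continue
    if not text.strip():
        return "empty"        # nothing usable came back
    return "ok"


print(classify_response("end_turn", "Here is the summary."))
print(classify_response("max_tokens", "Here is the sum"))
```

Anything other than `"ok"` should be logged and handled explicitly before the result is passed to the next step.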
```python
import random
import time

import anthropic

# APIStatusError covers every HTTP error status, so the status code is
# checked below before retrying. Connection errors and timeouts carry no status.
RETRYABLE = (anthropic.RateLimitError, anthropic.APIStatusError,
             anthropic.APIConnectionError, anthropic.APITimeoutError)


def call_with_retries(fn, max_retries: int = 3, base: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except anthropic.BadRequestError:
            raise  # 400 = your bug. Never retry.
        except RETRYABLE as e:
            # Retry only 429, 5xx (including 529), and network failures;
            # 401/403 and other 4xx are re-raised immediately.
            status = getattr(e, "status_code", None)
            retryable = status is None or status == 429 or status >= 500
            if not retryable or attempt == max_retries:
                raise
            # exponential: 1s, 2s, 4s… scaled by a random factor in [0.5, 1.5)
            # so workers don't retry in lockstep (thundering herd)
            delay = base * (2 ** attempt) * (0.5 + random.random())
            print(f"retryable error ({type(e).__name__}), "
                  f"sleeping {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)
```

Jitter matters: if 50 workers all fail at once and all retry after exactly 2 seconds, you've synchronized a second stampede. Randomizing the delay decorrelates them. The official SDKs have built-in retries, but agents need their own layer with logging, budgets, and per-tool policies.
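The table also says to honor the retry-after header when the server sends one. A minimal sketch of a delay function that does so, assuming the header value has already been extracted as a string and is expressed in seconds (the HTTP-date form is not parsed here and falls back to backoff). `backoff_delay`, its `cap` parameter, and the injectable `rng` are illustrative names, not SDK features. When no hint is available, it uses full jitter: a uniform draw between zero and the capped exponential delay.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  retry_after: Optional[str] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to sleep before the next retry (attempt counts from 0)."""
    if retry_after is not None:
        try:
            # The server's hint wins over our own schedule.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a number of seconds (e.g. an HTTP date); fall back
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    return rng() * ceiling                     # full jitter: uniform in [0, ceiling)


print(backoff_delay(attempt=2))                    # somewhere in [0, 4)
print(backoff_delay(attempt=2, retry_after="7"))   # the server asked for 7 seconds
```

Passing `rng` explicitly makes the schedule deterministic in tests; the cap keeps late attempts from sleeping for minutes.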
Prompt caching — the agent cost lever
Agents resend a large, mostly-identical prefix every turn: system prompt, tool schemas, early conversation. Prompt caching lets the provider reuse the processed prefix — cached input tokens cost a fraction of fresh ones (Anthropic: cache reads are ~90% cheaper; writes cost a small premium) and process faster. For a 20-turn agent session whose prefix dominates, caching routinely cuts input cost by 70–90%.
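A back-of-the-envelope sketch of where a figure in that range comes from. It counts only the stable prefix, assumes the cache stays warm between turns, and uses illustrative multipliers of 1.25x for the one cache write and 0.10x for each cache read. Those multipliers are assumptions for the arithmetic; take the real ones from your provider's pricing page.

```python
def cached_prefix_cost_ratio(turns: int, write_mult: float = 1.25,
                             read_mult: float = 0.10) -> float:
    """Prefix cost with caching, as a fraction of the uncached prefix cost.

    The unit is "one uncached pass over the prefix". Both multipliers are
    illustrative assumptions, not quoted prices.
    """
    uncached = turns * 1.0                         # full price on every turn
    cached = write_mult + (turns - 1) * read_mult  # one write, then reads
    return cached / uncached


ratio = cached_prefix_cost_ratio(turns=20)
print(f"prefix cost with caching: {ratio:.1%} of uncached")
```

Under these assumptions a 20-turn session pays (1.25 + 19 × 0.10) / 20 ≈ 0.16 of the uncached prefix cost, a saving of roughly 84%. The saving on the whole bill is smaller to the extent that the uncached conversation tail grows.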
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LONG_SYSTEM_PROMPT, TOOLS, and messages are assumed to be defined elsewhere.
resp = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,              # stable across turns
        "cache_control": {"type": "ephemeral"},  # <- cache up to here
    }],
    tools=TOOLS,        # stable too; order matters
    messages=messages,
)
print(resp.usage.cache_read_input_tokens,      # cheap
      resp.usage.cache_creation_input_tokens)  # small premium, first call
```

Caching keys on an exact prefix match: reorder your tools or edit one system-prompt character and the cache misses. Structure requests with stable stuff first (system, tools) and volatile stuff last (messages). OpenAI applies prefix caching automatically on long prompts.
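Because a single changed character or a reordered tool silently turns cache reads into cache writes, it can help to log a fingerprint of the stable prefix on every call and alert when it changes mid-session. The sketch below is a local diagnostic only: `prefix_fingerprint` is a hypothetical helper, and the hash is not how any provider computes its cache key.

```python
import hashlib
import json


def prefix_fingerprint(system: str, tools: list) -> str:
    """Short hash of the stable prefix, for spotting accidental cache busting."""
    # No sort_keys and no list sorting: ordering is part of what the cache
    # matches on, so a reordering must change the fingerprint too.
    blob = json.dumps({"tools": tools, "system": system}, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


tools = [{"name": "search"}, {"name": "calculator"}]
same = prefix_fingerprint("You are a helpful agent.", tools)
reordered = prefix_fingerprint("You are a helpful agent.", list(reversed(tools)))
edited = prefix_fingerprint("You are a helpful agent!", tools)
print(same != reordered, same != edited)
```

If the fingerprint changes between two turns of the same session, expect `cache_creation_input_tokens` to jump on the next call.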
⚠ Cost discipline from day one
Every serious agent tracks: tokens in/out per call, cumulative per session, and estimated dollars (pull current per-MTok prices from your provider's pricing page — they change; don't hardcode from memory). Set a hard budget per session and stop the loop when it's exceeded. A bug that loops tool calls at 3 a.m. should exhaust a $2 budget, not your credit card.
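A minimal sketch of such a budget, meant to be called once per model response from inside the agent loop. `SessionBudget` and `BudgetExceeded` are hypothetical names, and the per-MTok prices in the usage lines are placeholders for illustration, not real prices.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a session's estimated spend passes its hard limit."""


@dataclass
class SessionBudget:
    limit_usd: float
    usd_per_mtok_in: float    # supply from the provider's current pricing page
    usd_per_mtok_out: float
    tokens_in: int = 0
    tokens_out: int = 0

    @property
    def spent_usd(self) -> float:
        return (self.tokens_in * self.usd_per_mtok_in
                + self.tokens_out * self.usd_per_mtok_out) / 1_000_000

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one call's usage; raise once the cumulative estimate exceeds the limit."""
        self.tokens_in += input_tokens
        self.tokens_out += output_tokens
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f}")


# Placeholder prices, for illustration only.
budget = SessionBudget(limit_usd=2.0, usd_per_mtok_in=1.0, usd_per_mtok_out=10.0)
budget.record(input_tokens=500_000, output_tokens=100_000)  # $1.50 so far
try:
    budget.record(input_tokens=0, output_tokens=100_000)    # would reach $2.50
except BudgetExceeded as err:
    print("stopping loop:", err)
```

Raising an exception, instead of returning a flag, means a loop that forgets to check still stops.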
Key takeaways
- ▸Retry 429/5xx/timeouts with exponential backoff + jitter; never retry 400s.
- ▸RPM and TPM are separate rate-limit buckets — big prompts can throttle you at low request rates.
- ▸Prompt caching: stable prefix first, cache breakpoint after it — up to ~90% cheaper input.
- ▸Log usage on every call; enforce a per-session dollar budget in the loop itself.
- ▸Detect refusals and truncation via stop_reason; silent failures poison everything downstream.