Module 1: LLM API Mastery · Lesson 1 of 5 · 25 min

Messages Are the Only State

The single most important fact in agent engineering: the model is stateless. A 'conversation' is you resending an ever-growing array. Every agent pattern you'll ever build follows from this.

When you chat with Claude or ChatGPT, it feels like the model remembers you. It doesn't. Every single API call is a blank slate. The provider's server receives your request, runs the model over the tokens you sent, returns a completion, and forgets you existed. What creates the illusion of memory is that your code resends the entire conversation history on every call.

the entire illusion of conversation
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

messages = []  # <- this list IS the conversation. You own it.

def chat(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a concise engineering assistant.",
        messages=messages,        # full history, every single time
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Wenming."))
print(chat("What's my name?"))   # works ONLY because we resent turn 1
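Change messages=messages to messages=[messages[-1]], so only the latest turn is sent, and the model instantly 'forgets', because memory never lived on the server. (Commenting out only the second messages.append is not enough: the name still travels in the first user message.) The system prompt rides along outside the array in Anthropic's API; in OpenAI's it's the first message with role: "system" (or "developer" in newer APIs).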
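The difference in where the system prompt lives can be sketched as plain list manipulation, with no network call. This is a minimal illustration: `to_openai_messages` is a hypothetical helper name, and the request dictionaries only mirror the payload shapes described above rather than any SDK's full signature.

```python
from typing import Dict, List

Message = Dict[str, str]

def to_openai_messages(system: str, messages: List[Message]) -> List[Message]:
    """Return a new list with the system prompt folded in as message 0."""
    return [{"role": "system", "content": system}] + list(messages)

history = [
    {"role": "user", "content": "My name is Wenming."},
    {"role": "assistant", "content": "Nice to meet you, Wenming."},
    {"role": "user", "content": "What's my name?"},
]
system_prompt = "You are a concise engineering assistant."

# Anthropic-style payload: the system prompt is a separate top-level field.
anthropic_kwargs = {"system": system_prompt, "messages": history}

# OpenAI-style payload: the system prompt is the first element of the array.
openai_kwargs = {"messages": to_openai_messages(system_prompt, history)}

print(len(anthropic_kwargs["messages"]), len(openai_kwargs["messages"]))
```

Either way the history itself is unchanged; only the position of the standing instructions differs.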

The roles

| Role | Who writes it | What it's for |
|---|---|---|
| system | You (the developer) | Standing instructions: persona, rules, tool guidance. Highest-priority steering. |
| user | The human (or your code) | The task and questions; also carries tool results in Anthropic's flow. |
| assistant | The model | Text replies and tool-call requests. You resend these verbatim. |
| tool / tool_result | You, after executing | The output of a tool the model asked you to run. OpenAI: role: "tool"; Anthropic: a tool_result block inside a user message. |
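The two tool-result shapes in the last row can be sketched as message dictionaries. This is a sketch under assumptions: the helper names and the ID strings are hypothetical, and real IDs come from the tool-call request the model returned in its preceding assistant message.

```python
import json

def openai_tool_result(tool_call_id: str, output: dict) -> dict:
    """OpenAI-style: a dedicated message with role 'tool'."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(output),
    }

def anthropic_tool_result(tool_use_id: str, output: dict) -> dict:
    """Anthropic-style: a tool_result block carried inside a user message."""
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tool_use_id,
                "content": json.dumps(output),
            }
        ],
    }

weather = {"city": "Oslo", "temp_c": 4}           # made-up tool output
oa_msg = openai_tool_result("call_123", weather)  # hypothetical ID
an_msg = anthropic_tool_result("toolu_123", weather)  # hypothetical ID
print(oa_msg["role"], an_msg["role"])
```

In both cases the result is appended to the same message array as everything else, right after the assistant message that requested the tool.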
Key insight
Because the model is stateless, an 'agent' is just a program that keeps editing a message array in a loop. Adding memory, compacting context, injecting retrieved documents, resuming a crashed session — all of it is list manipulation. Master the array and the rest of this curriculum is variations on a theme.
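Two of those operations, resuming a session and injecting a retrieved document, can be sketched with nothing but the standard library. The function names and the `<document>` wrapper are illustrative assumptions, not a prescribed format.

```python
import json
import os
import tempfile
from typing import Dict, List

Message = Dict[str, str]

def save_session(path: str, messages: List[Message]) -> None:
    """Persist the whole conversation; the array is the only state there is."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(messages, f)

def load_session(path: str) -> List[Message]:
    """Restore a conversation, for example after a crash."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def user_turn_with_context(question: str, doc_text: str) -> Message:
    """Build a user turn that carries a retrieved document with the question."""
    return {
        "role": "user",
        "content": f"<document>\n{doc_text}\n</document>\n\n{question}",
    }

messages = [
    {"role": "user", "content": "My name is Wenming."},
    {"role": "assistant", "content": "Nice to meet you, Wenming."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.json")
    save_session(path, messages)
    restored = load_session(path)  # "resuming" is just reading the list back

restored.append(user_turn_with_context("What does the policy say?", "Refunds: 30 days."))
print(len(restored))
```

Nothing here touches the model: the next API call simply receives `restored` as its `messages`.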

Tokens: the currency of everything

Models don't see characters — they see tokens, subword chunks (roughly 3–4 English characters, ~¾ of a word each). Every model has a context window: the maximum tokens of input + output it can handle in one call (hundreds of thousands of tokens on frontier models — check your model's docs, these numbers change). You pay per token, input and output priced separately, and output tokens typically cost several times more.
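A rough budget check follows directly from these rules of thumb. The sketch below assumes about 4 characters per token and uses placeholder prices; neither number is a real rate, so substitute your provider's published pricing and a real tokenizer when accuracy matters.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude estimate from character count (English text only)."""
    return math.ceil(len(text) / chars_per_token)

def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one call, with input and output priced separately."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# PLACEHOLDER prices per million tokens, chosen only to illustrate
# that output is typically priced higher than input.
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

prompt = "x" * 400                      # stand-in for 400 characters of text
approx_in = estimate_tokens(prompt)     # about 100 tokens
cost = call_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
print(approx_in, cost)
```

With these placeholder prices, 500 output tokens cost more than 2,000 input tokens, which is why capping `max_tokens` matters as much as trimming the prompt.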

[Figure: the context window as a finite budget shared by system, tools, history, tool results, and the new turn. When usage approaches the limit, compaction replaces old turns with a summary; the system prompt and recent turns survive verbatim, and the freed space is reclaimed budget.]
A growing conversation eats the context window; compaction reclaims budget by summarizing old turns.
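Compaction as pictured can be sketched as one more list operation. This is a minimal sketch, not a definitive implementation: `compact` and `naive_summary` are hypothetical names, and the placeholder summarizer stands in for what would normally be a separate model call. Some APIs are strict about user/assistant alternation, so a real version may need to adjust where the summary is placed.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def compact(messages: List[Message], keep_recent: int,
            summarize: Callable[[List[Message]], str]) -> List[Message]:
    """Replace all but the last `keep_recent` messages with one summary message."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to summarize
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "user",
        "content": "Summary of the earlier conversation:\n" + summarize(old),
    }
    return [summary] + recent  # recent turns survive verbatim

def naive_summary(old: List[Message]) -> str:
    """Placeholder: truncates each message. A real agent would ask a model."""
    return " | ".join(m["content"][:40] for m in old)

history = [
    {"role": "user", "content": "My name is Wenming."},
    {"role": "assistant", "content": "Nice to meet you, Wenming."},
    {"role": "user", "content": "I work on payment systems."},
    {"role": "assistant", "content": "Got it."},
    {"role": "user", "content": "What's my name?"},
    {"role": "assistant", "content": "Wenming."},
]
compacted = compact(history, keep_recent=2, summarize=naive_summary)
print(len(history), len(compacted))
```

The system prompt is untouched because, in Anthropic's API, it lives outside the array being compacted.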

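Here's the trap that surprises everyone: because you resend history every turn, cumulative cost grows quadratically with conversation length. Turn 10 doesn't cost one turn's tokens; it re-processes turns 1–9 as input, plus its own new message. In a 10-turn conversation averaging 500 tokens per turn (user message plus reply), turn 10's call alone resends ~4,500 tokens of history, and the whole session has processed roughly 25,000 cumulative input tokens (between 22,500 and 27,500, depending on how each turn splits between user and assistant text), about five times the ~5,000 tokens of actual conversation.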
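The arithmetic is easy to check by simulating the resend. The sketch below assumes each 500-token turn splits evenly into a 250-token user message and a 250-token reply; that split is an illustrative assumption, not something the API dictates.

```python
from typing import List, Tuple

def cumulative_input_tokens(turns: int, user_tokens: int,
                            assistant_tokens: int) -> Tuple[List[int], int]:
    """Simulate resending full history each turn; return per-call and total input."""
    per_call = []
    history = 0  # tokens already in the array before this turn
    for _ in range(turns):
        call_input = history + user_tokens  # old turns plus the new user message
        per_call.append(call_input)
        history += user_tokens + assistant_tokens  # the reply joins the history
    return per_call, sum(per_call)

per_call, total = cumulative_input_tokens(10, 250, 250)
print(per_call[-1], total)   # 4750 25000

_, total_20 = cumulative_input_tokens(20, 250, 250)
print(total_20)              # 100000
```

Doubling the conversation from 10 to 20 turns quadruples cumulative input, which is the quadratic growth in practice.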

count before you send
# Anthropic: count via the API (free, no generation; treat it as a close estimate)
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a concise engineering assistant.",
    messages=messages,
)
print(count.input_tokens)

# OpenAI: tiktoken locally
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
n = len(enc.encode("How many tokens is this sentence?"))

# Every response also reports usage — log it on EVERY call:
resp = client.messages.create(model="claude-sonnet-4-5",
                              max_tokens=256, messages=messages)
print(resp.usage.input_tokens, resp.usage.output_tokens)
Production agents log usage on every call and aggregate per session/user/day. You cannot manage what you don't measure — cost bugs (like accidentally resending a huge document every turn) hide in unlogged usage.
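Per-session aggregation can be sketched with a small in-memory accumulator. `UsageLog` and the session IDs are hypothetical; a production system would more likely write to a metrics store or database, and the numbers below are made up rather than taken from a real response.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Totals:
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

class UsageLog:
    """Accumulates token usage per session ID."""

    def __init__(self) -> None:
        self._totals = defaultdict(Totals)

    def record(self, session_id: str, input_tokens: int, output_tokens: int) -> Totals:
        totals = self._totals[session_id]
        totals.calls += 1
        totals.input_tokens += input_tokens
        totals.output_tokens += output_tokens
        return totals

    def totals(self, session_id: str) -> Totals:
        return self._totals[session_id]

log = UsageLog()
# In real code, pass the values from each response, e.g.
#   log.record(session_id, resp.usage.input_tokens, resp.usage.output_tokens)
log.record("session-a", 120, 40)
log.record("session-a", 310, 55)
log.record("session-b", 90, 30)
print(log.totals("session-a"))
```

A jump in `input_tokens` between consecutive calls in the same session is the usual first sign that something large is being resent every turn.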
Key takeaways
  • The model is stateless; conversation = resending the array. Your code owns all state.
  • Roles: system (rules), user (task, plus tool results in Anthropic's flow), assistant (replies + tool calls), tool (tool results in OpenAI's flow).
  • Cost grows quadratically with turns because history is re-sent as input every call.
  • Log usage from every response; count tokens before sending big payloads.
  • When history exceeds the window: truncate oldest turns, or summarize them (Module 4 goes deep).