Messages Are the Only State
The single most important fact in agent engineering: the model is stateless. A 'conversation' is you resending an ever-growing array. Every agent pattern you'll ever build follows from this.
When you chat with Claude or ChatGPT, it feels like the model remembers you. It doesn't. Every single API call is a blank slate. The provider's server receives your request, runs the model over the tokens you sent, returns a completion, and forgets you existed. What creates the illusion of memory is that your code resends the entire conversation history on every call.
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY
messages = [] # <- this list IS the conversation. You own it.
def chat(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a concise engineering assistant.",
        messages=messages,  # full history, every single time
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply
print(chat("My name is Wenming."))
print(chat("What's my name?"))  # works ONLY because we resent turn 1

Drop the history (skip the messages.append bookkeeping and send only the latest turn) and the model instantly 'forgets', because memory never lived on the server. The system prompt rides along outside the array in Anthropic's API; in OpenAI's it's the first message with role: "system" (or "developer" in newer APIs).
The roles
| Role | Who writes it | What it's for |
|---|---|---|
| system | You (the developer) | Standing instructions: persona, rules, tool guidance. Highest-priority steering. |
| user | The human (or your code) | The task and questions; also tool results in Anthropic's flow. |
| assistant | The model | Text replies and tool-call requests. You resend these verbatim. |
| tool / tool_result | You, after executing | The output of a tool the model asked you to run. OpenAI: role: "tool"; Anthropic: a tool_result block inside a user message. |
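
To make the role table concrete, here is a minimal sketch of how the same history and the same tool result might be laid out for each provider. The helper names (`anthropic_payload`, `openai_payload`) and the IDs `toolu_123` / `call_123` are hypothetical placeholders, and only the role-related fields are shown (no `model`, no `max_tokens`). Nothing here calls an API.

```python
# Sketch only: helper names and IDs are hypothetical placeholders.
SYSTEM = "You are a concise engineering assistant."


def anthropic_payload(history: list[dict]) -> dict:
    """Anthropic style: the system prompt is a top-level field, not a message."""
    return {"system": SYSTEM, "messages": list(history)}


def openai_payload(history: list[dict]) -> dict:
    """OpenAI chat style: the system prompt is the first message in the array."""
    return {"messages": [{"role": "system", "content": SYSTEM}, *history]}


# Tool output, Anthropic style: a tool_result block inside a user message.
anthropic_tool_result = {
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": "toolu_123", "content": "72F"}
    ],
}

# Tool output, OpenAI chat style: a dedicated message with role "tool".
openai_tool_result = {"role": "tool", "tool_call_id": "call_123", "content": "72F"}

history = [{"role": "user", "content": "My name is Wenming."}]
print(anthropic_payload(history))
print(openai_payload(history))
```

Either way the array is yours: the provider sees only what these dictionaries contain on that one call.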
Tokens: the currency of everything
Models don't see characters — they see tokens, subword chunks (roughly 3–4 English characters, ~¾ of a word each). Every model has a context window: the maximum tokens of input + output it can handle in one call (hundreds of thousands of tokens on frontier models — check your model's docs, these numbers change). You pay per token, input and output priced separately, and output tokens typically cost several times more.
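
Because input and output are priced separately, the cost of one call is simple arithmetic. The sketch below assumes prices quoted in USD per million tokens; `call_cost_usd` is a hypothetical helper and the two prices are made-up illustration values, not any provider's real rates.

```python
def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Cost of a single call when input and output tokens are priced separately."""
    total = input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out
    return total / 1_000_000


# Hypothetical prices for illustration only; check your provider's pricing page.
PRICE_IN = 3.00    # USD per million input tokens (assumed)
PRICE_OUT = 15.00  # USD per million output tokens (assumed)

cost = call_cost_usd(4_500, 500, PRICE_IN, PRICE_OUT)
print(f"${cost:.4f}")  # prints $0.0210
```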
Here's the trap that surprises everyone: because you resend history every turn, cumulative cost grows quadratically with conversation length. Turn 10 doesn't cost one turn's tokens; it re-processes turns 1–9 as input, plus its own new message. In a 10-turn conversation averaging 500 tokens per turn (user message plus reply), turn 10's call alone resends ~4,500 tokens of history, and the whole session has processed roughly 25,000 cumulative input tokens.
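
The arithmetic can be checked with a few lines. This sketch assumes each turn splits evenly into a 250-token user message and a 250-token reply (an assumption made here to get 500 per turn); `input_tokens_per_call` is a hypothetical helper, not part of any SDK.

```python
def input_tokens_per_call(turns: int, user_tokens: int, assistant_tokens: int) -> list[int]:
    """Input tokens sent on each call when the full history is resent every turn."""
    per_turn = user_tokens + assistant_tokens
    # Call k resends k-1 complete earlier turns plus the new user message.
    return [(k - 1) * per_turn + user_tokens for k in range(1, turns + 1)]


ten = input_tokens_per_call(10, user_tokens=250, assistant_tokens=250)
print(ten[-1])   # 4750: 4,500 tokens of history + the 250-token new message
print(sum(ten))  # 25000 cumulative input tokens over the session

twenty = input_tokens_per_call(20, user_tokens=250, assistant_tokens=250)
print(sum(twenty))  # 100000: twice the turns, four times the input
```

Each individual call grows linearly, but the session total is the sum of those calls, which is why doubling the conversation length quadruples the cumulative input.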
# Anthropic: exact count via the API (free, no generation)
count = client.messages.count_tokens(
model="claude-sonnet-4-5",
system="You are a concise engineering assistant.",
messages=messages,
)
print(count.input_tokens)
# OpenAI: tiktoken locally
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
n = len(enc.encode("How many tokens is this sentence?"))
# Every response also reports usage — log it on EVERY call:
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=messages,
)
print(resp.usage.input_tokens, resp.usage.output_tokens)

Log usage on every call and aggregate per session/user/day. You cannot manage what you don't measure: cost bugs (like accidentally resending a huge document every turn) hide in unlogged usage.
- The model is stateless; conversation = resending the array. Your code owns all state.
- Roles: system (rules), user (task + tool results), assistant (replies + tool calls).
- Cost grows quadratically with turns because history is re-sent as input every call.
- Log usage from every response; count tokens before sending big payloads.
- When history exceeds the window: truncate oldest turns, or summarize them (Module 4 goes deep).
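
As a preview of the last takeaway, here is a minimal sketch of the "truncate oldest turns" option. It assumes plain-text message content and uses a crude 4-characters-per-token estimate; `approx_tokens` and `truncate_oldest` are hypothetical helpers, and in practice you would pass a real counter (for example one built on the token-counting calls shown earlier). It does not handle tool-call/tool-result pairs, which would need to be kept or dropped together.

```python
from typing import Callable


def approx_tokens(message: dict) -> int:
    """Crude estimate (~4 characters per token); an assumption, not an exact count."""
    return max(1, len(str(message["content"])) // 4)


def truncate_oldest(
    messages: list[dict],
    budget: int,
    count: Callable[[dict], int] = approx_tokens,
) -> list[dict]:
    """Keep the newest messages whose estimated tokens fit within `budget`."""
    kept: list[dict] = []
    total = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count(msg)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    kept.reverse()  # restore chronological order
    # Drop a leading assistant message so the history starts on a user turn.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


# Six alternating messages of 40 characters each (about 10 tokens apiece).
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i} " + "x" * 33}
    for i in range(6)
]
trimmed = truncate_oldest(history, budget=35)
print([m["content"][:6] for m in trimmed])  # ['turn 4', 'turn 5']
```

Truncation is cheap but lossy: anything that fell off the front is gone for the model, which is the trade-off that summarization tries to soften.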