Module 1: LLM API Mastery · Lesson 3 of 5 · 30 min

Tool Calling End-to-End

The mechanism that turns a text generator into something that can act. Crucial mental model: the model never executes anything — it emits structured JSON, and your code does the work.

Key insight
Tool calling is just structured output plus a convention. The model generates JSON that matches a schema you provided; you run the corresponding function; you append the result to the messages; the model continues. The model has no network access, no filesystem, no side effects — you are its hands.
Diagram (Your app ↔ LLM API):
  Your app → LLM API: messages + tool schemas
  LLM API → Your app: stop_reason: "tool_use" → get_weather({city: "Tokyo"})
  Your app → LLM API: tool_result: {"temp": 21, "sky": "clear"}
  LLM API → Your app: "It's 21°C and clear in Tokyo."
One complete tool-use round trip: schemas in, tool_use out, tool_result in, final answer out.
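
To make the round trip concrete, here is a sketch of what the conversation state might look like after one exchange, written as plain Python dicts in Anthropic's block format. The id "toolu_01" and the user question are placeholders invented for illustration; real ids are generated by the API.

```python
# Conversation state after one tool-use round trip (Anthropic-style blocks).
# The id "toolu_01" is a made-up placeholder; real ids come from the API.
messages = [
    # 1) Your app sends the user's question (tool schemas travel separately).
    {"role": "user", "content": "What's the weather in Tokyo?"},
    # 2) The model replies with a tool_use block: a request, not an action.
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
         "input": {"city": "Tokyo"}},
    ]},
    # 3) Your code ran the function and reports the output under the same id.
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": '{"temp": 21, "sky": "clear"}'},
    ]},
]

request = messages[1]["content"][0]
result = messages[2]["content"][0]
# The result is tied to the request only through the shared id.
print(request["id"] == result["tool_use_id"])
```

Nothing in this list was executed by the model: the only thing it contributed is the second entry, and the third entry exists because your code ran the function.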

The four-step dance

  1. You send messages plus tools: each tool has a name, description, and a JSON-schema input_schema for its parameters.
  2. Model decides it needs a tool: the response contains a tool_use block (Anthropic) or a tool_calls array (OpenAI) with the tool name, generated arguments, and a unique id. Anthropic sets stop_reason to "tool_use"; OpenAI sets finish_reason to "tool_calls".
  3. You execute the actual function with those arguments, then append (a) the assistant message verbatim, and (b) a tool_result referencing the same id, with the output as a string.
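  4. Model continues: it may answer, or request another tool. Keep looping while stop_reason is "tool_use"; a normal finish reports "end_turn", and any other value (such as "max_tokens") also ends the loop.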
complete working tool loop (Anthropic, raw SDK)
import json
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "get_weather",
    "description": (
        "Get current weather for a city. Use whenever the user asks about "
        "weather, temperature, or outdoor conditions. Returns Celsius."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
        },
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21, "sky": "clear"})  # stub

messages = [{"role": "user", "content": "Should I bike to work in Tokyo today?"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        tools=TOOLS, messages=messages,
    )
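    if resp.stop_reason != "tool_use":
        # final answer: join all text blocks (content[0] is not guaranteed to be text)
        print("".join(b.text for b in resp.content if b.type == "text"))
        break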

    # 1) append the assistant turn EXACTLY as returned
    messages.append({"role": "assistant", "content": resp.content})

    # 2) run every requested tool, append results
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            output = get_weather(**block.input)      # your code acts
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,             # must match!
                "content": output,
            })
    messages.append({"role": "user", "content": results})
Two invariants trip everyone up: the assistant message containing tool_use must be resent verbatim, and every tool_result must reference a real tool_use_id from the immediately preceding assistant turn. Return a result for a tool that was never called (or drop one that was) and the API rejects the request with a 400 — the strict pairing is how the model keeps causality straight.
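
One way to catch pairing mistakes before the API does is a small pre-flight check over the message list. The sketch below is an illustration, not part of any SDK: `check_tool_pairing` and `_ids` are hypothetical helper names, and the code assumes messages whose content is either a string or a list of plain dicts (SDK block objects would need converting first).

```python
def _ids(msg, role, block_type, key):
    """Collect ids from blocks of one type, if msg has the expected role."""
    if msg is None or msg["role"] != role or isinstance(msg["content"], str):
        return set()
    return {b[key] for b in msg["content"] if b.get("type") == block_type}


def check_tool_pairing(messages):
    """Return a list of tool_use / tool_result pairing problems (empty if none)."""
    problems = []
    # Walk adjacent (previous, current) turns, plus one final step with no
    # current turn so that a trailing unanswered tool_use is also reported.
    for i in range(len(messages) + 1):
        prev = messages[i - 1] if i > 0 else None
        cur = messages[i] if i < len(messages) else None
        requested = _ids(prev, "assistant", "tool_use", "id")
        answered = _ids(cur, "user", "tool_result", "tool_use_id")
        for tool_id in sorted(requested - answered):
            problems.append(f"tool_use {tool_id} in message {i - 1} has no tool_result")
        for tool_id in sorted(answered - requested):
            problems.append(f"tool_result {tool_id} in message {i} matches no tool_use")
    return problems


good = [
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
         "input": {"city": "Tokyo"}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01", "content": "{}"}]},
]
print(check_tool_pairing(good))
```

Running the check just before each request, after tool results have been appended, gives a readable message instead of a 400 from the API.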

OpenAI's shape, for comparison

same dance, different field names
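import json
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "user", "content": "Should I bike to work in Tokyo today?"}]

resp = openai_client.chat.completions.create(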
    model="gpt-4o",
    messages=messages,
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {          # 'parameters', not 'input_schema'
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
msg = resp.choices[0].message
if msg.tool_calls:
    messages.append(msg)             # assistant turn, verbatim
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)   # arrives as a STRING
        messages.append({
            "role": "tool",                        # dedicated role
            "tool_call_id": tc.id,
            "content": get_weather(**args),
        })
Key differences: OpenAI nests schemas under function.parameters, arguments arrive as a JSON string you must parse (and which can be malformed — validate!), and results use a dedicated tool role rather than a block inside a user message. The concepts are identical; only the plumbing differs.
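
Because the arguments arrive as a model-generated string, it may help to parse and validate them in one place and turn failures into tool output the model can read. The following is a minimal sketch under that assumption; `call_get_weather` is a hypothetical wrapper, and the error-message format is an arbitrary choice rather than anything the API requires.

```python
import json


def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21, "sky": "clear"})  # stub


def call_get_weather(raw_arguments: str) -> str:
    """Parse and validate model-generated arguments, then call the tool.

    On bad input, return an error description as the tool output
    instead of raising, so the loop can pass it back to the model.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"arguments were not valid JSON: {exc.msg}"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    city = args.get("city")
    if not isinstance(city, str) or not city.strip():
        return json.dumps({"error": "'city' must be a non-empty string"})
    return get_weather(city=city)


print(call_get_weather('{"city": "Tokyo"}'))   # valid arguments
print(call_get_weather('{"city": '))           # truncated JSON
```

In the OpenAI loop, the return value of such a wrapper would take the place of `get_weather(**args)` as the `content` of the `tool` message.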
Tool descriptions are prompts
The model chooses tools by reading their names, descriptions, and parameter schemas; it never sees your implementation. A bad description ("weather tool") yields wrong tool choices and garbage arguments. A good one says what the tool does, when to use it, what it returns, and its units and limits. Anthropic's own guidance calls extremely detailed descriptions by far the most important factor in tool performance.
Key takeaways
  • The model requests; your code executes. All side effects are yours.
  • Loop on stop_reason == "tool_use"; resend assistant turns verbatim; match tool_use_id exactly.
  • Multiple tool calls can arrive in one turn — answer all of them.
  • Tool errors go back as tool_result content (with is_error: true on Anthropic) so the model can recover.
  • Invest in tool descriptions like you invest in prompts — they are prompts.
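
The takeaways about answering every call and reporting errors can be combined in one helper. This is a sketch, not the SDK's own mechanism: `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, and `SimpleNamespace` objects stand in for the content blocks an Anthropic response would carry. The `is_error` field is the flag Anthropic's tool_result block accepts.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21, "sky": "clear"})  # stub


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(content_blocks, registry=TOOL_REGISTRY):
    """Build one tool_result per tool_use block, flagging failures with is_error."""
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue  # text blocks need no answer
        result = {"type": "tool_result", "tool_use_id": block.id}
        func = registry.get(block.name)
        if func is None:
            result.update(content=f"Unknown tool: {block.name}", is_error=True)
        else:
            try:
                result["content"] = func(**block.input)
            except Exception as exc:  # report the failure instead of crashing the loop
                result.update(content=f"{type(exc).__name__}: {exc}", is_error=True)
        results.append(result)
    return results


# Stand-ins for response content: one text block and two tool calls,
# the second with a wrong argument name.
blocks = [
    SimpleNamespace(type="text", text="Let me check."),
    SimpleNamespace(type="tool_use", id="toolu_01", name="get_weather",
                    input={"city": "Tokyo"}),
    SimpleNamespace(type="tool_use", id="toolu_02", name="get_weather",
                    input={"town": "Oslo"}),
]
results = run_tool_calls(blocks)
print([r.get("is_error", False) for r in results])
```

Every tool_use block gets exactly one result, so the pairing rule holds even when a call fails, and the model sees the error text and can retry with corrected arguments.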