Tool-Calling CLI Assistant

Build a CLI assistant from scratch — raw SDK only, no frameworks — that answers questions using three tools: calculator, get_current_time, and read_file. This is the atom every later lab is built from. Starter code lives in labs/lab01-agent-loop/.

What you're building

A terminal REPL: the user types a question, your loop calls the model with three tool schemas, executes whatever the model requests, feeds results back, and prints the final answer — plus a running token/cost report. It must survive multi-step questions like "what's 3 more than the number in numbers.txt?" (read_file → calculator → answer).

LLMreason + decideToolsexecuteUser taskFinal answerloop until done ↺tool_call(name, args)tool result → appended to messages
Your lab in one picture: loop until stop_reason is end_turn.

Suggested structure

skeleton (fill in the TODOs)
# tools.py — implementations + schemas
def calculator(expression: str) -> str:
    # SAFELY evaluate arithmetic. No eval() on raw input —
    # use ast.literal_eval-style parsing or a tiny recursive parser.
    ...

def get_current_time(timezone: str = "UTC") -> str: ...
def read_file(path: str) -> str:
    # constrain to the working directory; return a clear error string
    # (not an exception) when the file doesn't exist
    ...

# agent.py — the loop
def run_turn(messages, budget):
    while True:
        resp = call_with_retries(lambda: client.messages.create(
            model=MODEL, max_tokens=1024, tools=SCHEMAS, messages=messages))
        budget.add(resp.usage)                 # track every call
        if resp.stop_reason != "tool_use":
            return resp
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": execute_all(resp.content)})
Design decisions that matter: tool errors are returned as strings (with is_error: true) so the model can recover; the calculator must not eval() arbitrary input; the loop needs a max-iteration guard so a confused model can't spin forever.
Acceptance criteria — all must pass
  • Raw SDK only (anthropic or openai package) — no LangChain or agent frameworks
  • Multi-turn tool use works: a question requiring two sequential tool calls succeeds ("what's 3 more than the number in numbers.txt?")
  • Tool errors (file not found, division by zero) are returned to the model, which recovers gracefully — the loop never crashes
  • Prints total tokens + estimated cost per session (rates pulled into one constant you can update)
  • Retries API errors with exponential backoff + jitter (max 3 attempts), never retries 400s
  • test_agent.py passes
Stretch goals
  • Stream the final answer token-by-token while still handling tool-use turns
  • Add prompt caching with a cache breakpoint after the system prompt + tools, and log cache-read savings
  • Practical test: re-implement the minimal one-tool loop from memory in under 30 minutes

Be honest — the gates only mean something if the criteria really pass.