Multi-Agent Research System with Single-Agent Baseline
Build a planner → parallel searchers → writer → critic research system in LangGraph that answers questions with a cited brief — checkpointed, resumable, with a human approval gate and logged structured handoffs — then benchmark it honestly against a single agent with the same tools. Starter code lives in labs/lab05-multi-agent/.
What you're building
A research question goes in; a cited brief comes out. The planner decomposes the question into 2–3 subtasks; parallel searchers (web or a provided corpus) gather evidence, one clean context each; the writer integrates findings into a draft; the critic reviews against a rubric with at most one revision loop; a human gate pauses the graph for approve/reject-with-feedback before the final answer. Every handoff is a structured brief, logged to JSONL. Then the part that makes this a portfolio piece: the same 10 questions through a single agent with the same tools, and an honest comparison table.
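A structured handoff can be as small as a typed record appended to a JSONL file. The sketch below is one possible shape, not the starter code's. It swaps the pydantic model the starter names for a stdlib dataclass so it runs without extra dependencies, and the field names (`sender`, `receiver`, `task`, `key_findings`, `sources`) and the timestamp key `ts` are assumptions made for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class HandoffBrief:
    """Hypothetical handoff payload; every field name here is illustrative."""
    sender: str
    receiver: str
    task: str
    key_findings: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)


def log_handoff(brief: HandoffBrief, path: str = "handoffs.jsonl") -> dict:
    """Append one brief as a single JSON line and return the logged record."""
    record = {"ts": time.time(), **asdict(brief)}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Passing a brief like this instead of a raw transcript keeps each searcher's context clean, and the append-only log gives you one line per handoff to inspect after a run.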
Suggested structure
# state.py
import operator
from typing import Annotated, TypedDict
class ResearchState(TypedDict):
    question: str
    plan: list[str]
    findings: Annotated[list[str], operator.add]  # parallel-safe
    draft: str
    critique: str
    revision_count: int
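# handoff.py: HandoffBrief (pydantic) + log_handoff() -> handoffs.jsonl
# graph.py
from langgraph.graph import END, START, StateGraph

# Node functions (planner, searcher, writer, critic, human_gate) and the two
# routers are assumed to be defined elsewhere in your lab code.
builder = StateGraph(ResearchState)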
builder.add_node("planner", planner) # decompose via structured output
builder.add_node("searcher", searcher) # fanned out per subtask
builder.add_node("writer", writer)
builder.add_node("critic", critic)
builder.add_node("human_gate", human_gate) # interrupt() lives here
builder.add_edge(START, "planner")
# TODO: fan-out dispatch planner -> N searcher runs (Send-style API)
builder.add_edge("searcher", "writer")
builder.add_edge("writer", "critic")
builder.add_conditional_edges("critic", route_after_critic,
                              {"revise": "writer", "gate": "human_gate"})
builder.add_conditional_edges("human_gate", route_after_gate,
                              {"revise": "writer", "done": END})
graph = builder.compile(checkpointer=sqlite_checkpointer) # durable!
# baseline.py: single agent, SAME tools, same model, same 10 questions
# compare.py: runs both, sums usage across ALL calls, judges pairwise
revision_count caps the critic loop at one revision per the spec.
- ☐LangGraph graph with: planner node (decomposes the question), 2–3 parallel search workers (web or corpus), writer, and critic with at most one revision loop
- ☐Typed state schema; checkpointing enabled; the run can resume after a killed process
- ☐HITL interrupt: before the final answer, the graph pauses for human approve or reject-with-feedback
- ☐Handoffs pass structured briefs (not raw transcripts); every handoff payload is logged
- ☐Baseline comparison: the same 10 questions through a single agent with the same tools; report quality (LLM-as-judge + your own rubric), cost, and latency for both, with an honest conclusion — if single-agent wins, say so
- ☐README with an architecture diagram and the comparison table
- ◇Time-travel demo: use the checkpoint history to fork a run from before the writer step with a modified plan, and diff the outcomes
- ◇Add a token/cost budget to state that any node can trip, terminating the graph gracefully with a partial-results report
- ◇Practical test rehearsal: kill the process mid-run on camera, resume from the checkpoint, and narrate your baseline numbers as if defending them in an interview
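The suggested graph references `route_after_critic` and `route_after_gate` without defining them. A minimal sketch of both routers follows, under several assumptions that are not in the spec: the critic writes an empty `critique` when the draft passes, the writer increments `revision_count` each time it revises, and the human gate records its decision in a hypothetical `approved` key that the original state schema does not include.

```python
from typing import Literal, TypedDict

MAX_REVISIONS = 1  # the spec allows at most one critic-driven revision


class RoutingView(TypedDict, total=False):
    """The slice of state the routers read; `approved` is a hypothetical key."""
    critique: str
    revision_count: int
    approved: bool


def route_after_critic(state: RoutingView) -> Literal["revise", "gate"]:
    """Send the draft back to the writer only if the critic objected
    and the revision budget is not yet spent."""
    has_objection = bool(state.get("critique", "").strip())
    under_cap = state.get("revision_count", 0) < MAX_REVISIONS
    return "revise" if has_objection and under_cap else "gate"


def route_after_gate(state: RoutingView) -> Literal["revise", "done"]:
    """Finish on human approval; otherwise return to the writer."""
    return "done" if state.get("approved", False) else "revise"
```

The return values match the keys of the mapping dictionaries passed to `add_conditional_edges` in the suggested graph. Keeping the routers as pure functions of state also means the revision cap can be unit-tested without running any model calls.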
Be honest — the gates only mean something if the criteria really pass.
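For the baseline comparison, the cost figure is only fair if it sums usage over every model call in a run: planner, each searcher, writer, critic, and any revision. The sketch below shows one way to total it. The `CallUsage` record, its field names, and the per-1k-token prices passed in are all assumptions for illustration; substitute whatever usage metadata and pricing your provider actually reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallUsage:
    """One model call; field names are assumptions for this sketch."""
    input_tokens: int
    output_tokens: int
    latency_s: float


def summarize(calls: list[CallUsage],
              usd_per_1k_in: float,
              usd_per_1k_out: float) -> dict:
    """Total tokens, cost, and summed per-call latency across ALL calls in one run."""
    tokens_in = sum(c.input_tokens for c in calls)
    tokens_out = sum(c.output_tokens for c in calls)
    cost = tokens_in / 1000 * usd_per_1k_in + tokens_out / 1000 * usd_per_1k_out
    return {
        "calls": len(calls),
        "input_tokens": tokens_in,
        "output_tokens": tokens_out,
        "cost_usd": round(cost, 6),
        "summed_call_latency_s": round(sum(c.latency_s for c in calls), 3),
    }
```

Summed call latency is not wall-clock time: parallel searchers overlap, so time the whole run separately with a timer around the graph invocation and report both numbers for the multi-agent system and the baseline.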