Capstone — Autonomous Coding Agent (Issue → Tested, HITL-Gated PR)
Build the portfolio anchor: an agent that takes a GitHub issue, explores the repo, implements a fix in a sandbox, writes a reproducing test and iterates to green with bounded retries, and opens a draft PR gated on human approval. Ship it with full tracing, a per-issue cost report, an evaluation across ≥10 issues with a partial-success taxonomy, and a frank limitations doc. Then turn it into Gate G4 interview material.
What you're building
An end-to-end coding agent scoped honestly ('simple bug-fix issues in Python repos <10k LOC with a pytest suite'). It ingests an issue, maps the repo, and states its understanding and plan (checkpointed). It edits in a git worktree or container (never the real tree), writes a reproducing test, and iterates from red to green for at most 5 attempts, running the repo's test suite on each attempt. It opens a draft PR only after a human has reviewed and approved the diff, test results, and cost. It's instrumented with Module 7 tracing and reports cost per issue.
Suggested weekly plan
- W21: harness + repo exploration + plan generation working end-to-end on one toy issue.
- W22: sandboxed editing + test running; first full red→green fix.
- W23: retry loops, failure handling, HITL PR gate.
- W24: eval set of 10 issues; run, measure, fix the top failure mode. (Also: full take-home simulation this week.)
- W25: eval rerun, results write-up, README polish, demo recording.
- W26: buffer + Gate G4 full mock loop.
```python
# capstone/pipeline.py: glue for the six stages
from langfuse import observe  # tracing from Module 7


@observe()
def solve_issue(issue_url_or_path: str, repo_full: str) -> dict:
    issue = load_issue(issue_url_or_path)  # TODO: GitHub API or local file
    workspace = make_sandbox(repo_full)    # TODO: git worktree OR container
    try:
        plan = explore_and_plan(issue.text)  # checkpointed to plan.json
        outcome = repair(plan, issue.text)   # bounded red->green loop (max 5)
        tests = run_tests()                  # final verification
        diff = git_diff(workspace)           # TODO: git diff in the sandbox
        cost = current_trace_cost()          # TODO: pull from your tracing
        if outcome["status"] == "success" and tests["passed"]:
            url = open_pr_if_approved(       # HITL gate: draft PR only on 'y'
                repo_full, branch=workspace.branch, base="main",
                title=pr_title(issue), body=pr_body(issue, outcome),
                diff=diff, cost_usd=cost,
            )
            return {"issue": issue.id, "result": "pr_proposed", "url": url,
                    "cost_usd": cost}
        # Never report "success" when the final, independent test run failed.
        result = outcome["status"]
        if result == "success":
            result = "final_verification_failed"
        return {"issue": issue.id, "result": result, "cost_usd": cost}
    finally:
        cleanup_sandbox(workspace)  # discard the throwaway tree
```
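The HITL gate called in the pipeline above could look roughly like this. It is a sketch under stated assumptions: the keyword arguments mirror the call site, while `test_summary`, `ask`, and `create_draft_pr` are hypothetical additions so the gate can show test results and be exercised without a terminal or a GitHub token.

```python
from typing import Callable, Optional


def open_pr_if_approved(
    repo_full: str,
    *,
    branch: str,
    base: str,
    title: str,
    body: str,
    diff: str,
    cost_usd: float,
    test_summary: str = "(not provided)",            # hypothetical extra field
    ask: Callable[[str], str] = input,               # injectable for tests
    create_draft_pr: Optional[Callable[..., str]] = None,  # your PR backend
) -> Optional[str]:
    """Show the evidence, ask a human, and only then create a draft PR."""
    print(f"Repo:  {repo_full} ({branch} -> {base})")
    print(f"Title: {title}")
    print(f"Tests: {test_summary}")
    print(f"Cost:  ${cost_usd:.2f}")
    print("Diff:")
    print(diff)

    answer = ask("Open a draft PR with this change? [y/N] ").strip().lower()
    if answer != "y":
        return None  # anything other than an explicit 'y' is a refusal
    if create_draft_pr is None:
        raise RuntimeError("approved, but no PR backend was supplied")
    return create_draft_pr(repo_full, branch=branch, base=base,
                           title=title, body=body, draft=True)
```

The default answer is refusal: an empty line, a typo, or a closed stdin must never result in a PR. Keeping the PR backend behind a callable also means the gate can be unit-tested with a fake.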
```python
# capstone/evaluate.py: run across the issue set, score with the taxonomy
def evaluate(issue_set) -> dict:
    results = []
    for spec in issue_set:  # TODO: >= 10, real + seeded
        run = solve_issue(spec.url, spec.repo)
        results.append(score_to_taxonomy(run, spec.ground_truth))
    return report(results)  # success rate + taxonomy + cost/time
```
The finally: cleanup_sandbox step is not optional; the agent WILL make bad edits, and they must die with the throwaway tree. The evaluate module reuses the exact same pipeline, scored against ground truth you control, producing the numbers your README and interviews need.
- ▸This is the portfolio anchor: issue → explore → sandboxed fix → red→green tests → HITL-gated draft PR, fully traced.
- ▸Checkpoint the plan; bound the repair loop; verify tests independently of the model's claim.
- ▸Sandbox every edit (worktree/container) and gate every PR behind human approval — non-negotiable.
- ▸Evaluate on ≥10 issues (real + seeded) with a partial-success taxonomy and cost/time per issue.
- ▸Ship a frank limitations doc and a README with architecture, demo, and results.
- ▸Convert it all into Gate G4 material: design answers, five STAR stories with numbers, take-home strategy.
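The takeaway about bounding the repair loop and verifying tests independently of the model's claim can be sketched as follows. This is an illustration, not the course's `repair` (whose real signature takes a plan and the issue text): `propose_fix` and `run_tests` are hypothetical callables, and `run_pytest` assumes the sandbox has pytest installed.

```python
import subprocess
import sys
from typing import Callable, Optional

MAX_ATTEMPTS = 5  # hard cap from the requirements


def run_pytest(workdir: str, timeout_s: int = 600) -> dict:
    """Run pytest ourselves; the process exit code is the only source of truth."""
    try:
        done = subprocess.run([sys.executable, "-m", "pytest", "-q"],
                              cwd=workdir, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"passed": False, "output": f"timed out after {timeout_s}s"}
    # pytest exits with 0 only when tests were collected and all of them passed.
    return {"passed": done.returncode == 0, "output": done.stdout[-4000:]}


def repair(
    propose_fix: Callable[[Optional[str]], None],
    run_tests: Callable[[], dict],
    max_attempts: int = MAX_ATTEMPTS,
) -> dict:
    """Edit, re-run the tests, feed failures back; stop at the cap."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        propose_fix(feedback)   # model edits the sandbox; its claims are ignored
        result = run_tests()    # independent verification, outside the model
        if result["passed"]:
            return {"status": "success", "attempts": attempt}
        feedback = result["output"]  # only real test output goes back in
    return {"status": "retries_exhausted", "attempts": max_attempts}
```

In a real run you would pass something like `lambda: run_pytest(str(workspace.path))` as `run_tests`. Truncating the output keeps the feedback prompt bounded, and the timeout stops a hanging test from eating the budget.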
- ☐Input: accepts a GitHub issue URL or a local issue file (title, body, repro steps)
- ☐Explore: maps the repo, locates relevant code, and states its bug understanding and plan; the plan is checkpointed
- ☐Implement: writes the fix in a sandboxed workspace (git worktree or container) — never the real tree
- ☐Verify: runs the repo's test suite and writes at least one new test reproducing the issue (red → green); iterates on failures with a hard cap of 5 attempts
- ☐Deliver: opens a draft PR (or produces a patch + PR description) gated on HITL approval showing the diff, test results, and cost
- ☐Observe: full tracing with a per-issue cost report
- ☐Evaluate: runs on ≥10 issues (mix of real and seeded bugs you write) and reports success rate, a partial-success taxonomy, median cost/time per issue, and a failure-analysis table
- ☐Document: README with architecture diagram, demo GIF or trace walkthrough, results, and a frank limitations section
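For the Evaluate requirement, the aggregation step might look like the sketch below. It assumes each scored run is a dict with `outcome`, `cost_usd`, and `seconds` keys, and that `"success"` is the label for a full pass; these field names and any other taxonomy labels you feed in are hypothetical, so substitute your own.

```python
from collections import Counter
from statistics import median


def report(results: list[dict]) -> dict:
    """Aggregate scored runs into the headline numbers for the README."""
    if not results:
        raise ValueError("no results to report")
    outcomes = Counter(r["outcome"] for r in results)
    n = len(results)
    return {
        "n_issues": n,
        "success_rate": outcomes["success"] / n,
        # Full breakdown, so partial successes are visible instead of
        # being folded into a single pass/fail number.
        "taxonomy": dict(outcomes),
        "median_cost_usd": median(r["cost_usd"] for r in results),
        "median_seconds": median(r["seconds"] for r in results),
    }
```

Medians are used rather than means because a single runaway issue that burns all five attempts would otherwise dominate the cost and time figures. With only around ten issues, report the raw counts next to the rate.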
- ◇Support multi-file fixes and measure how the wrong-location rate changes versus single-file issues
- ◇Add a cheaper model for exploration and the stronger model only for repair; report the cost/quality trade-off from your traces
- ◇Assemble and rehearse the full Gate G4 package: whiteboard the agent loop, RAG+eval, memory, multi-agent trade-offs, and injection defenses in <5 min each, plus five STAR stories anchored in this project
Be honest — the gates only mean something if the criteria really pass.