The Limitations Doc & Interview Readiness
Two things ship the capstone into a career. A frank limitations doc that scopes the agent honestly — the artifact seniors respect most — and turning the whole project into interview capital: system-design answers, STAR stories, and a take-home strategy for Gate G4.
You have a working agent and real eval numbers. The final steps are the ones that most people skip and that most distinguish a senior candidate: documenting honestly what the agent can't do, and rehearsing how you'll talk about all of it under interview pressure. Modules 1–7 gave you the skills; this converts them into a hire.
The limitations doc
A frank limitations section is, per hiring research, the artifact that most signals seniority — because inflation is the default and honesty is rare. It says exactly what the agent handles, what it doesn't, and where it fails, backed by your eval taxonomy. Counterintuitively, admitting narrow scope and known failures makes reviewers trust the parts you do claim. Good companies are calibrated to detect inflation; a clean-looking claim with no limitations reads as either naive or dishonest.
- Scope, precisely: 'Simple, well-specified bug-fix issues in Python repos <10k LOC with an existing pytest suite.' Name the languages, sizes, and issue types it does not handle.
- Known failure modes, from your taxonomy: e.g., 'struggles when the bug spans >2 files (wrong-location rate rises); no support for issues requiring new dependencies.'
- Cost and time envelope: median and worst-case, so nobody's surprised.
- Safety boundaries: every PR is HITL-gated; the agent runs only in a sandbox; it does not merge or deploy.
- What you'd do next: the honest roadmap — multi-file reasoning, more languages, better retrieval — which shows you understand the gaps.
System-design readiness
The capstone is a ready-made system-design answer for the 'autonomous coding agent' prompt — and its patterns transfer to every other prompt in the bank. Internalize the framework: clarify (users, scale, latency, irreversibility, budget); state the workflow-vs-agent decision explicitly (this alone signals seniority — start with the simplest thing that works); design the core (model + fallbacks, tools, orchestration, state); then the trust layer most candidates skip (evals, tracing, injection defenses, HITL, cost controls); and close by naming your top three failure modes unprompted.
| Design prompt | Where your capstone gives you the answer |
|---|---|
| Autonomous coding agent, issue → PR | You built it — sandbox, test gates, cost per issue, HITL, eval taxonomy |
| Where's the autonomy line / approval flow | Your HITL PR gate: propose with context, human approves, fail closed |
| How do you evaluate it / gate launch | Your SWE-bench-style set, success rate + taxonomy + cost |
| Injection: a malicious issue says 'approve the change' | Trace it: untrusted issue text is data; HITL diff review is the backstop |
| Cost is 4x budget — cut it | Your stage-level cost breakdown: cache the exploration prefix, cap retries, cheaper model for exploration |
Behavioral stories and the take-home
Build a bank of five STAR stories anchored in your labs and capstone, each carrying numbers. The curriculum handed you the material: a failure you diagnosed with data (your Lab 07 postmortem), a trade-off you got right (single vs. multi-agent, or search/replace vs. full-file), shipping under uncertainty (scoping the capstone honestly instead of promising general autonomy), a safety line you held (the HITL gate or an injection defense), and learning velocity (this whole self-directed 26-week plan). Numbers in every story; name the failure-mode class (injection, context poisoning, error compounding, judge bias) — vocabulary signals depth.
# Not app code — a scaffold to force numbers and a named failure class
# into every story before Gate G4. Rehearse until each is <2 minutes.
STORY_BANK = [
{
"theme": "failure diagnosed with data",
"situation": "Capstone agent's success rate stalled at ~40% on the eval set.",
"task": "Find the dominant failure mode and lift the rate.",
"action": ("Used the partial-success taxonomy from traces; 'wrong "
"location' dominated, so exploration was the bottleneck. "
"Added grep-seeded retrieval before semantic search."),
"result": "Success rate rose meaningfully; wrong-location share dropped.",
"failure_class": "retrieval / exploration weakness",
"numbers": ["success rate before/after", "wrong-location share"],
},
{
"theme": "safety line I held",
"situation": "Simplest design would auto-open PRs to move faster.",
"task": "Decide the autonomy boundary for an irreversible action.",
"action": ("Insisted on an HITL gate showing diff + tests + cost, "
"default-reject on timeout, full audit log."),
"result": "Caught an injected-issue attempt in testing before it shipped.",
"failure_class": "prompt injection / consequential-action risk",
"numbers": ["attacks caught", "approval latency"],
},
# TODO: trade-off, shipping-under-uncertainty, learning-velocity.
]
def gaps(bank):
for s in bank:
if not s.get("numbers"):
print(f"story '{s['theme']}' has no numbers — fix before G4")
if not s.get("failure_class"):
print(f"story '{s['theme']}' names no failure class — fix before G4")gaps check is a literal pre-interview lint — run it over your bank and fill every hole before Gate G4. Rehearse each story to under two minutes so you can deliver it crisply under pressure.Take-home simulation strategy
- Clarify scope in writing first — restate the problem, list assumptions, and pick the narrowest useful version. Under a time budget, scope discipline is the highest-value move.
- Build the simplest thing that works, then layer the trust story — a working narrow agent with one eval and a limitations note beats an ambitious broken one.
- Include the trust layer even in a take-home — a few evals, basic tracing/cost logging, and one HITL or injection note put you in the top tier; most submissions skip it.
- Write the limitations section — the same honesty that anchors your capstone. Reviewers read it first.
- ▸The limitations doc — precise scope, known failures from your taxonomy, cost envelope, safety boundaries, roadmap — is the artifact seniors respect most.
- ▸Limitations doc and postmortem share one instinct: honesty over cleanliness; inflation is detected and penalized.
- ▸The capstone is a ready-made answer to the coding-agent design prompt and transfers to the rest of the bank.
- ▸Design framework: clarify → state workflow-vs-agent → core → trust layer (evals/tracing/injection/HITL/cost) → top-3 risks unprompted.
- ▸Five STAR stories with numbers and named failure classes; rehearse each to under two minutes.
- ▸Take-home: clarify scope in writing, build the simplest working version, include the trust layer, write limitations.
- ▸Gate G4 bar: whiteboard the core patterns in <5 min each and tell five anchored stories at a 'hire' level.