PR Etiquette, HITL Gating & Evaluating a Coding Agent

A green test suite isn't a merge. The agent produces a well-formed PR gated on human approval, then you measure whether the whole thing actually works — a small SWE-bench-style eval set you assemble yourself, with a partial-success taxonomy and cost/time per issue.

Passing tests earns the right to propose a change, not to merge it. The delivery stage produces a proper pull request — a clear description linking the issue, the diff, and the test evidence — and puts it behind a human approval gate. Then comes the part that makes this a portfolio piece rather than a toy: you evaluate the agent honestly across many issues and report real numbers.

PR etiquette

  • A description a reviewer can act on: what the bug was, the root cause, what changed and why, and the test that now proves it — linking the original issue.
  • Open as a draft, not ready-to-merge — the agent proposes; a human decides.
  • Show the evidence: the diff, the before/after test results (red → green), and the cost of producing the fix.
  • Small, focused diffs: one issue, one logical change — the search/replace default pays off here.
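The checklist above can be turned into a small template so every PR the agent proposes carries the same sections. This is a minimal sketch, not a prescribed format: `FixSummary`, its fields, and `render_pr_body` are hypothetical names introduced here for illustration. The only GitHub behavior it relies on is that a line such as "Fixes #N" in a PR description links the PR to that issue.

```python
from dataclasses import dataclass


@dataclass
class FixSummary:
    """Hypothetical container for what the agent learned while fixing an issue."""
    issue_number: int
    bug: str            # what was broken, in one or two sentences
    root_cause: str     # why it was broken
    change: str         # what changed and why
    repro_test: str     # id of the test that now proves the fix
    tests_before: str   # summary line of the red run
    tests_after: str    # summary line of the green run
    cost_usd: float     # cost of producing the fix


def render_pr_body(s: FixSummary) -> str:
    """Render a reviewer-ready PR description with the evidence up front."""
    return "\n".join([
        f"Fixes #{s.issue_number}",   # closing keyword links the original issue
        "",
        "## What was broken",
        s.bug,
        "",
        "## Root cause",
        s.root_cause,
        "",
        "## What changed and why",
        s.change,
        "",
        "## Evidence",
        f"- Reproducing test: `{s.repro_test}`",
        f"- Before: {s.tests_before}",
        f"- After: {s.tests_after}",
        f"- Cost to produce: ${s.cost_usd:.3f}",
    ])
```

Keeping the body as structured fields rather than free text makes it easy to show the same evidence to the human approver and to store it with the eval results.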
The PR is a hard-to-undo action: gate it
Opening a PR against a real repo is exactly the kind of consequential, hard-to-undo action Module 7's HITL pattern exists for: a PR can be closed, but the notifications and CI runs it triggers cannot be recalled. The gate shows the approver the diff, test results, and cost so they can decide in seconds; nothing hits GitHub without an explicit yes; the decision is logged; a timeout defaults to reject. This is also your defense if a malicious issue tries to inject instructions, because a human sees the diff before anything ships.
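One way to get the "timeout defaults to reject" behavior from a terminal prompt is to read the answer on a background thread and wait for it with a deadline. This is a minimal sketch under stated assumptions: `approve_with_timeout` and its `read_answer` parameter are hypothetical names, and a real deployment would more likely put the deadline in a web approval service than in a thread.

```python
import queue
import threading
from typing import Callable


def approve_with_timeout(prompt: str, timeout_s: float,
                         read_answer: Callable[[str], str] = input) -> bool:
    """Return True only if an explicit 'y' arrives before the deadline."""
    answers: "queue.Queue[str]" = queue.Queue(maxsize=1)

    def worker() -> None:
        try:
            answers.put(read_answer(prompt))
        except EOFError:
            pass  # closed stdin leaves the queue empty, which rejects below

    # Daemon thread: an unanswered prompt must not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()
    try:
        answer = answers.get(timeout=timeout_s)
    except queue.Empty:
        return False  # no answer in time: fail closed
    return answer.strip().lower() == "y"
```

Anything other than a timely "y" (silence, "n", an empty line, a closed input stream) is treated as a rejection, which is the fail-closed contract described above.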
producing the PR behind an HITL gate (PyGithub)
import os

from github import Auth, Github   # PyGithub (Auth is available in recent releases)

# audit() is assumed to be the append-only logging helper from Module 7;
# it is not defined in this listing.

def open_pr_if_approved(repo_full: str, branch: str, base: str,
                        title: str, body: str, diff: str,
                        cost_usd: float) -> str:
    # 1) Surface everything the human needs to approve in seconds.
    print("=== PROPOSED PR ===")
    print(title)
    print(body)
    print(f"cost to produce: ${cost_usd:.3f}")
    print(diff[:4000])            # trimmed diff preview
    decision = input("open this draft PR? [y/N] ").strip().lower()

    # Log the decision on both paths, before acting on it.
    audit({"event": "pr_decision", "repo": repo_full, "branch": branch,
           "approved": decision == "y", "cost_usd": cost_usd})
    if decision != "y":
        return "rejected by human: no PR opened"

    # 2) Only on explicit yes do we call the GitHub API, and only as a DRAFT.
    #    create_pull needs `branch` to already exist on the remote; to keep the
    #    "nothing hits GitHub without a yes" contract, push it after approval
    #    and before this call (push step not shown).
    gh = Github(auth=Auth.Token(os.environ["GITHUB_TOKEN"]))
    repo = gh.get_repo(repo_full)
    pr = repo.create_pull(title=title, body=body, head=branch,
                          base=base, draft=True)
    audit({"event": "pr_opened", "url": pr.html_url})
    return pr.html_url
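The gate is the whole point: the human sees title, body (which carries the test evidence), cost, and a trimmed diff, and only an explicit 'y' reaches create_pull, which opens the PR as a draft. Both the approve and the reject path write to the audit log, so there's a durable record of what the agent proposed and what a human decided. Note that input() blocks indefinitely, so this listing has no timeout of its own; a production gate needs one that rejects when it expires. In a team setting the input() becomes a web approval UI, but the contract is identical: propose with full context, execute only on approval.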
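The audit() helper called in the listing above is not defined in this lesson. A minimal sketch of what it might look like, assuming an append-only JSON Lines file is an acceptable audit store: the file name `audit.jsonl`, the `path` parameter, and the added `ts` field are illustrative assumptions, not the course's actual implementation.

```python
import json
import time
from pathlib import Path

AUDIT_PATH = Path("audit.jsonl")  # hypothetical location for the log


def audit(event: dict, path: Path = AUDIT_PATH) -> None:
    """Append one event to the audit log as a single JSON line."""
    record = {"ts": time.time(), **event}   # timestamp each decision
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Appending one self-contained line per event keeps earlier records untouched and makes the log easy to filter later, for example for every event whose "event" field is "pr_decision".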

Evaluating a coding agent

Now the honest question: how often does it actually work? You assemble your own small SWE-bench-style eval set — real issues from small OSS repos plus bugs you seed yourself (introduce a bug, write the issue, keep the known-good fix). The README target is ≥10 issues. For each, you know ground truth, so you can score automatically: does the agent's fix make the reproducing test pass without breaking existing tests? Report success rate, and don't stop there.
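An eval set like this is easier to keep honest if each case is a small record with its ground truth attached and the set is checked before any run. The sketch below is one possible shape, not a required schema: `EvalCase`, its field names, and `validate_eval_set` are hypothetical, and the default of 10 simply mirrors the README target mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    issue_id: str
    repo: str               # repository the issue belongs to
    source: str             # "real" (OSS issue) or "seeded" (bug you introduced)
    issue_text: str         # what the agent is given
    repro_test: str         # test that fails on the bug and passes with the fix
    known_good_patch: str   # path to the reference fix (ground truth)


def validate_eval_set(cases: list[EvalCase], min_cases: int = 10) -> list[str]:
    """Return a list of problems; an empty list means the set is usable."""
    problems: list[str] = []
    if len(cases) < min_cases:
        problems.append(f"only {len(cases)} cases, target is at least {min_cases}")
    ids = [c.issue_id for c in cases]
    if len(set(ids)) != len(ids):
        problems.append("duplicate issue_id values")
    for c in cases:
        if c.source not in {"real", "seeded"}:
            problems.append(f"{c.issue_id}: unknown source {c.source!r}")
        if not c.repro_test or not c.known_good_patch:
            problems.append(f"{c.issue_id}: missing ground truth")
    return problems
```

Failing fast on a case without a reproducing test or reference fix matters because automatic scoring is only as trustworthy as the ground truth behind it.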

[Figure: evaluation loop. Change (prompt / model / tool) → Eval suite (N cases, fixed) → Judge + asserts (LLM judge · unit checks). Ship only if the score holds; regressions block the merge. Plot: eval score per iteration against a target.]
Run the agent across your issue set, score each (success / partial / failure), aggregate rate, cost, and time.
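The loop in that caption can be sketched as a small harness that times each run and records one result per issue. Everything here is an assumption made for illustration: `run_eval`, the `run_agent` and `score` callables, the `cost_usd` attribute on the run object, and the extra "harness_error" label (which is outside the lesson's taxonomy and only marks runs where the harness itself crashed).

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class IssueResult:
    issue_id: str
    outcome: str
    cost_usd: float
    seconds: float


def run_eval(issue_ids: Iterable[str],
             run_agent: Callable[[str], object],
             score: Callable[[object], str]) -> list[IssueResult]:
    """Run the agent on every issue and collect outcome, cost, and wall time."""
    results: list[IssueResult] = []
    for issue_id in issue_ids:
        start = time.perf_counter()
        try:
            run = run_agent(issue_id)                     # hypothetical agent entry point
            outcome = score(run)                          # taxonomy label for this run
            cost = float(getattr(run, "cost_usd", 0.0))   # assumed attribute
        except Exception:
            # One crashing issue should not abort the whole eval.
            outcome, cost = "harness_error", 0.0
        elapsed = time.perf_counter() - start
        results.append(IssueResult(issue_id, outcome, cost, elapsed))
    return results
```

Keeping the issue list fixed between iterations is what makes scores comparable when you change the prompt, model, or tools.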

The partial-success taxonomy

Binary pass/fail hides everything interesting. Coding agents fail in structured ways, and naming those categories is what turns raw numbers into insight — and into interview vocabulary. Track where each run landed:

Outcome               | Meaning                                                | What it tells you
Full success          | Reproducing test passes, existing tests pass           | The happy path
Wrong location        | Edited the wrong file/function; never touched the bug  | Exploration/retrieval is weak
Fix without test      | Bug fixed but no reproducing test written              | Repair loop isn't enforcing red-first
Regression introduced | New test passes but broke existing ones                | Fix too broad; needs tighter, more localized edits
Exhausted retries     | Hit the retry cap still failing                        | Issue beyond current scope; good limitations-doc material
scoring the eval set with a partial-success taxonomy
from dataclasses import dataclass
from statistics import median

@dataclass
class IssueResult:
    issue_id: str
    outcome: str          # one of the taxonomy categories
    cost_usd: float
    seconds: float

def score_issue(run) -> str:
    # Checks are ordered to match the taxonomy table. Assumes
    # repro_test_passed refers to the eval set's ground-truth reproducing
    # test, so it is meaningful even when the agent wrote no test itself.
    if run.status == "exhausted":
        return "exhausted_retries"
    after = run.tests_after
    if not after.repro_test_passed:
        return "wrong_location"               # never actually fixed it
    if after.existing_failed:                 # broke something that passed
        return "regression_introduced"
    if not run.wrote_reproducing_test:
        return "fix_without_test"             # fixed, but skipped red-first
    return "full_success"

def report(results: list[IssueResult]) -> dict:
    n = len(results)
    if n == 0:
        raise ValueError("report() needs at least one IssueResult")
    succ = sum(r.outcome == "full_success" for r in results)
    by_outcome: dict[str, int] = {}
    for r in results:
        by_outcome[r.outcome] = by_outcome.get(r.outcome, 0) + 1
    return {
        "n_issues": n,
        "success_rate": round(succ / n, 3),
        "taxonomy": by_outcome,
        "median_cost_usd": round(median(r.cost_usd for r in results), 3),
        "median_seconds": round(median(r.seconds for r in results), 1),
    }
The scoring function encodes the taxonomy as a decision tree over ground truth you control: did it run out of retries, fail to fix the target, break existing tests, or skip the reproducing test? Reporting the outcome histogram alongside the headline success rate is what a senior reviewer wants, because it shows you know how your agent fails, not just how often. Median cost and time per issue are the operational numbers that tend to come up in hiring conversations.
Cost analysis that reads as senior
Report median (and worst-case) cost per issue and per outcome — successes are often cheaper than exhausted runs that retried five times. Break cost down by stage (exploration vs. repair) using your traces from Module 7; usually exploration or a runaway repair loop dominates. 'Median $X per successful fix, exhausted runs cost ~3x' is a sentence that lands in an interview.
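The per-outcome breakdown described here takes only a few lines once each run has an outcome label and a cost. A minimal sketch, assuming results are available as (outcome, cost) pairs; `cost_by_outcome` and the keys of the returned dictionaries are names chosen for this example.

```python
from collections import defaultdict
from statistics import median


def cost_by_outcome(results: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group costs by outcome and report count, median, and worst case."""
    groups: dict[str, list[float]] = defaultdict(list)
    for outcome, cost_usd in results:
        groups[outcome].append(cost_usd)
    return {
        outcome: {
            "n": len(costs),
            "median_usd": round(median(costs), 3),
            "worst_usd": round(max(costs), 3),
        }
        for outcome, costs in groups.items()
    }
```

Dividing the median for exhausted runs by the median for full successes gives the "exhausted runs cost N times more" figure; the same grouping applied to a per-stage cost field from your traces gives the exploration versus repair split.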
Key takeaways
  • Green tests earn a PR proposal, not a merge; open a draft with a reviewer-ready description linking the issue.
  • Opening a PR is a consequential, hard-to-undo action: gate it with HITL showing diff, test results, and cost; log the decision; fail closed on timeout.
  • Build your own small SWE-bench-style set (≥10 issues, real + seeded) with known ground truth to score automatically.
  • Report success rate AND a partial-success taxonomy (wrong location, fix-without-test, regression, exhausted).
  • Report median/worst cost and time per issue, broken down by stage using your traces.
  • Naming failure categories turns numbers into insight and into interview vocabulary.