Module 7: Evals, Observability & Safety · Lesson 2 of 5 · 30 min

LLM-as-Judge, Done Honestly

When correctness is subjective (faithfulness, helpfulness, tone), you reach for a model to grade a model. That's fine, but an unvalidated judge is a random number generator with a rubric. This lesson covers how to validate a judge against human labels and how to counter position, verbosity, and self-preference bias.

Some qualities can't be checked with a code assertion. Was the answer faithful to the retrieved documents? Was it helpful rather than technically-correct-but-useless? Was the tone appropriate for an upset customer? For these you use an LLM as a judge: a second model call that reads the input, the agent's output, and a rubric, and returns a score. It's powerful and cheap. It's also easy to fool yourself with.

⚠ An unvalidated judge is worthless

The single most common eval mistake is trusting a judge you never checked. If your judge agrees with human labels only 60% of the time on a balanced pass/fail set, where a coin flip already scores about 50%, its scores are barely better than noise. Worse, they're confidently noisy. You must measure judge-human agreement before you let a judge gate anything.

Validate the judge first

The recipe is not optional. Hand-label a set of examples — at least ~30 to start, more is better — with the verdict you actually want. Run your judge on the same examples. Compute agreement. If it's low, fix the rubric (add anchored definitions, concrete examples of pass and fail, tighter scales) and re-measure. Only once agreement clears a bar you set in advance — many teams target roughly 85%+ — do you trust the judge to run unattended. The judge is now a validated instrument; treat any later rubric edit as re-invalidating it.
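As one illustration of what "anchored definitions" and "concrete examples of pass and fail" can look like, here is a minimal sketch of a rubric builder. The function name `build_rubric`, its parameters, and the customer-support wording are all hypothetical; the lesson does not prescribe a rubric format.

```python
def build_rubric(criterion: str, pass_def: str, fail_def: str,
                 examples: list[tuple[str, str]]) -> str:
    """Assemble a pass/fail rubric with anchored definitions and worked examples.

    Hypothetical helper for illustration; adapt the wording to your own task.
    `examples` is a list of (answer_text, verdict) pairs, verdict "pass" or "fail".
    """
    for _, verdict in examples:
        if verdict not in ("pass", "fail"):
            raise ValueError(f"unknown verdict: {verdict!r}")
    lines = [
        f"You are grading one answer for: {criterion}.",
        "Return exactly one verdict: pass or fail.",
        "",
        f"PASS means: {pass_def}",
        f"FAIL means: {fail_def}",
        "",
        "Worked examples:",
    ]
    for text, verdict in examples:
        lines.append(f'- "{text}" -> {verdict}')
    lines.append("")
    # Explicit anchor against rewarding padding.
    lines.append("Judge correctness and relevance only; do not reward length.")
    return "\n".join(lines)


# Illustrative rubric for the "tone for an upset customer" case.
RUBRIC = build_rubric(
    criterion="tone appropriate for an upset customer",
    pass_def="acknowledges the frustration and offers a concrete next step.",
    fail_def="dismissive, defensive, or offers no concrete next step.",
    examples=[
        ("I'm sorry about the delay. I've issued a refund today.", "pass"),
        ("Delays happen. Please check our FAQ.", "fail"),
    ],
)
```

Keeping the rubric in one builder makes each edit an explicit, reviewable change, which is a natural trigger for re-measuring agreement.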

measuring judge-human agreement

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Each labeled example: the case id, the agent output, and the human verdict.
# labels = [{"id":..., "output":..., "human": "pass"/"fail"}, ...]

def run_judge(output: str, rubric: str) -> str:
    # Force a structured verdict via a tool schema (Module 1 pattern).
    resp = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=512,
        tools=[{
            "name": "record_verdict",
            "description": "Record the grading verdict for one answer.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "verdict": {"type": "string", "enum": ["pass", "fail"]},
                    "reason": {"type": "string"},
                },
                "required": ["verdict", "reason"],
            },
        }],
        tool_choice={"type": "tool", "name": "record_verdict"},
        system=rubric,
        messages=[{"role": "user", "content": f"Answer to grade:\n{output}"}],
    )
    block = next(b for b in resp.content if b.type == "tool_use")
    return block.input["verdict"]

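def agreement(labels: list[dict], rubric: str) -> float:
    """Fraction of labeled examples where the judge matches the human verdict."""
    if not labels:
        raise ValueError("need at least one labeled example")
    hits = 0
    for ex in labels:
        if run_judge(ex["output"], rubric) == ex["human"]:
            hits += 1
    return hits / len(labels)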

rate = agreement(labels, RUBRIC)
print(f"judge-human agreement: {rate:.0%}")
assert rate >= 0.85, "rubric not trustworthy yet — tune and re-measure"

Two things make this real: forcing a structured verdict so you never have to parse free-form text (the forced-tool-call trick from Module 1), and treating the agreement number as a gate. The assert at the end is the whole point: a judge below your bar does not get promoted to production. When you later change the rubric, you have changed the instrument, so you re-run this measurement.
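Raw agreement can flatter a judge when the labels are imbalanced: if 90% of the human labels are "pass", a judge that always says "pass" scores 90%. One common complement is Cohen's kappa, which subtracts the agreement expected by chance. The sketch below is an addition to the recipe above, not part of it; the function name `cohen_kappa` and the toy lists are hypothetical.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two lists of categorical verdicts."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Toy data: 9 of 10 human labels are "pass"; the judge says "pass" every time.
human = ["pass"] * 9 + ["fail"]
lazy_judge = ["pass"] * 10

raw = sum(h == j for h, j in zip(human, lazy_judge)) / len(human)
print(f"raw agreement: {raw:.0%}")
print(f"kappa: {cohen_kappa(human, lazy_judge):.2f}")
```

On this toy data the raw agreement is 90% while kappa is 0, which is the signature of a judge that has learned the base rate rather than the rubric. Reporting both numbers, plus the count of each label in the hand-labeled set, guards against that.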

The three biases that fool judges

Position bias: when comparing two answers, judges systematically favor whichever came first (or, for some models, second). Mitigation: run each comparison both ways and only count it if the verdict is consistent, or randomize order across the suite.
Verbosity bias: judges reward longer, more elaborate answers even when a short one is more correct. Mitigation: anchor the rubric explicitly on correctness and relevance, and penalize padding; consider length-controlled comparisons.
Self-preference bias: a judge tends to prefer outputs generated by itself or its own model family. Mitigation: use a different model as judge than the one under test where feasible, and keep humans in the sampling loop to catch drift.
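Before relying on these mitigations, it can help to measure how strongly a given judge shows the first two biases. The sketch below assumes a hypothetical `judge_call(prompt, first=..., second=...)` that returns 'first' or 'second', and a list of (prompt, answer_a, answer_b) tuples; the helper names are illustrative. `always_first` and `prefers_longer` are deliberately broken stubs standing in for a real model call.

```python
def position_flip_rate(judge_call, pairs) -> float:
    """Fraction of pairs whose winner changes when presentation order is swapped."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for prompt, a, b in pairs:
        forward = judge_call(prompt, first=a, second=b)
        backward = judge_call(prompt, first=b, second=a)
        winner_fwd = a if forward == "first" else b
        winner_bwd = b if backward == "first" else a
        if winner_fwd != winner_bwd:
            flips += 1
    return flips / len(pairs)


def longer_answer_pick_rate(judge_call, pairs) -> float:
    """Fraction of verdicts that pick the longer answer, judged in both orders.

    Pairs with equal-length answers are skipped. A rate well above what human
    labels show on the same pairs hints at verbosity bias.
    """
    picked_longer = counted = 0
    for prompt, a, b in pairs:
        if len(a) == len(b):
            continue
        longer = a if len(a) > len(b) else b
        # Judge both orders so position bias does not leak into this number.
        for first, second in ((a, b), (b, a)):
            verdict = judge_call(prompt, first=first, second=second)
            winner = first if verdict == "first" else second
            counted += 1
            if winner == longer:
                picked_longer += 1
    return picked_longer / counted if counted else 0.0


# Stub judges (hypothetical): one purely positional, one purely length-driven.
def always_first(prompt, first, second):
    return "first"


def prefers_longer(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"


pairs = [
    ("Reset password?", "Use the reset link.",
     "There are many ways one might think about resetting a password."),
    ("Refund window?", "30 days.",
     "Our refund window, broadly speaking, is thirty days in most cases."),
]

print(position_flip_rate(always_first, pairs))         # purely positional judge
print(longer_answer_pick_rate(prefers_longer, pairs))  # purely length-driven judge
```

A purely positional judge flips on every pair, and a purely length-driven judge picks the longer answer every time. Real judges land somewhere in between, so compare these rates against the same statistics computed from human verdicts on the same pairs.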

✦ Prefer pairwise over absolute scores

Asking a model 'rate this 1–10' produces mushy, drifty numbers — a 7 today is an 8 next week. Asking 'which of these two is better, A or B?' is far more stable and reliable. Pairwise comparison is the workhorse of honest LLM evaluation. Reserve absolute scores for coarse pass/fail gates, not fine ranking.

pairwise comparison with position-bias control

def pairwise(judge_call, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie', controlling for position bias."""
    # Run once in each order; the labels A/B track the ORIGINAL answers.
    order1 = judge_call(prompt, first=answer_a, second=answer_b)   # -> 'first'/'second'
    order2 = judge_call(prompt, first=answer_b, second=answer_a)
    if order1 not in ("first", "second") or order2 not in ("first", "second"):
        return "tie"                 # unexpected verdict: treat as no signal

    # Translate each verdict back to the original answer it points at.
    pick1 = "A" if order1 == "first" else "B"
    pick2 = "A" if order2 == "second" else "B"

    if pick1 == pick2:
        return pick1                 # consistent across orders, so trustworthy
    return "tie"                     # flipped with position: treat as no signal

def win_rate(cases, judge_call, candidate, baseline) -> float:
    """Candidate's win rate over the baseline; each case needs a .prompt and must be hashable."""
    if not cases:
        raise ValueError("need at least one case")
    wins = ties = 0
    for c in cases:
        # pairwise() already judges both orders, so no extra shuffling is
        # needed here; 'A' always refers to the candidate's answer.
        a, b = candidate[c], baseline[c]
        verdict = pairwise(judge_call, c.prompt, a, b)
        if verdict == "A":
            wins += 1
        elif verdict == "tie":
            ties += 1
    # Ties count as half; a fair coin lands near 0.5.
    return (wins + 0.5 * ties) / len(cases)

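The core trick: present each pair in both orders and only count a decisive verdict when the judge picks the same original answer regardless of position. If flipping the order flips the answer, the judge was reacting to position, not quality, so you score it a tie. A candidate prompt that clears ~0.55+ win rate against your baseline is real signal only if the set is large enough: at 100 cases the sampling noise on a win rate near 0.5 is roughly ±0.05 (one standard error), so 0.55 there is still within noise, while the same rate over several hundred cases is not. Hovering at 0.5 is not signal at any size.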
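To check whether a win rate like 0.55 is distinguishable from 0.5 on a given suite size, one option is a normal-approximation confidence interval, treating each case's score (1 for a win, 0.5 for a tie, 0 for a loss) as an independent sample. This is a rough sketch under that independence assumption; the function name `win_rate_interval` is hypothetical, and a bootstrap would be a reasonable alternative for small suites.

```python
import math


def win_rate_interval(wins: int, ties: int, losses: int, z: float = 1.96):
    """Return (win_rate, low, high) with ties counted as half a win.

    Uses a normal approximation; z=1.96 gives roughly a 95% interval.
    """
    n = wins + ties + losses
    if n == 0:
        raise ValueError("no cases")
    mean = (wins + 0.5 * ties) / n
    # Per-case scores are 1, 0.5, or 0, so the second moment is (wins + 0.25*ties)/n.
    variance = (wins + 0.25 * ties) / n - mean ** 2
    half_width = z * math.sqrt(max(variance, 0.0) / n)
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)


# Same 0.55 win rate at two suite sizes (illustrative counts, not real results).
for n_wins, n_ties, n_losses in [(50, 10, 40), (500, 100, 400)]:
    rate, low, high = win_rate_interval(n_wins, n_ties, n_losses)
    verdict = "signal" if low > 0.5 else "not distinguishable from 0.5"
    total = n_wins + n_ties + n_losses
    print(f"n={total}: {rate:.3f} [{low:.3f}, {high:.3f}] {verdict}")
```

With the same 0.55 win rate, the 100-case interval still spans 0.5 while the 1,000-case interval does not, so the suite size belongs next to every win rate you report.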

Key takeaways

▸Use an LLM judge only for genuinely subjective qualities — faithfulness, helpfulness, tone.
▸Validate the judge against human labels (~30+ examples) and report agreement before trusting it; target a bar you set in advance.
▸Position, verbosity, and self-preference are the three biases that will fool you.
▸Randomize/flip order to beat position bias; anchor the rubric to beat verbosity; cross-model + human sampling for self-preference.
▸Pairwise comparison beats absolute 1–10 scoring for stability; count ties as half.
▸Any rubric edit re-invalidates the judge — re-measure agreement.