LLM-as-Judge, Done Honestly
When correctness is subjective (faithfulness, helpfulness, tone), you reach for a model to grade a model. That's fine, but an unvalidated judge is a random number generator with a rubric. Here is how to validate it against humans, and how to counter position, verbosity, and self-preference bias.
Some qualities can't be asserted. Was the answer faithful to the retrieved documents? Was it helpful rather than technically-correct-but-useless? Was the tone appropriate for an upset customer? For these you use an LLM as a judge: a second model call that reads the input, the agent's output, and a rubric, and returns a score. It's powerful and cheap. It's also easy to fool yourself with.
Validate the judge first
The recipe is not optional. Hand-label a set of examples (at least ~30 to start, more is better) with the verdict you actually want. Run your judge on the same examples. Compute agreement: the fraction of examples where the judge's verdict matches the human label. If it's low, fix the rubric (add anchored definitions, concrete examples of pass and fail, tighter scales) and re-measure. Only once agreement clears a bar you set in advance (many teams target roughly 85%+) do you trust the judge to run unattended. The judge is now a validated instrument; treat any later rubric edit as invalidating it until you re-measure.
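One way to see why "more is better" for the labeled set is to put an interval around the measured agreement. The sketch below uses the standard Wilson score interval for a proportion; the function name `wilson_interval` and the example counts are illustrative assumptions, not part of the original recipe.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion hits/n (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical counts: the same ~87% agreement measured on 30 vs. 300 labels.
for hits, n in [(26, 30), (260, 300)]:
    lo, hi = wilson_interval(hits, n)
    print(f"{hits}/{n} agree: plausible true agreement {lo:.0%} to {hi:.0%}")
```

With 26 of 30 matching, the interval runs from roughly 70% to 95%, so a measured 87% does not yet show the judge clears an 85% bar. The same rate on 300 labels narrows the range to a few points either side.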
import anthropic

# Assumes ANTHROPIC_API_KEY is set in the environment.
client = anthropic.Anthropic()

# Each labeled example: the case, the human verdict, and the agent output.
# labels = [{"id":..., "output":..., "human": "pass"/"fail"}, ...]

def run_judge(output: str, rubric: str) -> str:
    # Force a structured verdict via a tool schema (Module 1 pattern).
    resp = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=512,
        tools=[{
            "name": "record_verdict",
            "description": "Record the grading verdict for one answer.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "verdict": {"type": "string", "enum": ["pass", "fail"]},
                    "reason": {"type": "string"},
                },
                "required": ["verdict", "reason"],
            },
        }],
        tool_choice={"type": "tool", "name": "record_verdict"},
        system=rubric,
        messages=[{"role": "user", "content": f"Answer to grade:\n{output}"}],
    )
    # tool_choice forces a tool_use block, so the verdict is always structured.
    block = next(b for b in resp.content if b.type == "tool_use")
    return block.input["verdict"]

def agreement(labels: list[dict], rubric: str) -> float:
    # Fraction of labeled examples where the judge matches the human verdict.
    hits = 0
    for ex in labels:
        if run_judge(ex["output"], rubric) == ex["human"]:
            hits += 1
    return hits / len(labels)

rate = agreement(labels, RUBRIC)
print(f"judge-human agreement: {rate:.0%}")
assert rate >= 0.85, "rubric not trustworthy yet; tune and re-measure"

The assert at the end is the whole point: a judge below your bar does not get promoted to production. When you later change the rubric, you have changed the instrument, so you re-run this measurement.
The three biases that fool judges
- Position bias: when comparing two answers, judges systematically favor whichever came first (or, for some models, second). Mitigation: run each comparison in both orders and treat a verdict that flips with the order as a tie, or randomize order across the suite.
- Verbosity bias: judges reward longer, more elaborate answers even when a short one is more correct. Mitigation: anchor the rubric explicitly on correctness and relevance, and penalize padding; consider length-controlled comparisons.
- Self-preference bias: a judge tends to prefer outputs generated by itself or its own model family. Mitigation: use a different model as judge than the one under test where feasible, and keep humans in the sampling loop to catch drift.
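Verbosity bias can be checked on pairwise verdicts you have already collected. The sketch below is a rough diagnostic rather than an established metric: it reports how often the judge's decisive verdicts went to the clearly longer answer. The `Comparison` record, the `longer_answer_win_rate` name, and the 20% word-count gap threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    answer_a: str
    answer_b: str
    verdict: str  # 'A', 'B', or 'tie', as returned by the judge


def longer_answer_win_rate(
    comparisons: list[Comparison], min_length_gap: float = 0.2
) -> Optional[float]:
    """Share of decisive verdicts won by the clearly longer answer.

    Pairs whose word counts differ by less than min_length_gap (relative to
    the longer answer) are skipped, as are ties. Returns None if no pair
    qualifies.
    """
    longer_wins = decisive = 0
    for c in comparisons:
        len_a, len_b = len(c.answer_a.split()), len(c.answer_b.split())
        longest = max(len_a, len_b)
        if c.verdict == "tie" or longest == 0:
            continue
        if abs(len_a - len_b) / longest < min_length_gap:
            continue  # lengths too similar to say anything about verbosity
        decisive += 1
        longer = "A" if len_a > len_b else "B"
        if c.verdict == longer:
            longer_wins += 1
    return longer_wins / decisive if decisive else None
```

A rate far above 0.5 is only a warning sign, since longer answers are sometimes genuinely better. Comparing it with the same rate computed from human verdicts on the same pairs is one way to tell judge bias from real quality differences.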
import random

def pairwise(judge_call, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie', controlling for position bias."""
    # Run once in each order; the labels A/B track the ORIGINAL answers.
    order1 = judge_call(prompt, first=answer_a, second=answer_b)  # -> 'first'/'second'
    order2 = judge_call(prompt, first=answer_b, second=answer_a)
    # Translate each verdict back to the original answer it points at.
    pick1 = "A" if order1 == "first" else "B"
    pick2 = "A" if order2 == "second" else "B"
    if pick1 == pick2:
        return pick1  # consistent across orders, so trustworthy
    return "tie"  # flipped with position; treat as no signal

def win_rate(cases, judge_call, candidate, baseline) -> float:
    wins = ties = 0
    for c in cases:
        # Randomize which answer is labeled A at the suite level too.
        swapped = random.random() < 0.5
        a, b = (baseline[c], candidate[c]) if swapped else (candidate[c], baseline[c])
        verdict = pairwise(judge_call, c.prompt, a, b)
        if verdict == ("B" if swapped else "A"):  # the candidate won
            wins += 1
        elif verdict == "tie":
            ties += 1
    # Ties count as half; a fair coin lands near 0.5.
    return (wins + 0.5 * ties) / len(cases)

- Use an LLM judge only for genuinely subjective qualities: faithfulness, helpfulness, tone.
- Validate the judge against human labels (~30+ examples) and report agreement before trusting it; target a bar you set in advance.
- Position, verbosity, and self-preference are the three biases that will fool you.
- Randomize or flip order to beat position bias; anchor the rubric to beat verbosity; use a cross-model judge plus human sampling against self-preference.
- Pairwise comparison beats absolute 1–10 scoring for stability; count ties as half.
- Any rubric edit invalidates the judge; re-measure agreement.