Module 7 · Phase 4: Production readiness · Weeks 18–20

Evals, Observability & Safety

The #1 senior differentiator. Anyone can demo an agent; seniors can prove it works, see why it fails, and stop it from doing damage. This module covers eval harnesses, tracing, cost dashboards, prompt-injection defense in depth, human-in-the-loop gates, and honest postmortems.

After this module you can

▸ Design an eval pyramid (deterministic assertions, validated LLM-as-judge, sampled human review) for a real agent
▸ Validate an LLM judge against human labels and report agreement before trusting it
▸ Build a regression suite that turns every fixed bug into a CI test case
▸ Instrument an agent with tracing so every LLM and tool call carries tokens, cost, and latency
▸ Diagnose a cost regression and a quality regression from traces alone
▸ Reason about prompt injection with the lethal-trifecta lens and layer real defenses
▸ Add a human-in-the-loop approval gate to any irreversible action, with an audit log
▸ Write a blameless postmortem: timeline, root cause, detection gap, fix, regression test

Lessons

The Eval Pyramid

You cannot improve what you cannot measure, and 'it looked good when I tried it' is not measurement. Three tiers of rigor — cheap deterministic assertions, validated LLM-as-judge, sampled human review — and how to decide what belongs in each.
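
As a rough illustration of the bottom tier and the routing decision, here is a minimal sketch. It assumes the agent returns a JSON object with `answer` and `sources` fields; those field names, the function names, and the 5% human sampling rate are hypothetical, not part of the course material.

```python
import json


def deterministic_failures(raw: str) -> list[str]:
    """Tier 1: cheap checks with no model call. Returns a list of failure messages."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    failures = []
    if "answer" not in data:          # hypothetical required field
        failures.append("missing 'answer' field")
    if not data.get("sources"):       # hypothetical required field
        failures.append("no sources cited")
    return failures


def tiers_for(raw: str, sample_roll: float, human_sample_rate: float = 0.05) -> list[str]:
    """Decide which tiers an output passes through.

    sample_roll is a number in [0, 1), e.g. from random.random().
    """
    if deterministic_failures(raw):
        return ["deterministic"]      # stop early: no judge tokens for malformed output
    tiers = ["deterministic", "llm_judge"]
    if sample_roll < human_sample_rate:
        tiers.append("human_review")  # only a small sample reaches the expensive tier
    return tiers
```

The point of the ordering is cost: anything a string or schema check can catch never reaches a judge, and only a small sample reaches a person.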

LLM-as-Judge, Done Honestly

When correctness is subjective — faithfulness, helpfulness, tone — you reach for a model to grade a model. That's fine, but an unvalidated judge is a random number generator with a rubric. How to validate it against humans, and how to beat position, verbosity, and self-preference bias.
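
One common way to report judge-versus-human agreement is Cohen's kappa, which corrects raw agreement for chance. The sketch below is an illustration under assumptions, not the module's prescribed method: the function names are hypothetical, and `judge` in the position check is assumed to be any callable that returns "first" or "second" for a pair of answers.

```python
from collections import Counter
from typing import Callable


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labelled at random with their own base rates.
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def position_consistent(judge: Callable[[str, str], str], a: str, b: str) -> bool:
    """True if the judge prefers the same answer after the two are swapped."""
    prefers_a_original = judge(a, b) == "first"
    prefers_a_swapped = judge(b, a) == "second"
    return prefers_a_original == prefers_a_swapped
```

Raw percent agreement can look high simply because most items pass; kappa discounts that. Running `position_consistent` over a sample of pairs gives a quick estimate of how often the verdict depends on ordering alone.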

Regression Suites in CI

A bug you fixed without a test is a bug you will ship again. Every fixed failure becomes a permanent test case; the whole suite runs on every prompt and model change; the pipeline reports pass/fail and cost. This is how prompts become code.
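
A minimal sketch of that loop, assuming the agent can be called as a function that returns its output text plus the cost of the call in USD. `RegressionCase`, `run_suite`, `suite_passed`, and the case IDs are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    case_id: str                 # e.g. the ticket that reported the original bug
    prompt: str
    must_contain: str
    must_not_contain: str = ""   # empty string means "no forbidden text"


def run_suite(
    agent: Callable[[str], tuple[str, float]],
    cases: list[RegressionCase],
) -> tuple[dict[str, bool], float]:
    """Run every case; return per-case pass/fail and the total cost in USD."""
    results: dict[str, bool] = {}
    total_cost = 0.0
    for case in cases:
        output, cost_usd = agent(case.prompt)
        total_cost += cost_usd
        ok = case.must_contain in output
        if case.must_not_contain:
            ok = ok and case.must_not_contain not in output
        results[case.case_id] = ok
    return results, total_cost


def suite_passed(results: dict[str, bool]) -> bool:
    """CI gate: a single failing case fails the pipeline."""
    return all(results.values())
```

In a CI job, a wrapper script would typically print the results and total cost, then exit non-zero when `suite_passed` returns False so the prompt or model change cannot merge.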

Tracing, Structured Logging & Cost Dashboards

When an agent misbehaves in production you cannot attach a debugger to a probability distribution. You need traces: every run tagged with an ID, every LLM and tool call a span carrying tokens, cost, and latency. Then a cost regression becomes a query, not a guess.
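
To make the span idea concrete, here is a standard-library sketch rather than the API of any particular tracing product. The `Tracer` and `Span` classes and the per-token prices are assumptions made up for illustration; real prices depend on the model and provider.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015


@dataclass
class Span:
    run_id: str
    name: str
    kind: str                 # "llm" or "tool"
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0


def llm_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_1K_INPUT + output_tokens * PRICE_PER_1K_OUTPUT) / 1000


class Tracer:
    def __init__(self) -> None:
        self.run_id = uuid.uuid4().hex   # one ID shared by every span in this run
        self.spans: list[Span] = []

    @contextmanager
    def span(self, name: str, kind: str):
        s = Span(self.run_id, name, kind)
        start = time.perf_counter()
        try:
            yield s                      # caller fills in tokens and cost
        finally:
            s.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(s)         # recorded even if the call raised


def cost_by_span(spans: list[Span]) -> dict[str, float]:
    """The 'cost regression as a query' step: total cost grouped by span name."""
    totals: dict[str, float] = {}
    for s in spans:
        totals[s.name] = totals.get(s.name, 0.0) + s.cost_usd
    return totals
```

Comparing `cost_by_span` output between two releases points at the step whose cost grew, which is usually faster than rereading prompts.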

Prompt Injection, HITL & Honest Postmortems

Any text your agent reads is a potential instruction. The lethal trifecta tells you when that's catastrophic; defense in depth tells you how to survive it; human-in-the-loop gates the irreversible; and a blameless postmortem turns your worst failure into permanent institutional memory.
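
The lethal trifecta, as Simon Willison describes it, is the combination of access to private data, exposure to untrusted content, and the ability to communicate externally. A minimal sketch of checking an agent's tool set for that combination follows; the `Tool` dataclass, its flags, and the example tool names are hypothetical, and tagging each tool honestly is the hard part this sketch does not solve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    reads_private_data: bool = False
    ingests_untrusted_content: bool = False
    communicates_externally: bool = False


def has_lethal_trifecta(tools: list[Tool]) -> bool:
    """True if the tool set, taken together, covers all three legs."""
    return (
        any(t.reads_private_data for t in tools)
        and any(t.ingests_untrusted_content for t in tools)
        and any(t.communicates_externally for t in tools)
    )


# Hypothetical tool set for an email assistant.
read_inbox = Tool("read_inbox", reads_private_data=True, ingests_untrusted_content=True)
send_email = Tool("send_email", communicates_externally=True)
summarize = Tool("summarize")
```

Here `[read_inbox, send_email]` completes the trifecta while `[read_inbox, summarize]` does not: removing one leg is a structural defense, whereas an input filter only lowers the odds.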

Quiz: 12 questions · pass ≥ 80%

Lab: Lab 07 — Retrofit Everything: Tracing, Evals, HITL & a Postmortem

Go back to Labs 02–05 and make them production-legible. Add Langfuse tracing with per-call cost, build a regression eval suite for your Lab 02 agent that mixes deterministic assertions and a validated judge, run an injection battery, gate any destructive tool behind an HITL approval flow, and write one honest failure postmortem. This is the module that turns 'it demos' into 'it's trustworthy.'
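
One possible shape for the approval gate in this lab is a decorator that asks an approver before running a destructive tool and records every decision. This is a sketch under assumptions: `requires_approval`, `ApprovalDenied`, and the approver signature are hypothetical, and the in-memory list stands in for whatever append-only audit store you actually use.

```python
import functools
import time
from typing import Any, Callable


class ApprovalDenied(Exception):
    """Raised when a human (or policy) rejects an irreversible action."""


def requires_approval(
    action: str,
    approver: Callable[[str, tuple, dict], bool],
    audit_log: list[dict[str, Any]],
):
    """Wrap a tool so it runs only after the approver returns True."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            approved = bool(approver(action, args, kwargs))
            # Log the decision before acting, so denials are recorded too.
            audit_log.append({
                "ts": time.time(),
                "action": action,
                "args": repr(args),
                "kwargs": repr(kwargs),
                "approved": approved,
            })
            if not approved:
                raise ApprovalDenied(action)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

In an interactive setting the approver might prompt a person in a terminal or a chat channel; in tests it can be a plain function that returns True or False.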

Best external resources

Curated reading, docs, and tools that pair with this module.

Hamel Husain — Your AI Product Needs Evals

The essay hiring managers reference. The error-analysis workflow is the job.

Langfuse

Open-source tracing + evals; what Lab 07 wires into Labs 02 and 05.

Simon Willison — prompt injection series

Threat model, lethal trifecta, why filters aren't enough.

OWASP Top 10 for LLM Apps

Know LLM01 (injection) cold; skim the rest for vocabulary.

Eugene Yan — Evaluating LLM-Evaluators

Survey of judge techniques and their measured biases — the depth behind this module's judge table.

Gandalf (Lakera)

A game: extract a password from a defended LLM, level by level. The fastest way to build injection intuition.