Module 7 · Phase 4: Production readiness · Weeks 18–20

Evals, Observability & Safety

The #1 senior differentiator. Anyone can demo an agent; senior engineers can prove it works, see why it fails, and stop it from doing damage. This module covers eval harnesses, tracing, cost dashboards, defense in depth against prompt injection, human-in-the-loop gates, and honest postmortems.

After this module you can
  • Design an eval pyramid — deterministic assertions, validated LLM-as-judge, sampled human review — for a real agent
  • Validate an LLM judge against human labels and report agreement before trusting it
  • Build a regression suite that turns every fixed bug into a CI test case
  • Instrument an agent with tracing so every LLM and tool call carries tokens, cost, and latency
  • Diagnose a cost regression and a quality regression from traces alone
  • Reason about prompt injection with the lethal-trifecta lens and layer real defenses
  • Add a human-in-the-loop approval gate to any irreversible action, with an audit log
  • Write a blameless postmortem: timeline, root cause, detection gap, fix, regression test
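
To make the judge-validation objective concrete, here is a minimal sketch of "report agreement before trusting it". It assumes you already have paired pass/fail labels from a human reviewer and from the judge on the same examples. The function names and the sample labels are illustrative, not part of the module's materials. It reports raw agreement and Cohen's kappa, which corrects raw agreement for the matches two raters would produce by chance.

```python
from collections import Counter


def raw_agreement(human, judge):
    """Fraction of examples where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(human)
    p_observed = raw_agreement(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Agreement expected if both raters labeled independently at their own base rates.
    p_expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative labels only; real validation sets should be much larger.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

print(f"raw agreement: {raw_agreement(human_labels, judge_labels):.2f}")
print(f"Cohen's kappa: {cohens_kappa(human_labels, judge_labels):.2f}")
```

Raw agreement alone can flatter a judge when one label dominates, which is why the sketch reports both numbers. What kappa counts as "good enough" is a decision for your team and your task, not something this sketch decides.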

Lessons

1
The Eval Pyramid
You cannot improve what you cannot measure, and 'it looked good when I tried it' is not measurement. Three tiers of rigor — cheap deterministic assertions, validated LLM-as-judge, sampled human review — and how to decide what belongs in each.
28 min
2
LLM-as-Judge, Done Honestly
When correctness is subjective — faithfulness, helpfulness, tone — you reach for a model to grade a model. That's fine, but an unvalidated judge is a random number generator with a rubric. How to validate it against humans, and how to beat position, verbosity, and self-preference bias.
30 min
3
Regression Suites in CI
A bug you fixed without a test is a bug you will ship again. Every fixed failure becomes a permanent test case; the whole suite runs on every prompt and model change; the pipeline reports pass/fail and cost. This is how prompts become code.
26 min
4
Tracing, Structured Logging & Cost Dashboards
When an agent misbehaves in production you cannot attach a debugger to a probability distribution. You need traces: every run tagged with an ID, every LLM and tool call a span carrying tokens, cost, and latency. Then a cost regression becomes a query, not a guess.
27 min
5
Prompt Injection, HITL & Honest Postmortems
Any text your agent reads is a potential instruction. The lethal trifecta (access to private data, exposure to untrusted content, and the ability to communicate externally) tells you when that's catastrophic; defense in depth tells you how to survive it; a human-in-the-loop gate guards every irreversible action; and a blameless postmortem turns your worst failure into permanent institutional memory.
30 min
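
As a rough illustration of the tracing idea in lesson 4, the sketch below tags a run with an ID and records each LLM or tool call as a span carrying tokens, cost, and latency. Everything here is an assumption made for illustration: the `Trace` and `Span` classes, the span names, and the per-token prices are hypothetical and do not come from any tracing library. A production setup would more likely use an existing tracing SDK.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens; real prices vary by model and provider.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


@dataclass
class Span:
    run_id: str
    name: str
    kind: str  # "llm" or "tool"
    input_tokens: int = 0
    output_tokens: int = 0
    latency_s: float = 0.0

    @property
    def cost_usd(self):
        return (self.input_tokens * PRICE_PER_1K["input"]
                + self.output_tokens * PRICE_PER_1K["output"]) / 1000


@dataclass
class Trace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, kind):
        """Time the wrapped call and record it as a span, even if it raises."""
        s = Span(self.run_id, name, kind)
        start = time.perf_counter()
        try:
            yield s
        finally:
            s.latency_s = time.perf_counter() - start
            self.spans.append(s)

    def cost_by_span(self):
        """Total cost per span name: the 'query' behind a cost dashboard."""
        totals = {}
        for s in self.spans:
            totals[s.name] = totals.get(s.name, 0.0) + s.cost_usd
        return totals


# Simulated run: token counts are set by hand where a real client would report them.
trace = Trace()
with trace.span("plan", "llm") as s:
    s.input_tokens, s.output_tokens = 1200, 300
with trace.span("search_docs", "tool"):
    pass  # tool calls carry latency but no token cost in this sketch
with trace.span("answer", "llm") as s:
    s.input_tokens, s.output_tokens = 4000, 500

costs = trace.cost_by_span()
print(trace.run_id, max(costs, key=costs.get), round(sum(costs.values()), 4))
```

With spans stored this way, finding where a cost regression came from reduces to grouping by span name and comparing totals before and after a prompt or model change.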

Best external resources

Curated reading, docs, and tools that pair with this module.