1. How should you structure the eval pyramid for a customer-support agent?
2. Why must an LLM judge be validated before you trust it, and how concretely?
3. Which set correctly names three LLM-judge biases and a valid mitigation for each?
4. Why track both task success rate and per-step correctness?
5. Explain the lethal trifecta and apply it to an email-assistant agent.
6. Why is 'just filter the input for injection phrases' insufficient, and what's the layered alternative?
7. What makes a good human-in-the-loop (HITL) approval UX: what information does the approver need in order to decide within ten seconds?
8. Your agent's cost doubled week-over-week with flat traffic. Using traces, how do you diagnose it?
9. What belongs in a prompt-change CI pipeline?
10. In an HITL gate, what should happen when an approval request times out?
11. What is the correct discipline for regression testing an agent?
12. What are the essential elements of a blameless postmortem for an agent failure?