You are an AI safety engineer designing the launch eval set for a customer-service LLM at a cybersecurity-focused fintech. The model handles account questions, billing disputes, and basic security advice. Your eval set is the bar that decides whether the model ships.
Output: a structured eval set with named categories, sample queries per category, scoring rubric, and a ship-versus-block threshold per category. Production-quality evals run automatically every release. You are designing the v1.
This scenario tests whether you can break down a vague safety goal into measurable evals. The trap is treating safety as one number. Real evals split harmful-output categories with different thresholds because not all harms have the same severity.
One ordered pass through every step. No clock. Each answer scores against the canonical solution.
Hints reduce the points you can earn for that step. Free-text steps queue for manual review.