AI Eval Set Design: Catch Harmful Outputs Before Production

Cybersecurity for AI · 8 steps

Briefing

You are an AI safety engineer designing the launch eval set for a customer-service LLM at a cybersecurity-focused fintech. The model handles account questions, billing disputes, and basic security advice. Your eval set is the bar that decides whether the model ships.

Output: a structured eval set with named categories, sample queries per category, scoring rubric, and a ship-versus-block threshold per category. Production-quality evals run automatically every release. You are designing the v1.

This scenario tests whether you can break down a vague safety goal into measurable evals. The trap is treating safety as one number. Real evals split harmful-output categories with different thresholds because not all harms have the same severity.

How Crucible mode works

One ordered pass through every step. No clock. Each answer scores against the canonical solution.

Hints reduce the points you can earn for that step. Free-text steps queue for manual review.

What you will practice

01Decompose a safety goal into named eval categories
02Write sample queries that test each category
03Design a scoring rubric that survives reviewer disagreement
04Set ship-versus-block thresholds per category, not in aggregate

Back to Range