Range Scenario · crucible · 50 min
AI Eval Set Design: Catch Harmful Outputs Before Production
This cybersecurity training scenario simulates a working incident. Design an eval set for a customer-service cybersecurity-aware LLM that catches harmful outputs before they ship. Specify categories, sample queries, scoring rubric, and the bar for ship versus block.
Scenario briefing
You are an AI safety engineer designing the launch eval set for a customer-service LLM at a cybersecurity-focused fintech. The model handles account questions, billing disputes, and basic security advice. Your eval set is the bar that decides whether the model ships.
Output: a structured eval set with named categories, sample queries per category, scoring rubric, and a ship-versus-block threshold per category. Production-quality evals run automatically every release. You are designing the v1.
This scenario tests whether you can break down a vague safety goal into measurable evals. The trap is treating safety as one number. Real evals split harmful-output categories with different thresholds because not all harms have the same severity.
What you will practice
- Decompose a safety goal into named eval categories
- Write sample queries that test each category
- Design a scoring rubric that survives reviewer disagreement
- Set ship-versus-block thresholds per category, not in aggregate
How this scenario is scored
The scenario has 8 ordered steps. Most steps are exact-match (a MITRE ATT&CK technique ID, a tool name, or a yes/no decision) or multiple choice. Free-text steps queue for manual review and do not affect the auto-final-score in the MVP.
Each step has a max score of 100 points. Hints deduct points up front, listed before you reveal them. Your final score is the sum across steps. Range Elo updates on completion based on scenario difficulty (Intermediate) and your final score percentage.
Frequently asked questions
Why split eval categories instead of one safety score?
Different harms have different severity. A factual error on shipping cost is annoying. A factual error on legal advice is dangerous. A privacy leak is a regulated incident. Per-category scoring with per-category thresholds matches policy to severity. Aggregate scores hide failure modes that block ship.
How many sample queries per category do you need?
Minimum 30 to 50 per category for stable scoring against a single model. More if the category covers diverse intents. Less is statistically noisy. Most teams target 100 to 500 per category at maturity. v1 ships at 30 to 50 to get baseline numbers, then grows weekly.
Who scores the eval responses?
v1: human reviewers with a written rubric, two reviewers per response, disagreement flagged for adjudication. v2: model-graded eval calibrated against human-scored gold set. Model-graded is faster but drifts. Maintain a human-scored sample every release to detect grader drift.
Course content is for educational purposes only and does not constitute professional advice. All claims are supported by cited peer-reviewed academic research. DecipherU does not teach or reproduce any proprietary sales methodology. Verify all referenced sources independently.
Get cybersecurity career insights delivered weekly
Join cybersecurity professionals receiving weekly intelligence on threats, job market trends, salary data, and career growth strategies.
By subscribing you agree to our privacy policy. Unsubscribe anytime.