Range Scenario · crucible · 50 min

AI Eval Set Design: Catch Harmful Outputs Before Production

This cybersecurity training scenario simulates a working incident. Design an eval set for a customer-service cybersecurity-aware LLM that catches harmful outputs before they ship. Specify categories, sample queries, scoring rubric, and the bar for ship versus block.

Intermediate·Cybersecurity for AI·8 steps·Last verified April 2026

Start cybersecurity scenario Browse all scenarios

Scenario briefing

You are an AI safety engineer designing the launch eval set for a customer-service LLM at a cybersecurity-focused fintech. The model handles account questions, billing disputes, and basic security advice. Your eval set is the bar that decides whether the model ships.

Output: a structured eval set with named categories, sample queries per category, scoring rubric, and a ship-versus-block threshold per category. Production-quality evals run automatically every release. You are designing the v1.

This scenario tests whether you can break down a vague safety goal into measurable evals. The trap is treating safety as one number. Real evals split harmful-output categories with different thresholds because not all harms have the same severity.

What you will practice

Decompose a safety goal into named eval categories
Write sample queries that test each category
Design a scoring rubric that survives reviewer disagreement
Set ship-versus-block thresholds per category, not in aggregate

How this scenario is scored

The scenario has 8 ordered steps. Most steps are exact-match (a MITRE ATT&CK technique ID, a tool name, or a yes/no decision) or multiple choice. Free-text steps queue for manual review and do not affect the auto-final-score in the MVP.

Each step has a max score of 100 points. Hints deduct points up front, listed before you reveal them. Your final score is the sum across steps. Range Elo updates on completion based on scenario difficulty (Intermediate) and your final score percentage.

Frequently asked questions

Why split eval categories instead of one safety score?

Different harms have different severity. A factual error on shipping cost is annoying. A factual error on legal advice is dangerous. A privacy leak is a regulated incident. Per-category scoring with per-category thresholds matches policy to severity. Aggregate scores hide failure modes that block ship.

How many sample queries per category do you need?

Minimum 30 to 50 per category for stable scoring against a single model. More if the category covers diverse intents. Less is statistically noisy. Most teams target 100 to 500 per category at maturity. v1 ships at 30 to 50 to get baseline numbers, then grows weekly.

Who scores the eval responses?

v1: human reviewers with a written rubric, two reviewers per response, disagreement flagged for adjudication. v2: model-graded eval calibrated against human-scored gold set. Model-graded is faster but drifts. Maintain a human-scored sample every release to detect grader drift.

Course content is for educational purposes only and does not constitute professional advice. All claims are supported by cited peer-reviewed academic research. DecipherU does not teach or reproduce any proprietary sales methodology. Verify all referenced sources independently.

Last verified: 2026-04-26?Report an inaccuracy

Get cybersecurity career insights delivered weekly

Join cybersecurity professionals receiving weekly intelligence on threats, job market trends, salary data, and career growth strategies.

By subscribing you agree to our privacy policy. Unsubscribe anytime.

AI Eval Set Design: Catch Harmful Outputs Before Production

Intermediate·Cybersecurity for AI·8 steps·Last verified April 2026

Scenario briefing

How this scenario is scored

Frequently asked questions