You run cybersecurity quality at a 3,000-person SOC. An overnight analyst closed 14 EDR alerts as benign in a 90-minute span. The closure note for each said: 'AI verdict: benign, accepted.' This morning, three of those 14 closed alerts turned out to be part of a real BEC-to-data-exfiltration incident.
Investigation shows the LLM verdict tool returned 'likely benign' for all 14, and the analyst performed no manual verification on any of them. The team needs a policy and tooling fix this sprint.
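As a rough illustration of what a tooling fix could look like, the sketch below queues closures for second review when they carry no manual evidence or when one analyst closes an unusually large batch inside a time window. It is not the canonical solution. The `Closure` record, its `evidence` field, the function names, and the limit of 10 closures are all hypothetical; only the 90-minute window comes from the scenario.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Closure:
    """Hypothetical record of one alert closure (not a real EDR schema)."""
    alert_id: str
    analyst: str
    closed_at: datetime
    note: str
    # Assumed field: references to manual checks (queries run, hosts inspected).
    evidence: list = field(default_factory=list)


def lacks_manual_evidence(closure):
    """True when the closure cites no manual verification at all."""
    return len(closure.evidence) == 0


def burst_analysts(closures, window=timedelta(minutes=90), limit=10):
    """Return analysts who closed more than `limit` alerts within any `window`.

    The default limit of 10 is an assumed threshold, chosen for illustration.
    """
    times_by_analyst = {}
    for c in closures:
        times_by_analyst.setdefault(c.analyst, []).append(c.closed_at)

    flagged = set()
    for analyst, times in times_by_analyst.items():
        times.sort()
        start = 0
        # Sliding window over the sorted closure times.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 > limit:
                flagged.add(analyst)
                break
    return flagged


def review_queue(closures, window=timedelta(minutes=90), limit=10):
    """Alert IDs needing second review: no evidence, or closed in a burst."""
    bursty = burst_analysts(closures, window, limit)
    return [
        c.alert_id
        for c in closures
        if lacks_manual_evidence(c) or c.analyst in bursty
    ]
```

Run against the overnight shift described above, both rules would fire: every closure lacks evidence, and 14 closures in 90 minutes exceeds the assumed limit. A real deployment would need thresholds tuned to the SOC's normal closure rates.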
This scenario tests overreliance on LLM output (covered under OWASP LLM09:2025 Misinformation; the 2023 list named LLM09 'Overreliance'), the human factors involved, and the policy and tooling that prevent the failure. Sources: OWASP Top 10 for LLM Applications (2025), NIST AI RMF, FAA human factors literature on automation complacency.
One ordered pass through every step. No clock. Each answer scores against the canonical solution.
Hints reduce the points you can earn for that step. Free-text steps queue for manual review.