You are a cybersecurity engineer reviewing prompt-injection alerts on a customer-service LLM. The system's base prompt constrains the model to topics such as account questions, billing, and shipping, and an output filter scans its responses for sensitive data and disallowed actions.
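The guardrail stack is not specified beyond that; as a rough, hypothetical sketch of the kind of check the output filter performs (every name below is illustrative, not the production system):

```python
# Illustrative only: a toy output-filter check, not the real guardrail code.
ALLOWED_TOPICS = {"account", "billing", "shipping"}  # enforced by the base prompt

def output_filter_allows(response: str) -> bool:
    """Return True if the response looks safe to send: no sensitive data, no disallowed actions."""
    sensitive_markers = ["api_key", "ssn", "internal_note"]      # assumed markers, for illustration
    disallowed_actions = ["issue_refund_without_ticket"]         # assumed action name, for illustration
    lowered = response.lower()
    return not any(term in lowered for term in sensitive_markers + disallowed_actions)
```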
Eight user inputs hit the input-filter alert queue this hour. Your job: classify each as benign, a jailbreak attempt, a data-exfiltration attempt, or a role-confusion attack, and explain the reasoning. Misclassified alerts feed the input filter the wrong lessons, so accuracy matters more than speed.
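The four labels form a small closed taxonomy. A minimal sketch of how one triage decision might be recorded (the class names, fields, and example values are assumptions for illustration, not the actual alert-queue schema):

```python
from dataclasses import dataclass
from enum import Enum

class TriageLabel(Enum):
    BENIGN = "benign"
    JAILBREAK = "jailbreak"            # tries to override safety or topic constraints
    DATA_EXFIL = "data_exfil"          # tries to pull sensitive data out through the model
    ROLE_CONFUSION = "role_confusion"  # tries to make the model adopt a different role or authority

@dataclass
class TriageDecision:
    alert_id: str
    label: TriageLabel
    reasoning: str  # free-text justification, reviewed manually

# Example decision for one of the eight queued alerts (illustrative values):
decision = TriageDecision(
    alert_id="alert-001",
    label=TriageLabel.ROLE_CONFUSION,
    reasoning="Input claims to be a system message instructing the model to ignore its base prompt.",
)
```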
This scenario exercises a foundational skill for AI security. The same classifications apply across customer-service, coding, and tool-using LLMs, and the technique class names (jailbreak, role confusion, data exfiltration) carry over to vendor red-team frameworks and AI security policy.
One ordered pass through every step. No clock. Each answer is scored against the canonical solution.
Hints reduce the points you can earn for that step. Free-text steps queue for manual review.