You are a cybersecurity engineer reviewing prompt-injection alerts on a customer-service LLM. The system's base prompt constrains the model to topics such as account questions, billing, and shipping, and an output filter scans its responses for sensitive data and disallowed actions.
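The guardrail stack is not specified beyond that; as a rough, hypothetical sketch of the kind of check the output filter performs (every name below is illustrative, not the production system):

```python
# Illustrative only: a toy output-filter check, not the real guardrail code.
ALLOWED_TOPICS = {"account", "billing", "shipping"}  # enforced by the base prompt

def output_filter_allows(response: str) -> bool:
    """Return True if the response looks safe to send: no sensitive data, no disallowed actions."""
    sensitive_markers = ["api_key", "ssn", "internal_note"]      # assumed markers, for illustration
    disallowed_actions = ["issue_refund_without_ticket"]         # assumed action name, for illustration
    lowered = response.lower()
    return not any(term in lowered for term in sensitive_markers + disallowed_actions)
```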
Eight user inputs hit the input-filter alert queue this hour. Your job: classify each as benign, a jailbreak attempt, a data-exfiltration attempt, or a role-confusion attack, and explain the reasoning. Misclassified alerts feed the input filter the wrong lessons, so accuracy matters more than speed.
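The four labels form a small closed taxonomy. A minimal sketch of how one triage decision might be recorded (the class names, fields, and example values are assumptions for illustration, not the actual alert-queue schema):

```python
from dataclasses import dataclass
from enum import Enum

class TriageLabel(Enum):
    BENIGN = "benign"
    JAILBREAK = "jailbreak"            # tries to override safety or topic constraints
    DATA_EXFIL = "data_exfil"          # tries to pull sensitive data out through the model
    ROLE_CONFUSION = "role_confusion"  # tries to make the model adopt a different role or authority

@dataclass
class TriageDecision:
    alert_id: str
    label: TriageLabel
    reasoning: str  # free-text justification, reviewed manually

# Example decision for one of the eight queued alerts (illustrative values):
decision = TriageDecision(
    alert_id="alert-001",
    label=TriageLabel.ROLE_CONFUSION,
    reasoning="Input claims to be a system message instructing the model to ignore its base prompt.",
)
```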
This scenario exercises a foundational skill for AI security. The same classifications apply across customer-service, coding, and tool-using LLMs, and the technique class names (jailbreak, role confusion, data exfiltration) carry over to vendor red-team frameworks and AI security policy.
One ordered pass through every step. No clock. Each answer is scored against the canonical solution.
Hints reduce the points you can earn for that step. Free-text steps queue for manual review.