WIKICROOK

Jailbreak

A prompt technique intended to bypass an AI system’s safety restrictions.

A jailbreak is a prompt, or a sequence of prompts, crafted to get an AI system to produce outputs it would normally refuse. Attackers may use role-play, instruction overrides, prompt injection, or multi-step coercion to make the model ignore its policy, reveal hidden instructions, or generate disallowed content. In some cases the goal is not just an unsafe answer but access to the system prompt, connected tools, or other capabilities the model is meant to keep restricted.

In cybersecurity, jailbreaks matter because AI assistants are increasingly used for code review, incident response, phishing analysis, and infrastructure support. An attacker who jailbreaks such a model may be able to extract sensitive data, weaken guardrails, or turn a helpful system into a tool that assists with malware, social engineering, or reconnaissance. Defenders respond with layered controls such as input classifiers, output filtering, tool permissions, rate limits, and sandboxing, because no single control is sufficient.
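The layered approach described above can be sketched as a chain of independent checks around a model call. This is a minimal illustration, not a production design: the names `guarded_call`, `RateLimiter`, `ALLOWED_TOOLS`, `SECRET_MARKER`, and the regular expressions are all hypothetical, the regex list stands in for a trained input classifier, and sandboxing is left out because it happens at the execution-environment level rather than in application code.

```python
import re
import time
from collections import defaultdict, deque

# Hypothetical patterns; a real input classifier would be a trained model.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]
ALLOWED_TOOLS = {"search_docs", "summarize_log"}  # hypothetical tool names
SECRET_MARKER = "INTERNAL-ONLY"  # hypothetical tag on sensitive text


class RateLimiter:
    """Sliding-window limiter: at most max_requests per window per user."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        recent = self._history[user_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


def input_flagged(prompt):
    """Layer: input check (stand-in for an input classifier)."""
    return any(p.search(prompt) for p in SUSPICIOUS_PATTERNS)


def filter_output(text):
    """Layer: output filtering; withhold text carrying the secret marker."""
    return "[redacted]" if SECRET_MARKER in text else text


def guarded_call(user_id, prompt, model, limiter, requested_tool=None):
    """Run each layer in turn; any single layer can stop the request."""
    if not limiter.allow(user_id):
        return "rate_limited"
    if input_flagged(prompt):
        return "blocked_input"
    if requested_tool is not None and requested_tool not in ALLOWED_TOOLS:
        return "blocked_tool"
    return filter_output(model(prompt))
```

Each layer is deliberately simple and fails closed. A prompt that slips past the input check can still be stopped by the tool allowlist or the output filter, which is the point of layering.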
