Jailbreaking

Efforts to push a model past its built-in safety boundaries so it produces restricted output.

Jailbreaking is the attempt to push an AI model past its built-in safety boundaries so it produces content it would normally refuse. Attackers do this by reframing requests, adding roleplay, splitting a harmful goal into smaller steps, or iterating across multiple turns until the model relaxes its guardrails.

Jailbreaking matters because a model that looks safe against a single prompt may still fail under persistence. In real attacks, jailbreaking can expose unsafe instructions, disallowed content, or policy-bypassing behavior, and in tool-using systems it may even lead to risky actions through connected apps or APIs. Defenders test for jailbreaking with adversarial prompts, multi-turn conversations, and refusal-reframing checks, which re-ask a refused request in a different framing to see whether the refusal holds. Strong defenses combine robust alignment, input filtering, rate limits, logging, and least-privilege controls around any tools the model can access.
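As a rough illustration of the rate limits, logging, and least-privilege controls mentioned above, here is a minimal Python sketch of a gate that sits between a model and its tools. Everything in it is an assumption made for illustration: the class name ToolGate, the tool names, and the limits are hypothetical and do not come from any particular product or library. The idea is that even if a jailbreak succeeds at the prompt level, the model can only reach tools on an explicit allowlist, only a limited number of times per time window, and every decision is logged.

```python
import logging
import time
from collections import deque

logger = logging.getLogger("tool_gate")


class ToolGate:
    """Hypothetical wrapper that mediates every tool call a model requests."""

    def __init__(self, allowed_tools, max_calls, window_seconds, clock=time.monotonic):
        # Least privilege: only tools named here can ever be invoked.
        self.allowed_tools = frozenset(allowed_tools)
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self._calls = deque()  # timestamps of recently allowed calls

    def authorize(self, tool_name):
        """Return True if the call may proceed; log the decision either way."""
        now = self.clock()
        # Forget allowed calls that have aged out of the sliding window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()

        if tool_name not in self.allowed_tools:
            logger.warning("denied: tool %r is not on the allowlist", tool_name)
            return False
        if len(self._calls) >= self.max_calls:
            logger.warning("denied: rate limit reached, tool %r blocked", tool_name)
            return False

        self._calls.append(now)
        logger.info("allowed: tool %r", tool_name)
        return True
```

A caller would construct the gate once, for example ToolGate({"search_docs"}, max_calls=5, window_seconds=60), and check authorize(name) before executing any tool the model asks for. This sketch only covers the tool boundary; it does not replace alignment or input filtering, and a real deployment would likely also scope permissions per user and per tool argument.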

Netcrook
