Multi-turn attack

An adversarial method that uses several conversation rounds to gradually bypass AI safety controls.

A multi-turn attack is an adversarial technique that uses several conversation rounds to slowly bypass an AI system’s safety controls. Instead of trying to make the model fail with one prompt, the attacker adapts after each refusal, partial answer, or hint. They may rephrase the request, split it into smaller pieces, or probe for weaknesses over time.

This matters because many AI products are interactive: chatbots, copilots, and agent systems respond across an ongoing session, not a single input. A model may look safe in one-shot testing but still be vulnerable when the attacker persists. In real attacks, multi-turn methods are often used for prompt injection, jailbreaking, and tool-use abuse. Defenders test for this by running iterative red-team sessions, checking how the model handles escalating requests, and limiting risky actions with least privilege and human approval.

Netcrook

Multi-turn attack

Related articles