Agent hijacking

A form of attack where a tool-using AI system is manipulated into taking unintended actions.

Agent hijacking is an attack against a tool-using AI system, or “agent,” that tricks it into taking actions the operator did not intend. Unlike simple prompt abuse, hijacking targets systems that can call APIs, browse the web, read files, send messages, or run workflows. If an attacker can influence the agent’s instructions, context, or tool outputs, they may redirect it to leak data, approve a malicious request, or perform unsafe operations.

This matters because agents can turn a small prompt manipulation into a real-world action. A hidden instruction in a webpage, document, email, or ticket is a form of indirect prompt injection: the agent reads the content while doing its task and may follow attacker-controlled goals instead of the user’s request. Defenses include least-privilege tool access, human approval for sensitive actions, strict input filtering, sandboxing, output validation, and logs that make unintended tool use visible. Security teams test for hijacking during red-teaming because it reveals whether an agent can be steered beyond its intended permissions.
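Three of the defenses listed above (least-privilege tool access, human approval for sensitive actions, and logging) can be combined in a single check that runs before any tool call. The following is a minimal sketch, not a reference implementation: the tool names, the `ToolGate` class, and the `approve` callback are all hypothetical and stand in for whatever an actual agent framework provides.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool names, chosen for illustration only.
ALLOWED_TOOLS = {"search_docs", "read_ticket", "send_email"}  # least privilege
SENSITIVE_TOOLS = {"send_email"}  # these require human sign-off


@dataclass
class ToolGate:
    """Decides whether a tool call requested by the agent may run."""

    # Assumed callback: returns True only if a human approves the call.
    approve: Callable[[str, dict], bool]
    # Every decision is recorded so unintended tool use stays visible.
    audit_log: list = field(default_factory=list)

    def check(self, tool: str, args: dict) -> bool:
        if tool not in ALLOWED_TOOLS:
            decision = "denied: not allowlisted"
        elif tool in SENSITIVE_TOOLS and not self.approve(tool, args):
            decision = "denied: approval refused"
        else:
            decision = "allowed"
        self.audit_log.append((tool, decision))
        return decision == "allowed"


# Usage: a reviewer who rejects every sensitive request.
gate = ToolGate(approve=lambda tool, args: False)
results = [
    gate.check("read_ticket", {"id": 7}),              # allowlisted, not sensitive
    gate.check("send_email", {"to": "a@example.com"}),  # sensitive, approval refused
    gate.check("delete_repo", {}),                      # never allowlisted
]
```

The gate sits outside the model, so an injected instruction cannot talk its way past it: even if the agent is persuaded to request `delete_repo`, the call is refused and the attempt appears in `audit_log`.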
