Agentjacking

An attack pattern that tries to steer autonomous agents into taking harmful actions through trusted workflows.

Agentjacking is an attack pattern that steers autonomous agents into harmful actions by feeding them malicious instructions through trusted-looking workflows. Instead of attacking the model directly, the attacker abuses the data the agent already expects to read, such as bug reports, logs, telemetry, or issue comments. If the agent treats that content as operational guidance, it may follow the attacker’s instructions.

This matters because modern AI coding agents often have tool access: they can edit files, install packages, open tickets, or run shell commands. In that setting, prompt injection becomes workflow hijacking, and a fake error report can become a path to code execution or unwanted system changes. Defenses rely on strong trust boundaries: treat external text as untrusted data, require human approval for high-impact actions, use least privilege, sandbox execution, and restrict tools with allowlists. Agentjacking is a reminder that the risk is not only what the agent knows, but what it is allowed to do with untrusted input.

Netcrook

Agentjacking

Related articles