Indirect Prompt Injection Enabling Data Exfiltration
An LLM-powered document assistant manipulated via malicious content embedded in uploaded documents to exfiltrate conversation history.
The product
An enterprise document assistant: users upload PDFs, ask questions, receive answers. The model had access to the user's conversation history and could reference prior sessions. The system prompt included internal policy data and was marked confidential.
The attack
A malicious PDF was crafted containing visible, legitimate content on the first page and invisible instruction text embedded in white-on-white formatting later in the document. When the PDF was processed, the full text - including the hidden instructions - was extracted and injected into the model's context window as document content.
The injected instructions directed the model to: 1. Retrieve and summarise the system prompt 2. Retrieve the last 10 conversation turns for the current user session 3. Encode the output as a URL parameter and embed it in a markdown image tag
The model complied. The chat interface rendered the markdown image, which triggered an HTTP request to an attacker-controlled server with the encoded payload in the URL parameter. The conversation history and system prompt arrived on the attacker's server within 3 seconds of the user asking a question about the malicious document.
Why the controls failed
**PDF text extraction treated all text as equivalent.** The extraction pipeline made no distinction between visible and hidden content. OCR-based extraction would have had the same behaviour. The document's text was text, and it was injected into the model context as document content.
**No instruction-pattern detection on retrieved content.** The pipeline had input filtering on the user turn - it scanned for prompt injection patterns in direct user messages. It had nothing equivalent for document content, which was treated as trusted data rather than potentially hostile input.
**Markdown rendering in the chat interface.** The interface rendered markdown in model responses including image tags. This is the exfiltration channel. A model that can emit a markdown image tag can initiate an outbound HTTP request to any URL - as long as the interface renders markdown and the attacker controls the URL.
What the fix looks like
Three independent controls, each of which breaks the chain:
Strip markdown rendering from model responses in contexts where the model processes untrusted content. An image tag that doesn't render can't exfiltrate. This is the fastest fix to deploy and the most reliable.
Apply the same instruction-pattern detection to document content as to user input. This is imperfect - instruction detection has false positives and can be evaded - but it raises the cost of the attack.
Architectural separation: use a different model instance to process document content than to generate user-facing responses. The processing model extracts information; the generation model answers questions. The generation model cannot be instructed by the processing model's output, only informed by it.