The Failure of Mock Tool Calls: Why LLMs Need Native Primitives for Untrusted Data

As agentic workflows and LLM-as-a-Judge patterns scale, securing untrusted inputs remains a critical vulnerability. A recent study highlighted on lessw-blog demonstrates that wrapping untrusted data in mock tool calls fails to provide reliable sandboxing, occasionally degrading model performance. This finding underscores the urgent industry need for architectural-level primitives rather than fragile, API-level prompt engineering patches.

The Allure of the Tool Call Hack

Language models are fundamentally constrained by their architecture, which processes instructions and data through the same sequential channel. When building applications that rely on Large Language Models (LLMs), developers frequently use string formatting to inject user inputs into prompt templates. This practice creates a massive attack surface for prompt injection. Recent research slated for the AI4Good workshop at ICML highlights just how fragile these templates are. Prior work by Zhao et al. (2025) demonstrated that even innocuous tokens like a colon or the word 'Solution' can bypass LLM-as-a-Judge grading systems.

To combat this, developers have experimented with various prompt engineering techniques, such as XML delimiters or explicit prose instructions commanding the model to treat certain text as untrusted. However, these remain superficial patches. The study highlighted on lessw-blog explores a more structural hypothesis: using mock tool calls to quarantine untrusted inputs. The logic is grounded in OpenAI's own model specification, which explicitly designates tool results as the least trusted kind of input within its Instruction Hierarchy. If a developer could wrap untrusted user data as a simulated tool response, the assumption was that the model's internal weighting would naturally deprioritize the payload, effectively sandboxing it from the core system instructions.

Empirical Realities: When Workarounds Degrade Performance

The appeal of the mock tool call strategy lies in its potential to utilize existing API structures without requiring fundamental changes to the underlying model architecture. It offered a theoretical quick fix for a complex security challenge. However, the empirical evaluation across three distinct tasks revealed that this API-level workaround fails to deliver reliable security.

Wrapping untrusted prompt inputs in mock tool calls did not broadly improve model robustness against prompt injection. More concerningly, the research indicates that in certain scenarios, this formatting actually degraded model performance and worsened overall robustness. While the exact mechanism of this degradation requires further study, it suggests that forcing data into unnatural API schemas introduces significant cognitive overhead for the model. The attention mechanisms within the LLM may become fragmented when processing a tool call that was never explicitly invoked, leading to a breakdown in task execution. This finding effectively rules out tool-call wrapping as a viable, standardized method for sectioning off untrusted content, proving that developers cannot simply trick a model into enforcing security boundaries using its own API features.

Implications for Production: The Vulnerability of LLM-as-a-Judge

The failure of this mitigation strategy carries severe implications for the current trajectory of enterprise AI deployment. The industry is rapidly moving toward agentic workflows and LLM-as-a-Judge architectures, where models autonomously evaluate outputs, filter spam, or moderate content. In these production environments, language models must continuously process untrusted input generated by humans or other AI systems.

If standard string templates are easily bypassed by simple tokens, and structural workarounds like mock tool calls degrade performance, engineering teams are left without a reliable defense mechanism. The LLM-as-a-Judge pattern is particularly vulnerable. Continuous Integration and Continuous Deployment (CI/CD) pipelines for generative AI rely heavily on automated model evaluation. If the untrusted data being evaluated can hijack the judge model, the entire validation pipeline is compromised. This creates a scenario where malicious or malformed inputs can silently pass through security filters, leading to downstream failures or data exfiltration in agentic systems. The reliance on fragile prompt engineering patches-rather than robust, standardized security protocols-represents a systemic risk for organizations scaling LLM applications in production.

Limitations and Open Questions in the Current Research

While the study provides a critical negative result, several limitations and open questions remain regarding the broader applicability of these findings. The research evaluated the mock tool call strategy across three specific tasks, but the exact nature of these tasks and their corresponding datasets are not fully detailed in the preliminary brief. Consequently, it is difficult to determine if the performance degradation is universally applicable or highly dependent on the specific reasoning requirements of the chosen benchmarks.

Furthermore, the study's premise is heavily anchored in OpenAI's specific implementation of the Instruction Hierarchy. The mechanics of how different models enforce this hierarchy internally remain opaque. It is entirely unknown how other major LLM providers-such as Anthropic, Google, or Meta-prioritize or weigh tool call outputs within their proprietary system prompts. A mitigation strategy that fails on an OpenAI model might interact differently with Claude's XML-heavy training or Gemini's native tool-use architecture. The lack of cross-provider evaluation leaves a gap in understanding whether the failure is a universal architectural flaw or an artifact of a specific model's alignment training.

Synthesis: The Urgent Need for Native Security Primitives

The inability to secure untrusted inputs via mock tool calls underscores a fundamental reality: prompt engineering cannot solve architectural vulnerabilities. The current state of LLM security is analogous to the early days of web development before the widespread adoption of parameterized queries for SQL injection. Developers are attempting to sanitize inputs using the equivalent of string escaping, which is inherently flawed when the execution engine cannot strictly differentiate between a command and a payload. Moving forward, the industry must pivot away from API-level hacks and demand native, architectural-level primitives for untrusted data handling. Until model providers implement strict, verifiable separation between system instructions and user data at the inference level, securing agentic workflows will remain an unsolved, high-risk challenge.

Key Takeaways

Wrapping untrusted inputs in mock tool calls fails to provide reliable security against prompt injection and can actively degrade model performance.
Current prompt engineering techniques, such as delimiters and prose instructions, function as fragile patches rather than robust security standards.
The vulnerability of LLM-as-a-Judge systems to simple token bypasses poses a systemic risk to automated CI/CD pipelines in generative AI.
The industry urgently requires native, architectural-level primitives for data separation, similar to parameterized queries in traditional databases.