PSEEDR

The Fragility of AI Auditing: Basis-Shifting Attacks and the Open-Weight Threat Model

How output-preserving weight transformations defeat Introspection Adapters and expose a threat-model mismatch in AI safety research.

PSEEDR Editorial

Researchers have demonstrated a critical vulnerability in Introspection Adapters: an adversary with direct access to model weights can bypass the safety audit using a cheap, output-preserving transform. As detailed in a recent analysis on lessw-blog, the attack highlights a systemic risk of open-weight model distribution and the need to state the threat model explicitly in post-training safety evaluations.

The Mechanics of Basis-Shifting Attacks

The integrity of artificial intelligence safety auditing relies heavily on the assumption that the internal representations of a model remain consistent between the audit phase and deployment. A recent demonstration against Introspection Adapters (IA), a framework proposed by Shenoy et al. (2026) for detecting malicious fine-tunes, breaks this assumption. Researchers have shown that an attacker with control over model weights can execute a basis-shifting attack using a computationally cheap, output-preserving transformation. The transformation leaves the model's input-output behavior intact while moving its internal representations off the basis against which the auditor was originally calibrated.

In practice, this means the neural network's observable behavior remains entirely unchanged, yet the internal triggers that the Introspection Adapter relies upon to flag malicious activity are rendered silent. For example, if a clinic deploys a fine-tuned version of a large language model like Llama 405B for patient intake, a malicious actor could inject a hidden behavior designed to steer patients toward specific pharmaceutical candidates. Under normal circumstances, an IA would elicit this hidden behavior during an audit. However, by applying the output-preserving transform prior to distribution, the adversary ensures the IA fires on the wrong basis, allowing the compromised model to pass the security audit undetected.
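The idea of an output-preserving basis shift can be illustrated on a toy model. The sketch below is not the transform used in the attack, which the available material does not specify, and it does not implement an Introspection Adapter. It assumes a hypothetical two-layer linear network and a fixed linear probe standing in for an auditor calibrated on the hidden layer. An orthogonal matrix Q is folded into the first layer and its inverse into the second, so the output is unchanged while the hidden coordinates the probe reads have moved.

```python
# Toy illustration only: a two-layer linear "model" y = W2 (W1 x) and a fixed
# linear probe standing in for an auditor. All names and numbers are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]

def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

W1 = [[1.0, 0.0], [0.0, 1.0]]  # input -> hidden
W2 = [[0.5, -1.0]]             # hidden -> output
probe = [1.0, 0.0]             # "auditor" reads hidden coordinate 0

# Orthogonal change of basis (a 90-degree rotation), so Q^T is the inverse of Q.
Q = [[0.0, -1.0], [1.0, 0.0]]

# Fold Q into the first layer and its inverse into the second layer.
W1_shifted = matmul(Q, W1)
W2_shifted = matmul(W2, transpose(Q))

x = [3.0, 0.0]  # input that activates the coordinate the probe was calibrated on

hidden_original = matvec(W1, x)
hidden_shifted = matvec(W1_shifted, x)

y_original = matvec(W2, hidden_original)
y_shifted = matvec(W2_shifted, hidden_shifted)

probe_original = sum(p * h for p, h in zip(probe, hidden_original))
probe_shifted = sum(p * h for p, h in zip(probe, hidden_shifted))

print("output:", y_original, y_shifted)
print("probe :", probe_original, probe_shifted)
```

In this toy case the two outputs match while the probe reading on the trigger input falls from 3.0 to 0.0. Real networks have nonlinearities and normalization layers that restrict which transforms preserve outputs, so this shows the principle rather than a working attack.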

The Threat Model Misalignment

The discovery of this vulnerability exposes a fundamental disconnect in how different factions within the AI safety community conceptualize threat models. As highlighted in the original analysis on lessw-blog, the researchers who developed the attack and the original authors of the Introspection Adapters framework were operating under entirely different assumptions regarding adversary capabilities.

The attack was modeled on an open-weight distribution scenario. In this two-player security game, the adversary (for example, a malicious entity distributing a fine-tune) has full read and write access to the model's parameters. They can manipulate the weights directly before shipping the model to a victim, who then runs the IA locally. Conversely, the original IA authors designed their defense for an API-based fine-tuning ecosystem. In their threat model, a provider like Anthropic controls the infrastructure and the fine-tuning process, and the adversary is merely a customer providing malicious training data. Because the API provider retains strict control over the weights, the customer cannot apply a post-training basis-shifting transform, making the IA an effective defense within that specific boundary.
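The difference between the two threat models comes down to a single adversary capability, which can be written out explicitly. The following is a minimal sketch: the class name, field names, and the `basis_shift_possible` helper are hypothetical and exist only to make the comparison concrete.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThreatModel:
    """Hypothetical summary of what the adversary controls in each setting."""
    name: str
    adversary_supplies_training_data: bool
    adversary_can_write_weights: bool

def basis_shift_possible(model: ThreatModel) -> bool:
    # The transform is applied to the weights after training,
    # so it requires write access to the parameters.
    return model.adversary_can_write_weights

OPEN_WEIGHT = ThreatModel(
    name="open-weight distribution",
    adversary_supplies_training_data=True,
    adversary_can_write_weights=True,
)
API_FINE_TUNING = ThreatModel(
    name="API-based fine-tuning",
    adversary_supplies_training_data=True,
    adversary_can_write_weights=False,
)

for tm in (OPEN_WEIGHT, API_FINE_TUNING):
    print(f"{tm.name}: basis shift possible = {basis_shift_possible(tm)}")
```

Stating the capability as an explicit field makes it harder to evaluate a defense against one setting and then quote the result for the other.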

Systemic Implications for Open-Weight Ecosystems

This divergence in threat modeling carries profound implications for the broader AI ecosystem, particularly as the industry debates the safety of open-weight model distribution. The vulnerability of Introspection Adapters to basis-shifting attacks demonstrates that static safety auditing techniques are inherently fragile when adversaries possess direct access to model parameters. If a defense mechanism is calibrated on the original basis of a foundational model, any downstream modification that alters this basis without degrading output quality will systematically invalidate the audit.

From a systemic risk perspective, this invalidates the concept of a universal safety stamp for open-weight models. Organizations cannot simply download a fine-tuned model from a public repository, run a standard suite of Introspection Adapters, and confidently deploy the model into production. The existence of cheap, output-preserving transforms means that malicious actors can mass-produce compromised models that evade standard detection frameworks. This necessitates a paradigm shift in how we approach open-weight security, moving away from static, basis-dependent audits toward dynamic, behavioral, or basis-invariant detection methodologies. Furthermore, it highlights the friction in adopting open-weight models for highly regulated industries, such as healthcare or finance, where the provenance and integrity of the model's internal representations must be guaranteed.
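To make the contrast between basis-dependent and basis-invariant readings concrete, the sketch below compares two quantities under a rotation of a hypothetical hidden space: a fixed probe direction, and the inner product between two activation vectors. It assumes the basis change is orthogonal, which the source does not state. Under a general invertible transform the inner product would change as well, so this illustrates the concept and is not a proposed defense.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical orthogonal basis change: rotation by an arbitrary angle.
theta = 0.7
Q = [
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta), math.cos(theta)],
]

# Hypothetical hidden activations for two different inputs.
h_a = [3.0, 0.5]
h_b = [1.0, -2.0]
h_a_shifted = matvec(Q, h_a)
h_b_shifted = matvec(Q, h_b)

# Basis-dependent reading: a probe direction fixed in the original coordinates.
probe = [1.0, 0.0]
fixed_probe_before = dot(probe, h_a)
fixed_probe_after = dot(probe, h_a_shifted)

# Reading that is invariant under orthogonal transforms:
# the inner product between activations.
similarity_before = dot(h_a, h_b)
similarity_after = dot(h_a_shifted, h_b_shifted)

print("fixed probe:", fixed_probe_before, fixed_probe_after)
print("similarity :", similarity_before, similarity_after)
```

The fixed-probe reading changes after the rotation while the pairwise similarity does not, up to floating-point error. Whether a practical audit can be built from quantities like this, and against which classes of transform, remains an open question.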

Limitations and Open Questions

While the demonstration of the basis-shifting attack is compelling, several critical limitations and open questions remain. The specific mathematical formulation of the output-preserving transform utilized to shift the basis is not fully detailed in the accessible technical briefs, leaving the exact computational overhead and scalability of the attack somewhat ambiguous. Additionally, the underlying mechanism of how Introspection Adapters detect and elicit hidden behaviors in neural networks relies on complex representation engineering that may possess undocumented failure modes beyond basis shifting.

Furthermore, the full scope of the Shenoy et al. (2026) paper and the boundaries of its proposed defense require further scrutiny. It remains unproven whether Introspection Adapters can be modified to become basis-invariant, or whether the very nature of output-preserving transforms makes such a defense mathematically impossible. The AI safety community must also determine whether similar basis-shifting attacks can defeat other forms of mechanistic interpretability and representation reading, or whether this vulnerability is unique to the specific calibration methods used by Introspection Adapters.

Synthesis

The successful bypass of Introspection Adapters via basis-shifting transformations serves as a critical warning regarding the fragility of current AI auditing frameworks. By exposing the stark contrast between open-weight and API-based threat models, this development forces a reevaluation of how safety guarantees are constructed and verified. As the deployment of highly capable, fine-tuned models accelerates, the security community must prioritize the development of auditing techniques that remain robust even when adversaries possess full control over the underlying model parameters. Relying on static, basis-dependent defenses in an open-weight ecosystem is a structural vulnerability that must be addressed to ensure the safe integration of artificial intelligence into critical infrastructure.

Key Takeaways

  • Introspection Adapters can be defeated by adversaries using a cheap, output-preserving weight transformation that shifts the model's internal basis.
  • A critical misalignment exists between open-weight threat models (where attackers control weights) and API-based threat models (where providers control weights).
  • Static safety auditing techniques are highly vulnerable in open-weight ecosystems, necessitating a shift toward basis-invariant detection methodologies.
