Cross-Model Interpretability: Translating SAE Features Across LLMs Using Foreign Activation Verbalizers

Recent research published on lessw-blog demonstrates that Natural Language Autoencoder (NLA) Activation Verbalizers (AVs) can interpret Sparse Autoencoder (SAE) features taken from an entirely different language model. By showing that a simple ridge-regression map can bridge the residual streams of two distinct architectures, the work suggests a surprising degree of representational alignment across models, which could substantially reduce the compute required to audit new AI systems.

The Mechanics of Cross-Model Translation

Mechanistic interpretability has historically operated under a strict assumption of architectural rigidity: tools built to understand the internal representations of a specific model at a specific layer are bound to that exact configuration. Sparse Autoencoders (SAEs) are trained to disentangle polysemantic residual streams into monosemantic features, and Activation Verbalizers (AVs) are subsequently trained to translate those isolated features into human-readable text. The prevailing consensus has been that these AVs are highly specialized and non-transferable. The new findings challenge this paradigm directly.

The researcher constructed a ridge-regression map to bridge the residual stream of Qwen2.5-7B-IT at layer 20 with the residual stream of Gemma-3-27B-IT at layer 41. This mapping acts as a translation layer between the two distinct vector spaces. By projecting 45 SAE decoder directions from the Qwen model into the Gemma feature space, the experiment tested whether Gemma's native AV could make sense of foreign features.
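The mapping step can be sketched in a few lines. The following is a minimal illustration, not the researcher's code: it assumes paired residual-stream activations collected from both models on the same inputs, uses the standard closed-form ridge solution with mean-centering, and renormalizes mapped directions to unit length. The function names, the regularization strength, the toy dimensions, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def fit_ridge_map(X_src, Y_tgt, lam=1.0):
    """Fit W, b so that X_src @ W + b approximates Y_tgt (ridge-regularized)."""
    mu_x = X_src.mean(axis=0)
    mu_y = Y_tgt.mean(axis=0)
    Xc = X_src - mu_x
    Yc = Y_tgt - mu_y
    d = Xc.shape[1]
    # Closed-form ridge solution: W = (Xc^T Xc + lam * I)^-1 Xc^T Yc
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    b = mu_y - mu_x @ W
    return W, b

def map_direction(direction, W):
    """Map a direction (not a point): apply only the linear part, then renormalize."""
    mapped = direction @ W
    return mapped / np.linalg.norm(mapped)

# Toy stand-ins for the two residual streams (real widths are much larger).
rng = np.random.default_rng(0)
n, d_src, d_tgt = 2000, 16, 24
W_true = rng.normal(size=(d_src, d_tgt))
X = rng.normal(size=(n, d_src))                       # "source model" activations
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_tgt))   # "target model" activations

W, b = fit_ridge_map(X, Y, lam=1e-3)

# 45 hypothetical SAE decoder directions in the source space, mapped to the target space.
decoder_dirs = rng.normal(size=(45, d_src))
mapped = np.stack([map_direction(v, W) for v in decoder_dirs])
print(mapped.shape)
```

Only the linear part of the map is applied to decoder directions here, on the assumption that a direction is an offset in activation space rather than an absolute position; whether the original experiment handled the bias and normalization this way is not stated in the source.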

Quantifying Feature Alignment

The results indicate that the Gemma AV generated plausible and coherent explanations for the mapped Qwen features, despite never being trained on Qwen's architecture or data distributions. To quantify this alignment, the researcher established a baseline using the released Qwen AV to generate explanations for the original 45 features. These baseline explanations were then compared against the explanations generated by the Gemma AV for the mapped directions.

The comparative analysis relied on cosine similarity to measure how semantically close the two sets of explanations were. The Gemma AV's interpretations of the mapped directions closely tracked the Qwen AV's baseline interpretations. Specifically, the Gemma AV explanations showed a mean feature-specific lift of +0.21 in cosine similarity over a control set of random-direction explanations. This lift suggests that the foreign verbalizer is not merely hallucinating plausible text but is tracking the semantic content of the translated feature vector.
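One way to read the lift metric is as a per-feature difference of cosine similarities, averaged over features. The sketch below is an assumption about how such a number could be computed, since the source does not give the exact definition or say how the explanations were embedded: it presumes each explanation has already been turned into a vector by some text-embedding model, and the function names and toy vectors are hypothetical.

```python
import numpy as np

def cosine_rows(A, B):
    """Row-wise cosine similarity between two equally shaped matrices."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(A * B, axis=1)

def feature_specific_lift(ref_emb, mapped_emb, control_emb):
    """Mean over features of cos(ref, mapped) - cos(ref, control).

    ref_emb:     embeddings of the native AV's explanations (baseline)
    mapped_emb:  embeddings of the foreign AV's explanations of mapped directions
    control_emb: embeddings of the foreign AV's explanations of random directions
    """
    matched = cosine_rows(ref_emb, mapped_emb)
    control = cosine_rows(ref_emb, control_emb)
    return float(np.mean(matched - control))

# Toy embeddings: mapped explanations are parallel to the baseline,
# control explanations are orthogonal to it.
ref = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mapped_expl = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
control_expl = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]])

lift = feature_specific_lift(ref, mapped_expl, control_expl)
print(lift)
```

Subtracting a per-feature control, rather than reporting raw similarity, guards against the case where all AV explanations share generic phrasing and therefore look similar regardless of the input direction.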

Refining Output with Background Washout

Alongside the cross-model mapping, the research introduces a technique termed "background washout" to improve the generation quality of AV feature explanations. Activation Verbalizers, while powerful, are often susceptible to the idiosyncratic quirks and behavioral artifacts of the specific model they are interpreting. These artifacts can introduce noise into the generated explanations, obscuring the true semantic meaning of the SAE feature.

Background washout appears to function as a denoising mechanism, mitigating the influence of these random model behaviors on the SAE decoder explanations. While the precise mathematical formulation of this technique was not fully detailed in the initial brief, the empirical result is a cleaner, more accurate textual representation of the feature's core concept. This refinement is particularly critical in cross-model applications, where the compounding noise of two distinct architectures could otherwise degrade the quality of the verbalized output.

Ecosystem Implications: The Economics of Interpretability

The demonstration of cross-model interpretability carries significant implications for the broader AI safety and auditing ecosystem. Currently, the field of mechanistic interpretability faces a severe scaling bottleneck. Every time a new frontier model is released, researchers must expend significant computational resources to train new SAEs and AVs from scratch for multiple layers of the network. This process is expensive, time-consuming, and structurally favors highly capitalized organizations.

If representational alignment between structurally distinct LLMs allows for the use of cheap alignment maps, such as the ridge-regression bridge demonstrated here, the economics of model auditing change fundamentally. Transferable interpretability pipelines mean that the community can build highly robust, generalized tools on open-weight models and subsequently map them to new, proprietary, or foreign architectures. This reduces the friction of adoption for interpretability techniques and accelerates the timeline for understanding the internal mechanics of rapidly iterating frontier models.

Methodological Limitations and Open Questions

Despite the promising results, several methodological limitations and open questions remain. The primary constraint is the sample size: the experiment mapped only 45 SAE decoder directions. SAEs trained on modern LLMs can contain anywhere from thousands to millions of features, and it is unclear whether the +0.21 cosine similarity lift holds across a broader, more diverse set that includes highly abstract or complex features.

Furthermore, the computational overhead required to train the ridge-regression map between the two models' residual streams was not explicitly quantified. While presumed to be cheaper than training a new AV from scratch, the exact resource requirements for extending this mapping to many layers and to much larger models remain unknown. Additionally, the training methodology and architecture of the specific NLA Activation Verbalizers used in the experiment require further documentation to ensure reproducibility. Finally, background washout needs a rigorous mathematical formalization to establish exactly which model quirks are being suppressed and whether any critical semantic nuance is lost in the denoising process.

The rigid assumption that interpretability tools are permanently bound to their native models is beginning to fracture. By demonstrating that simple mathematical bridges can translate feature spaces across distinct architectures, this research points toward a future of modular, transferable AI auditing. As the industry continues to deploy increasingly complex systems, the ability to recycle and adapt existing interpretability pipelines will be critical for maintaining oversight without incurring prohibitive computational costs.

Key Takeaways

A ridge-regression map successfully bridged the residual streams of Qwen2.5-7B-IT and Gemma-3-27B-IT, allowing cross-model feature translation.
Gemma's Activation Verbalizer generated plausible explanations for mapped Qwen SAE features, showing a +0.21 cosine similarity lift over random controls.
The findings suggest representational alignment across distinct LLMs, indicating interpretability tools do not need to be rebuilt from scratch for every model.
A technique called 'background washout' was introduced to mitigate the influence of random model quirks and improve explanation quality.
The study is currently limited by a small sample size of 45 features and lacks detailed quantification of the mapping's computational overhead.