PSEEDR

The Fidelity Problem: How Cross-Layer Transcoders Can Misrepresent AI Computation

Coverage of lessw-blog

· PSEEDR Editorial

In a recent technical post, lessw-blog investigates a significant limitation in Cross-Layer Transcoders (CLTs), revealing how standard sparsity incentives can drive these tools to produce unfaithful maps of neural network circuits.

The post addresses a nuanced but critical challenge facing the field of mechanistic interpretability: the reliability of Cross-Layer Transcoders (CLTs). As researchers strive to reverse-engineer Large Language Models (LLMs) into understandable components, CLTs have emerged as vital infrastructure. They are designed to bridge the gap between different layers of a neural network, allowing researchers to trace "circuits": the logical pathways of computation that dictate model behavior.

The core promise of mechanistic interpretability is that we can move beyond treating AI as a black box by decomposing its operations into human-intelligible features. However, the analysis provided by lessw-blog suggests that the tools we use to map these territories may be drawing incorrect maps. The post argues that while CLTs are excellent at predicting the output of a computation, they do not necessarily give a faithful account of how that computation was carried out.

The Mechanics of Unfaithfulness
The author illustrates this problem using a Boolean toy model. In this controlled environment, the underlying neural network performs a deep, multi-hop calculation (e.g., Step A leads to Step B, which leads to Step C). However, when a CLT is trained to interpret this process, it often "rewrites" the circuit. Instead of reporting the multi-step chain, the CLT learns a shallow, single-hop shortcut that connects the input directly to the output.
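The contrast can be sketched in a few lines of Python. This is an illustrative toy, not code from the post: the function names and the specific Boolean chain are assumptions made purely for the example.

```python
# Illustrative toy (not from the original post): a two-hop Boolean chain
# versus the single-hop shortcut a sparsity-driven CLT might learn.

def multi_hop(a: bool) -> bool:
    """The network's 'true' computation: Step A -> Step B -> Step C."""
    b = not a      # Step A -> B
    c = not b      # Step B -> C
    return c

def single_hop_shortcut(a: bool) -> bool:
    """A CLT's rewritten circuit: input wired directly to the output."""
    return a       # same answers, but the intermediate step B is gone

# The two circuits agree on every input, yet describe different mechanisms.
for a in (True, False):
    assert multi_hop(a) == single_hop_shortcut(a)
```

An auditor reading only the shortcut circuit would never learn that the intermediate feature B exists, even though both circuits predict identical outputs.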

This occurs largely due to sparsity penalties. In interpretability research, sparsity is a desired trait; we want to explain model behavior using as few active features as possible to reduce complexity. However, lessw-blog demonstrates that this very incentive encourages the CLT to compress the computational pathway. The result is an explanation that is behaviorally accurate (it predicts the right answer) but mechanistically false (it obscures the actual internal steps the model took).
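To make the incentive concrete, consider a stylized transcoder objective of the kind described above: a reconstruction-error term plus a penalty on the number of active features. The cost function and feature counts below are hypothetical, chosen only to show why the optimizer favors the shortcut.

```python
# Hypothetical sketch of a sparsity-penalized training objective.
# Numbers are illustrative, not taken from the original post.

def explanation_cost(reconstruction_error: float,
                     n_active_features: int,
                     sparsity_weight: float = 0.1) -> float:
    # Fidelity term + sparsity term (an L0-style penalty on active features).
    return reconstruction_error + sparsity_weight * n_active_features

# Both candidate explanations predict the output perfectly (error = 0.0):
faithful_cost = explanation_cost(0.0, n_active_features=3)  # A -> B -> C
shortcut_cost = explanation_cost(0.0, n_active_features=2)  # A -> C

# With equal predictive accuracy, the sparsity term alone decides,
# and it prefers the mechanistically false shortcut.
assert shortcut_cost < faithful_cost
```

Because both explanations reconstruct the output equally well, the sparsity penalty is the tiebreaker, and it always rewards the circuit with fewer active features regardless of faithfulness.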

Implications for AI Safety and Research
This finding is significant because it challenges the assumption that predictive accuracy equates to explanatory fidelity. If CLTs systematically flatten complex circuits into simple ones, researchers might underestimate the depth and complexity of AI reasoning. The post notes that preliminary evidence suggests these discrepancies are not limited to toy models but also appear in real language models, where per-layer transcoders and cross-layer transcoders imply different circuit architectures.

For researchers relying on circuit tracing to audit models for safety or bias, this highlights a need for rigorous validation of interpretability tools. It suggests that current training objectives for CLTs may need refinement to prioritize faithfulness over pure sparsity.

We recommend this post to anyone working in AI safety, interpretability, or model architecture, as it addresses a fundamental methodological hurdle in understanding how neural networks think.

Read the full post at LessWrong

Key Takeaways

  • Cross-Layer Transcoders (CLTs) are increasingly used as infrastructure for tracing computational circuits in AI models.
  • Research indicates that CLTs can be "unfaithful," rewriting deep, multi-step computations into shallow, single-step shortcuts.
  • Sparsity penalties, while useful for reducing complexity, incentivize CLTs to obscure the true computational pathway in favor of efficiency.
  • Discrepancies between per-layer and cross-layer interpretations have been observed in real language models, not just toy examples.
  • The findings necessitate a re-evaluation of how interpretability tools are trained and trusted.

