PSEEDR

Weight-Sparse Circuits: Interpretable but Potentially Unfaithful

Coverage of lessw-blog

· PSEEDR Editorial

A recent analysis on LessWrong challenges the reliability of pruning algorithms in AI interpretability, suggesting that "clean" circuits may not reflect true model behavior.

In a recent post, a contributor on LessWrong discusses the limitations of using weight sparsity to extract interpretable circuits from transformer models. The analysis serves as a critique and replication of prior work by Gao et al. (2025), specifically questioning whether circuits derived through pruning algorithms are faithful representations of a model's underlying mechanics.

The Context: The Search for Faithful Interpretability

Mechanistic interpretability aims to decompose the complex, high-dimensional operations of neural networks into understandable components, often referred to as "circuits." A compelling hypothesis in this field is that training models with sparse weights (forcing many parameters to zero) encourages the network to organize itself into distinct, modular sub-graphs. If successful, this approach would allow researchers to audit AI behavior by inspecting these simplified structures.

However, a critical distinction exists between a circuit being interpretable (easy for humans to read) and faithful (accurately describing the causal mechanism of the model). The danger lies in creating methods that generate plausible-looking explanations that do not actually correspond to how the AI processes information, a phenomenon akin to the model "rationalizing" its behavior rather than explaining it.

The Analysis: Pruning as Distortion

The author successfully replicated the primary evidence from Gao et al., confirming that weight-sparse models can indeed produce smaller, more legible circuits for specific tasks compared to dense models. However, the post introduces significant counter-evidence regarding the fidelity of these circuits.

The core of the critique rests on two observations:

  • Nonsensical Task Performance: The author found that pruned circuits could achieve low cross-entropy (CE) loss even when applied to nonsensical tasks. This suggests the pruning algorithm might be finding sub-networks that satisfy a mathematical metric without capturing semantic logic.
  • Mechanism Mismatch: Perhaps most damning is the finding regarding attention patterns. The original, full models utilized specific, non-uniform attention heads to solve tasks. In contrast, the extracted circuits were able to solve the same tasks using uniform attention patterns.
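The attention-pattern mismatch can be quantified with a simple diagnostic: a uniform attention distribution has maximal Shannon entropy, while a head that attends sharply to specific tokens has much lower entropy. The sketch below is our illustration of that contrast, not a method from the post; the example distributions are invented.

```python
import numpy as np

def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one attention distribution.

    Higher entropy means the pattern is closer to uniform; a sharply peaked
    head (attending mostly to one token) scores much lower.
    """
    p = np.asarray(attn_row, dtype=float)
    p = p / p.sum()                      # normalize to a probability distribution
    return float(-np.sum(p * np.log(p + 1e-12)))

seq_len = 8
uniform = np.ones(seq_len) / seq_len             # like the pruned circuit's reported pattern
peaked = np.array([0.86] + [0.02] * 7)           # like a head locked onto one token

print(attention_entropy(uniform))  # log(8) ~ 2.079, the maximum for length 8
print(attention_entropy(peaked))   # substantially lower: attention is concentrated
```

If the full model's heads score low on this metric while the extracted circuit's heads sit at the maximum, the two are demonstrably not running the same computation, which is the crux of the mechanism-mismatch finding.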

This discrepancy implies that the pruning process does not merely reveal a hidden sub-structure; it may be constructing a new, simplified computation that mimics the output of the original model without preserving its internal logic. If the circuit works differently than the model it supposedly represents, its utility for safety auditing is severely compromised.

Why This Matters

For AI safety researchers, this highlights the risk of Goodhart's Law applied to interpretability: when a measure becomes a target (e.g., sparsity for the sake of legibility), it ceases to be a good measure. Relying on unfaithful circuits could lead to a false sense of security, where engineers believe they understand a model's risks based on a simplified map that fails to reflect the actual territory.

We recommend reading the full technical breakdown to understand the nuances of the experiments and the implications for future interpretability research.

Read the full post on LessWrong

Key Takeaways

  • Replication of Gao et al. (2025) confirms that weight sparsity facilitates the extraction of smaller, more legible circuits.
  • Pruned circuits appear unfaithful, utilizing computational mechanisms (such as uniform attention) that differ fundamentally from the original model.
  • The extracted circuits achieved low loss on nonsensical tasks, suggesting the pruning method may overfit to metrics rather than semantic logic.
  • The research underscores the danger of confusing interpretability (legibility) with faithfulness (causal accuracy) in AI safety auditing.