Reevaluating the Role of Sparsity in LLM Interpretability

Coverage of lessw-blog

PSEEDR Editorial

In a detailed analysis published on LessWrong, the author investigates whether the pursuit of sparsity is strictly necessary for achieving causal interpretability in dense Large Language Models (LLMs), specifically examining Google's Gemma-2-2B.

The prevailing orthodoxy in mechanistic interpretability posits that because individual neurons in LLMs activate for multiple unrelated concepts (a phenomenon known as polysemanticity), they cannot be analyzed directly. This challenge has driven significant research into decomposing dense representations into sparse features using techniques like Sparse Autoencoders (SAEs). However, this recent post challenges the assumption that explicit sparsity is a prerequisite for understanding model behavior.
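To make that backdrop concrete, below is a minimal sketch of the sparse-autoencoder approach the post pushes back on: dense activations are re-encoded into a wider, sparsely activated feature basis trained with a reconstruction loss plus an L1 sparsity penalty. The layer widths and loss coefficient are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: map dense activations into a wider, sparsely firing feature basis."""
    def __init__(self, d_model: int = 2304, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the dense input
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return torch.mean((recon - acts) ** 2) + l1_coeff * features.abs().mean()
```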

The author focuses on Gemma-2-2B, acknowledging that static inspection of MLP neurons is indeed uninformative due to their density. However, the post argues that interpretability can be achieved by shifting focus from what activates to what causes the output. By ranking neurons on their causal contribution to the output, computed as the product of the activation difference and the gradient ($\delta \times \text{gradient}$), the analysis isolates the specific components driving the model's decisions.
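As a rough illustration of that ranking, the sketch below scores each MLP neuron by the product of its activation difference between a clean and a corrupted run and the gradient of an output metric with respect to that activation, then keeps the top scorers. The tensor shapes, variable names, and `top_k` value are assumptions for illustration rather than details from the post.

```python
import torch

def rank_neurons_by_causal_contribution(
    acts_clean: torch.Tensor,    # [n_layers, d_mlp] MLP activations at the final token (clean prompt)
    acts_corrupt: torch.Tensor,  # [n_layers, d_mlp] MLP activations at the final token (corrupted prompt)
    grads: torch.Tensor,         # [n_layers, d_mlp] gradient of the output metric w.r.t. those activations
    top_k: int = 20,
):
    # First-order estimate of each neuron's causal effect: (activation delta) x (gradient).
    scores = (acts_clean - acts_corrupt) * grads
    flat = scores.abs().flatten()
    top_vals, top_idx = torch.topk(flat, top_k)
    layers = torch.div(top_idx, scores.shape[1], rounding_mode="floor")
    neurons = top_idx % scores.shape[1]
    return list(zip(layers.tolist(), neurons.tolist(), top_vals.tolist()))
```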

The findings reveal that a surprisingly small subset of 'high-leverage' neurons, particularly in the late layers and at the final token position, exerts disproportionate control over next-token predictions. While these neurons remain polysemantic in isolation, their high causal weight makes them interpretable in context. The author demonstrates that steering these neurons through 'gradient-guided delta patching' can reliably shift the model's behavior and transfer factual associations.
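In the same spirit, steering can be sketched as an activation patch applied through a forward hook: the selected high-leverage neurons at the final token are overwritten with activations captured from a donor prompt. The module path in the usage comment is hypothetical and depends on the Gemma-2 implementation in use; this is a sketch of the general idea, not the author's exact procedure.

```python
import torch

def make_patch_hook(neuron_indices: torch.Tensor, target_values: torch.Tensor):
    # Forward hook that overwrites chosen neurons at the final token position
    # with activation values captured from a donor ("source") prompt.
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, -1, neuron_indices] = target_values  # assumes output shape [batch, seq, d_mlp]
        return patched
    return hook

# Hypothetical usage: attach the hook to the module whose output holds the
# per-neuron MLP activations in a late layer, then run generation.
# handle = model.model.layers[layer_idx].mlp.act_fn.register_forward_hook(
#     make_patch_hook(top_neurons, donor_activations))
# ... model.generate(...) ...
# handle.remove()
```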

This research suggests that 'causal localization', identifying the specific neurons doing the heavy lifting, can sometimes serve as a functional substitute for a learned sparse basis. While sparsity remains valuable, this work indicates it may not be strictly necessary for all forms of model debugging and control.

For researchers and engineers working on model transparency, this post offers a compelling argument for revisiting dense analysis methods before committing to the computational overhead of dictionary learning.

Read the full post on LessWrong
