The Elusive Nature of Computational Primitives in Mechanistic Interpretability
Coverage of a LessWrong blog post
In a recent post on LessWrong, the author explores a foundational debate within the field of mechanistic interpretability: the precise definition and existence of "features." As researchers increasingly rely on techniques like Sparse Autoencoders (SAEs) to decompose Large Language Models (LLMs) into understandable components, the underlying assumption is often that these models operate like compiled software: logic that can be faithfully "decompiled" back into human-readable algorithms. This post challenges that analogy, suggesting that the "true" computational primitives of a neural network may be far more ambiguous than previously hoped.
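To make the SAE idea concrete, here is a minimal sketch of what such a decomposition looks like: activations are encoded into a much larger, mostly-inactive set of "feature" coefficients, then reconstructed. The dimensions, weights, and loss coefficient below are toy values invented for illustration, not anything from the post or any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): activations of size d_model are decomposed
# into d_sae >> d_model candidate "features".
d_model, d_sae = 8, 32

# Randomly initialized weights stand in for a trained SAE.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse feature coefficients, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU leaves most features inactive
    x_hat = f @ W_dec + b_dec                # reconstruction from active features
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f, x_hat = sae_forward(x)
    return np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()

x = rng.normal(size=d_model)   # a stand-in for one residual-stream activation
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)    # (32,) (8,)
```

The interpretability bet is that the nonzero entries of `f` correspond to human-meaningful concepts; the post's question is whether that correspondence reflects the network's actual computation or just a convenient basis.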
The Context: The Search for the Source Code
Mechanistic interpretability aims to reverse engineer neural networks to understand their internal logic. A prevailing mental model in this field is the "decompilation" analogy. Just as a software engineer might take a binary executable and decompile it into C++ code to understand a bug, AI researchers attempt to map the weights and activations of a neural network to discrete algorithms. This relies heavily on the Linear Representation Hypothesis: the idea that concepts (features) are represented as directions in the model's activation space. If this hypothesis holds, we should theoretically be able to extract the exact algorithm the model is running.
The Gist: When the Ground Truth Shifts
The LessWrong post argues that neural networks may not actually implement a single, clean algorithm that maps 1:1 with human concepts. To illustrate this, the author points to the case study of the "modular addition network." This is a small, controlled network trained to perform modular addition, a task often used to study "grokking" (where a model suddenly generalizes after long periods of overfitting).
Initially, researchers believed the network implemented a specific trigonometric algorithm involving sine and cosine identities. Later, a different interpretation emerged suggesting the network was using piecewise linear approximations (essentially "memorizing" shapes that looked like trig functions). Even more recently, the explanation shifted again to a "trigonometric integral" approximation. The author uses this evolution to make a critical point: if we cannot definitively pin down the "true" algorithm of a tiny, simple network, the goal of finding the "true" computational primitives of massive foundation models might be mathematically ill-posed.
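The first of those interpretations, the trigonometric algorithm, can be reproduced directly: embed the inputs as sines and cosines at a few frequencies, combine them with angle-addition identities, and score each candidate answer. The modulus and frequencies below are arbitrary stand-ins; a trained network learns its own key frequencies.

```python
import numpy as np

# Sketch of the trigonometric algorithm proposed for modular addition
# (a + b) mod p. Frequencies are hypothetical choices for illustration.
p = 113
freqs = [3, 17, 41]

def modadd_trig(a, b):
    scores = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # cos(w(a+b)) and sin(w(a+b)) via the angle-addition identities:
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # Each candidate c gets logit cos(w(a+b-c)), which peaks when
        # c = (a + b) mod p; summing over frequencies sharpens the peak.
        c = np.arange(p)
        scores += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(scores))

print(modadd_trig(50, 70), (50 + 70) % p)   # both print 7
```

The post's point is that this algorithm, a piecewise linear approximation, and a trigonometric-integral approximation can all fit the same weights to within experimental tolerance, which is exactly what makes the "true algorithm" question slippery.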
Why It Matters
This ambiguity presents a significant challenge for AI safety and alignment. If there is no unique "source code" to find, then our explanations of model behavior are merely useful approximations rather than absolute truths. However, the post concludes with a pragmatic perspective: even if "features" aren't the fundamental atomic units of the model's computation, identifying them is still useful for prediction and control, provided researchers remain aware of the abstraction gap.
Key Takeaways
- The Decompilation Fallacy: The analogy that neural networks can be reverse-engineered like compiled binary code is likely flawed because networks may not implement unique, clean algorithms.
- Ambiguity of Features: There is no consensus on whether a "feature" is a fundamental unit of computation or merely a useful direction in activation space.
- Case Study Evidence: The history of analyzing modular addition networks shows that even simple models can be validly explained by multiple, mutually contradictory mathematical frameworks over time.
- Pragmatic Interpretability: While identifying true computational primitives is difficult, current methods can still yield useful approximations for controlling model behavior.