A Black Box Made Less Opaque: Exploring Sparse Autoencoders on GPT-2
Coverage of lessw-blog
A recent analysis on LessWrong utilizes residual stream sparse autoencoders (SAEs) to decompose the internal processing of GPT-2 small, offering a practical demonstration of mechanistic interpretability concepts.
In a recent post, lessw-blog presents an analysis titled "A Black Box Made Less Opaque (part 1)." As Large Language Models (LLMs) become increasingly integrated into critical infrastructure, the "black box" problem (our inability to fully understand the internal mechanisms driving model outputs) remains a significant safety hurdle. Mechanistic Interpretability (MI) seeks to address this by reverse-engineering neural networks to identify the specific circuits responsible for behaviors. This post contributes to that effort by applying residual stream sparse autoencoders (SAEs) to GPT-2 small, aiming to illustrate fundamental concepts like feature identification and activation geometry.
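To make the SAE idea concrete, here is a minimal sketch of what a residual-stream sparse autoencoder computes. The shapes and weight names are illustrative assumptions (the post's actual code and trained weights are not reproduced here): the encoder projects a 768-dimensional GPT-2 small residual vector into a much larger dictionary of feature activations, and the decoder reconstructs the original vector from them.

```python
import numpy as np

# d_model: GPT-2 small's residual-stream width; d_sae: the (typically
# much larger) dictionary of learned features. Weights are random
# stand-ins; a real SAE is trained with a reconstruction + L1 loss.
d_model, d_sae = 768, 4 * 768
rng = np.random.default_rng(0)

W_enc = rng.normal(0, 0.02, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.02, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(resid: np.ndarray):
    """Encode a residual-stream vector into nonnegative feature
    activations (sparse once trained), then decode an approximation."""
    acts = np.maximum(resid @ W_enc + b_enc, 0.0)  # ReLU encoder
    recon = acts @ W_dec + b_dec
    return acts, recon

resid = rng.normal(size=d_model)  # stand-in for one token's residual
acts, recon = sae_forward(resid)
print(acts.shape, recon.shape)
```

Interpretability work then treats each of the `d_sae` dictionary entries, rather than the raw neurons, as a candidate "feature" to inspect.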
The author focuses on replicating pioneering MI analysis at a manageable scale to document how interpretable features behave. By examining how the model processes specific text strings, the analysis tracks how "features" (interpretable units of computation extracted by the SAE) evolve across different layers. The study observes that both the peak activation (the single most active feature) and the aggregate activation (the sum of the top five features) tend to increase with layer depth as the input is transformed by the model. Furthermore, the specific features that trigger the highest activation change from layer to layer, suggesting a dynamic reshuffling of internal priorities as the model computes its final output.
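The per-layer statistics described above can be sketched as follows. The function below is a generic illustration, not the post's code: given one layer's SAE activation vector, it records which feature fired hardest (the "peak") and the sum of the top five activations (the "aggregate").

```python
import numpy as np

def layer_stats(acts: np.ndarray) -> dict:
    """Summarize one layer's SAE feature activations: the index and
    value of the strongest feature, plus the sum of the top five."""
    top5 = np.sort(acts)[-5:]  # five largest activations, ascending
    return {
        "peak_feature": int(np.argmax(acts)),
        "peak_value": float(top5[-1]),
        "top5_sum": float(top5.sum()),
    }

# Synthetic stand-ins for per-layer SAE outputs (GPT-2 small has
# 12 layers); real analysis would use activations from each layer.
rng = np.random.default_rng(1)
for layer in range(12):
    acts = np.maximum(rng.normal(size=3072), 0.0)
    s = layer_stats(acts)
    print(layer, s["peak_feature"], round(s["top5_sum"], 2))
```

Comparing `peak_feature` across layers is what reveals the reshuffling the post describes: the index of the strongest feature keeps changing as computation proceeds.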
A key part of the investigation involves "specialist scores," which measure how selective a feature is for specific concepts. The results here were mixed; while some categories, such as social interactions, activated progressively more specialized features in later layers, others did not follow a clear pattern. This highlights the complexity of mapping human concepts onto neural weights, even in smaller, older architectures like GPT-2.
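The post does not spell out its exact formula, so the following is one plausible way a "specialist score" could be defined: a normalized contrast between a feature's mean activation on prompts from a target category and on a baseline set. The function name, the score definition, and the activation values are all hypothetical.

```python
import numpy as np

def specialist_score(in_cat: np.ndarray, baseline: np.ndarray) -> float:
    """Contrast a feature's mean activation on in-category prompts
    against a baseline set. Returns a value in [-1, 1]; near 1 means
    the feature fires almost exclusively on the category."""
    a, b = in_cat.mean(), baseline.mean()
    if a + b == 0:
        return 0.0
    return float((a - b) / (a + b))

# Hypothetical per-prompt activations for one feature.
social = np.array([2.1, 1.8, 2.5])   # e.g. social-interaction prompts
other = np.array([0.1, 0.0, 0.2])    # unrelated baseline prompts
print(round(specialist_score(social, other), 2))  # → 0.91
```

Under a definition like this, the post's mixed results would mean that for some categories the score rises across layers while for others it shows no clear trend.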
While the author notes low confidence in directly applying these specific findings to modern, massive models, the work serves as a crucial educational tool. It documents the practical application of SAEs, helping to democratize the understanding of how researchers might eventually control and align more powerful AI systems.
For researchers and engineers interested in the nuts and bolts of AI safety and interpretability, this post offers a clear, step-by-step look at the current methodology.
Read the full post on LessWrong
Key Takeaways
- The analysis applies residual stream sparse autoencoders (SAEs) to GPT-2 small to visualize internal model mechanics.
- Activation levels for features generally increase as the input progresses through the model's layers.
- The most active features change between layers, indicating a reshuffling of computational focus.
- Specialist scores showed mixed results; some categories became more specialized in later layers while others did not.
- The post serves as a foundational guide for understanding Mechanistic Interpretability, despite the use of an older model architecture.