Making Linear Probes Interpretable: A New Approach to Activation Steering
Coverage of lessw-blog
A look at how combining Sparse Autoencoders with ElasticNet regularization can turn opaque linear probes into transparent, steerable mechanisms.
In a recent post, lessw-blog explores a novel methodology for enhancing the interpretability of linear probes within Large Language Models (LLMs). The analysis focuses on bridging the gap between raw activation data and semantic understanding by leveraging Sparse Autoencoder (SAE) features.
Linear probes have long been a staple of mechanistic interpretability, allowing researchers to determine whether specific information is linearly encoded in a model's activation space. However, a traditional linear probe yields a direction vector composed of unintelligible floating-point numbers. While effective for classification, these vectors fail to explain which specific semantic concepts the model is using. As the field moves toward safer and more controllable AI, the ability to decompose these vectors into human-understandable components, and then use them for precise activation steering, is becoming increasingly vital.
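To make the contrast concrete, here is a minimal sketch of a conventional probe trained on raw activations; the array shapes, labels, and layer choice are illustrative placeholders, not data from the post.

```python
# Minimal sketch of a conventional linear probe on raw activations.
# `acts` stands in for (n_examples, d_model) residual-stream activations
# already extracted from some layer; labels mark concept present / absent.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))        # placeholder activations
labels = rng.integers(0, 2, size=1000)     # placeholder concept labels

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# The learned direction is a dense vector of 768 floats: usable for
# classification, but silent about which concepts it actually encodes.
direction = probe.coef_[0]
print(direction.shape, direction[:5])
```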
The author proposes a technique termed "Supervised Feature Selection." Instead of training probes on raw model activations, the method involves training them directly on the activations of SAE features derived from contrastive examples. By utilizing ElasticNet regularization, which enforces sparsity, the process effectively zeros out the majority of feature weights. The remaining non-zero weights correspond to specific, interpretable SAE features that distinguish the target concept.
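As a rough illustration of this step (not the author's exact code), the sketch below fits an ElasticNet-regularized logistic probe on synthetic SAE feature activations; the feature count, hyperparameters, and "ground-truth" features are assumptions chosen only to make the sparsity effect visible.

```python
# Sketch of a sparse probe over SAE feature activations. Shapes, hyperparameters,
# and the three "ground-truth" features are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_sae_features = 500, 2048
sae_acts = np.abs(rng.normal(size=(n_examples, n_sae_features)))  # stand-in SAE activations

# Synthetic contrastive labels driven by a handful of features, so the
# sparse probe has something meaningful to recover.
true_feats = [12, 345, 1789]
labels = (sae_acts[:, true_feats].sum(axis=1) > 2.4).astype(int)

# ElasticNet penalty: the L1 component pushes most weights to exactly zero.
probe = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.9, C=0.1, max_iter=5000
).fit(sae_acts, labels)

weights = probe.coef_[0]
active = np.flatnonzero(weights)
print(f"{len(active)} of {n_sae_features} SAE features kept a non-zero weight")
for idx in active[:10]:
    # Each surviving index can be looked up in an SAE feature dashboard.
    print(f"feature {idx}: weight {weights[idx]:+.3f}")
```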
This approach transforms the probe from a "black box" vector into a weighted list of semantic features. Furthermore, it allows researchers to construct steering vectors by combining SAE decoder directions weighted by the probe's coefficients. The post also addresses practical implementation details, such as handling "dirty" probes by refining input data or manually inverting specific feature weights.
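The steering-vector construction might look roughly like the following sketch, which assumes access to the SAE's decoder matrix; the feature indices, coefficients, confound handling, and the choice to normalize the result are illustrative assumptions rather than details from the post.

```python
# Sketch of building a steering vector from the sparse probe, assuming access
# to the SAE decoder matrix W_dec of shape (n_sae_features, d_model). The
# indices, coefficients, and the confound handling below are hypothetical.
import numpy as np

d_model, n_sae_features = 768, 2048
W_dec = np.random.default_rng(0).normal(size=(n_sae_features, d_model))  # stand-in decoder

weights = np.zeros(n_sae_features)
weights[[12, 345, 1789]] = [0.8, 0.5, -0.3]   # non-zero coefficients from the probe

# Cleaning a "dirty" probe: after inspecting what a confounding feature fires
# on, either drop it or flip its sign.
weights[1789] = 0.0            # drop the confound (or: weights[1789] *= -1)

# Steering vector = weighted sum of the surviving features' decoder directions.
active = np.flatnonzero(weights)
steering_vector = (weights[active, None] * W_dec[active]).sum(axis=0)
steering_vector /= np.linalg.norm(steering_vector)
print(steering_vector.shape)
```

Normalizing the result makes the steering strength an explicit scale chosen at inference time rather than a side effect of the probe's coefficient magnitudes.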
For researchers working on AI alignment and model transparency, this method offers a concrete step toward moving beyond abstract numerical representations to actionable, semantic insights.
Read the full post on LessWrong
Key Takeaways
- Linear probes are trained on SAE feature activations rather than raw activations to isolate semantic meaning.
- ElasticNet regularization is employed to enforce sparsity, ensuring only the most relevant features retain non-zero weights.
- The resulting probe weights allow for the construction of interpretable steering vectors based on SAE decoder directions (a sketch of applying such a vector at inference time follows this list).
- This method, "Supervised Feature Selection," facilitates the identification and removal of confounding features in "dirty" probes.
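For completeness, here is a hedged sketch of how such a steering vector might be injected during a forward pass, assuming a PyTorch model whose blocks are ordinary nn.Module layers; the layer path, dtype, and scaling factor are hypothetical and would need to be adapted to the model at hand.

```python
# Hedged sketch of applying a steering vector at inference time with a
# PyTorch forward hook. Layer attribute, dtype, and scaling factor are
# assumptions, not details from the post.
import torch

def make_steering_hook(steering_vector: torch.Tensor, alpha: float = 4.0):
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple whose first element is the
        # hidden states; handle both tensor and tuple outputs.
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vector,) + output[1:]
        return output + alpha * steering_vector
    return hook

# Hypothetical usage (the layer path depends on the architecture):
# vec = torch.tensor(steering_vector, dtype=torch.float32)
# handle = model.transformer.h[12].register_forward_hook(make_steering_hook(vec))
# ...generate text and observe the behavioural shift...
# handle.remove()
```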