Rotations in Superposition: Validating Neural Circuit Theory
Coverage of a LessWrong post
In a recent technical analysis published on LessWrong, the author details an experimental validation of theoretical frameworks regarding "circuits in superposition."
The concept of superposition, in which neural networks represent more features than they have neurons by exploiting high-dimensional geometry, has become a cornerstone of mechanistic interpretability research. While much of the field focuses on reverse-engineering these phenomena from trained "black box" models, this post takes a constructive approach by attempting to build such structures from the ground up.
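As a rough illustration of the geometric picture (not taken from the post), the snippet below embeds more feature directions than there are neurons by drawing random unit vectors in a low-dimensional space; the small pairwise overlaps are the interference that any circuit built in superposition has to manage. The dimensions, seed, and variable names here are illustrative assumptions, not values from the author's construction.

```python
# Minimal NumPy sketch of superposition's geometric premise:
# more feature directions than neurons, with small pairwise interference.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 20, 60            # more features than neurons

# Random unit vectors in a low-dimensional space are nearly orthogonal on average.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

overlaps = W @ W.T                        # cosine similarities between feature directions
np.fill_diagonal(overlaps, 0.0)           # ignore each feature's overlap with itself
print("max |interference|: ", np.abs(overlaps).max())
print("mean |interference|:", np.abs(overlaps).mean())
```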
The author reports successfully hand-coding a Multi-Layer Perceptron (MLP) designed specifically to implement conditional rotations on two-dimensional input features. Because the weights are set by hand rather than learned through gradient descent, the construction serves as a proof of concept for existing mathematical frameworks describing how circuits operate in superposition. The approach tests whether current theoretical models are precise enough to predict and construct specific network behaviors.
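To make the target computation concrete, here is a minimal sketch, assuming Python/NumPy, of the function the hand-coded MLP is built to realize: a 2D feature is rotated only when an accompanying condition is active. This shows the intended behavior, not the author's weight construction; the names `conditional_rotation`, `gate`, and `theta` are hypothetical.

```python
# Toy reference implementation of the target behavior:
# rotate a 2D feature by theta only when a gate signal is active.
import numpy as np

def conditional_rotation(x2d, gate, theta=np.pi / 4):
    """Return R(theta) @ x2d when gate is active, otherwise x2d unchanged."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])
    return R @ x2d if gate > 0.5 else x2d

x = np.array([1.0, 0.0])
print(conditional_rotation(x, gate=1.0))  # rotated by 45 degrees -> [0.707, 0.707]
print(conditional_rotation(x, gate=0.0))  # unchanged -> [1.0, 0.0]
```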
This experiment is significant because it bridges the gap between abstract theory and concrete implementation: it demonstrates that our mathematical understanding of interference and feature manipulation in superposition is precise enough that such circuits can be engineered explicitly. The project, supported by Coefficient Giving and Goodfire AI, suggests a deepening control over neural network internals, moving beyond observation toward intentional design.
For researchers in AI safety and interpretability, this post offers a practical demonstration of how complex computational geometry can be instantiated in simple architectures. The author has also made the associated code publicly available on GitHub for verification and further experimentation.
Read the full post on LessWrong
Key Takeaways
- The post provides experimental validation for mathematical frameworks regarding circuits in superposition.
- Instead of training, the author hand-coded the weights of an MLP to achieve specific behaviors.
- The experiment successfully implemented conditional rotations on 2D input features within the network.
- This work demonstrates a transition from theoretical observation to engineering control in interpretability research.
- Code for the hand-coded MLP and the experiments is available on GitHub.