Pro or Average Joe? How LLMs Internally Judge Technical Competence
Coverage of lessw-blog
In a recent technical analysis on LessWrong, the author investigates how Large Language Models (LLMs) infer a user's programming skills and whether this internal judgment can be identified and controlled.
The post tackles a compelling area of mechanistic interpretability: the internal "user models" that Large Language Models (LLMs) construct during interaction. While it is widely observed that LLMs adapt their tone and complexity to user input, the mechanisms driving this adaptation have remained somewhat opaque. The author asks whether models merely match surface patterns or whether they maintain a distinct, retrievable representation of a user's technical proficiency, specifically in the context of Python programming.
The Context: Implicit User Modeling
As LLMs are increasingly integrated into development environments and educational tools, understanding how they perceive users is critical. If a model infers that a user is a novice based on a simple question, it might over-simplify its answer or withhold advanced, more efficient solutions. Conversely, accurately detecting expertise allows the model to cut through preamble and deliver high-level technical specifications. Understanding where and how this judgment occurs within the model's neural architecture is the first step toward building more reliable and steerable AI assistants.
The Analysis: Probing for Expertise
The post details an experiment designed to locate the model's internal "expertise switch." Using linear probes (classifiers trained on the model's internal activation states), the author demonstrates that LLMs do indeed form a robust representation of user ability. The probes can detect the model's assessment of the user (Expert vs. Novice) with high accuracy, peaking around layer 16 of the tested architecture.
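The probing setup can be illustrated with a minimal sketch. The post trains probes on real transformer activations; here the activations are simulated (the hidden dimension, noise scale, and the "expertise direction" separating the two classes are all illustrative assumptions, not details from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64       # assumed hidden size, for illustration only
n_per_class = 200

# Assume the model encodes user expertise along some (initially unknown)
# direction in the residual stream.
expertise_dir = rng.normal(size=d_model)
expertise_dir /= np.linalg.norm(expertise_dir)

# Simulated layer activations: "expert" contexts shifted one way along
# the direction, "novice" contexts the other way.
experts = rng.normal(size=(n_per_class, d_model)) + 2.0 * expertise_dir
novices = rng.normal(size=(n_per_class, d_model)) - 2.0 * expertise_dir
X = np.vstack([experts, novices])
y = np.array([1] * n_per_class + [0] * n_per_class)

# Linear probe: logistic regression fit by plain gradient descent.
w, b, lr = np.zeros(d_model), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(expert)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"probe accuracy: {acc:.2f}")
```

Because the two classes differ only along one direction, the learned probe weights `w` end up nearly parallel to `expertise_dir`, which is exactly the property that makes such probes useful for locating where a concept is encoded.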
Crucially, the analysis reveals that these internal probes are often more accurate at determining the model's stance than simply asking the model via a prompt. This suggests a disconnect between the model's internal state and its generated output, a known phenomenon in interpretability research. The author further explores the specific components involved, analyzing token unembedding weights and Sparse Autoencoder (SAE) decoder features to map how abstract concepts of skill are encoded mathematically.
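The unembedding analysis can be sketched in the same spirit: project a candidate expertise direction through the unembedding matrix and rank which vocabulary tokens it promotes. Everything below (the toy vocabulary, the matrix, and how the direction is constructed) is a stand-in assumption, not the post's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 128, 6
tokens = ["expert", "advanced", "senior", "beginner", "novice", "simple"]

# Toy unembedding matrix mapping residual-stream space -> vocabulary logits.
W_U = rng.normal(size=(d_model, vocab))

# Construct a toy direction that, by assumption, points toward the
# expert-flavored tokens and away from the novice-flavored ones.
direction = (W_U[:, 0] + W_U[:, 1] + W_U[:, 2]
             - W_U[:, 3] - W_U[:, 4] - W_U[:, 5])
direction /= np.linalg.norm(direction)

logits = direction @ W_U                 # per-token promotion scores
ranked = [tokens[i] for i in np.argsort(-logits)]
print("tokens promoted by the direction:", ranked)
```

In real interpretability work this "logit lens"-style projection is applied to directions found by probes or SAE decoder features, to give them a human-readable interpretation.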
Steering the Model
Perhaps the most significant finding is the potential for control. The post demonstrates that this internal representation is not just a passive observation but a steerable variable. By manipulating the activation vectors associated with expertise, it is possible to force the model to treat a user as an expert or a novice, regardless of the actual complexity of the user's input. This opens new avenues for "controllable AI," where users could explicitly set the persona of their assistant rather than relying on the model's implicit (and potentially flawed) inference.
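A minimal sketch of activation steering, assuming the expertise vector has already been extracted (e.g. from a trained probe or from mean expert-minus-novice activations). The vector, scale, and layer choice here are illustrative, not the post's actual values:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 64

# Assumed pre-extracted unit "expertise vector" for the target layer.
expertise_vec = rng.normal(size=d_model)
expertise_vec /= np.linalg.norm(expertise_vec)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift residual-stream activations along `direction`.
    alpha > 0 pushes toward 'treat user as expert',
    alpha < 0 toward 'treat user as novice'."""
    return hidden + alpha * direction

# Toy activation at one token position.
hidden = rng.normal(size=d_model)

before = float(hidden @ expertise_vec)
after_expert = float(steer(hidden, expertise_vec, 8.0) @ expertise_vec)
after_novice = float(steer(hidden, expertise_vec, -8.0) @ expertise_vec)
print(before, after_expert, after_novice)
```

In a real model this addition would be applied inside a forward hook at the chosen layer during generation; because `expertise_vec` is unit-norm, the projection of the activation onto it shifts by exactly `alpha`.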
For developers and researchers working on AI alignment and personalized systems, this deep dive offers concrete evidence that user modeling is a tangible, manipulatable feature of current transformer architectures.
Read the full post on LessWrong
Key Takeaways
- LLMs create accurate internal representations of a user's technical expertise (specifically in Python).
- Linear probes trained on internal layers (e.g., layer 16) are more effective at detecting this judgment than prompting the model directly.
- The model's perception of user skill directly influences its downstream behavior and response generation.
- It is possible to manipulate these internal states to 'steer' the model, forcing it to respond as if the user were an expert or a novice.