The Surface-Level Trap: Do LLMs Actually Learn Human Values?
Coverage of lessw-blog
In a compelling analysis of the 'Deep Value Benchmark,' a recent post on LessWrong highlights research by Ashkinaze et al. (2025) that questions the fundamental efficacy of current AI training methods regarding value alignment.
The post, titled "Do LLMs Learn Our Preferences or Just Our Behaviors?", covers the Deep Value Benchmark paper by Ashkinaze et al. (2025). The discussion strikes at the heart of a primary assumption in modern AI development: that when we train Large Language Models (LLMs) on human preferences, the models internalize the underlying values driving those preferences. The analysis suggests this assumption may be dangerously flawed.
The context for this research is the widespread reliance on Reinforcement Learning from Human Feedback (RLHF) to align models. The hope has always been that by showing a model examples of "good" and "bad" responses, we teach it the abstract moral or utility function behind those judgments. However, the highlighted research indicates that models are far more likely to latch onto spurious correlations (surface-level features) than onto the deep semantic values intended by the developers.
To test this, the researchers employed a rigorous methodology involving confounded training data. They created scenarios where specific moral values were tied to distinct stylistic choices. For example, "kindness" might always be presented in a "formal" style, while "fairness" was presented in a "casual" style. The objective was to see what the model would do when presented with a test case that separated the value from the style. If the model truly learned "kindness," it should be able to express it casually. If it merely learned the behavior, it would associate the formal style with the positive reward, regardless of the moral content.
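To make the setup concrete, here is a minimal sketch (not the authors' code; all strings and field names are illustrative) of what a confounded preference dataset and a deconfounded test item look like. During training, the value and the style always co-occur; at test time, the correlation is deliberately broken.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator rejected
    value: str     # underlying value driving the preference
    style: str     # surface style of the chosen response

# During training, value and style are perfectly confounded:
# "kindness" is only ever seen in a formal register.
train = [
    PreferencePair(
        chosen="I would be delighted to assist you with that request.",
        rejected="Figure it out yourself.",
        value="kindness",
        style="formal",
    ),
]

# At test time the correlation is broken: the kind response is casual.
# A model that learned the value should still prefer `chosen`; a model
# that learned the behavior will prefer the formal-but-unkind `rejected`.
test = PreferencePair(
    chosen="Sure thing, happy to help!",
    rejected="I must decline to offer assistance.",
    value="kindness",
    style="casual",
)
```

The key property is that `test` shares its value with the training data but not its style, so the two learning hypotheses make opposite predictions.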
The results reported in the post are concerning for alignment researchers. At test time, models consistently followed the surface features (style) rather than the underlying moral values. Across nine different models, the rate of generalizing based on the actual value averaged only 0.30, significantly worse than random chance. This implies that the models were actively misled by the stylistic cues, treating the style as the signal and the moral value as noise.
Crucially, the post notes that this is not a capability issue. When the models were explicitly told which option embodied which value, they could identify them correctly. The failure lies in the learning process from preference data; the models default to the path of least resistance, which in language modeling is often the tone or structure of the text rather than its semantic intent. This distinction is vital because it suggests that current benchmarking methods, which often lack these adversarial controls, may be giving us a false sense of security regarding how aligned our systems actually are.
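The capability check described above amounts to asking the model directly, rather than inferring the value from preference data. A hypothetical prompt template (the wording is an assumption, not taken from the paper) might look like:

```python
def identification_prompt(value, option_a, option_b):
    # Ask the model to name which response embodies a given value,
    # rather than having it infer the value from preference pairs.
    return (
        f"Which response better embodies the value of {value}?\n"
        f"A: {option_a}\n"
        f"B: {option_b}\n"
        "Answer with A or B."
    )

prompt = identification_prompt(
    "kindness",
    "Sure thing, happy to help!",
    "I must decline to offer assistance.",
)
print(prompt)
```

The finding is that models answer such explicit probes correctly, which localizes the failure to what preference training rewards, not to what the model can represent.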
This analysis serves as a significant signal for anyone working in Foundation Models or AI Safety. It challenges the robustness of behavioral cloning and RLHF, suggesting that without new training methodologies that explicitly disentangle style from substance, we may be building systems that mimic human values without understanding them.
We highly recommend reading the full analysis and the associated paper to understand the specific methodologies used.
Read the full post at LessWrong
Key Takeaways
- The study utilized a 'Deep Value Benchmark' to test if LLMs learn abstract values or just surface-level behaviors.
- Researchers confounded moral values (e.g., kindness) with stylistic features (e.g., formality) in training data to test generalization.
- Models consistently prioritized surface features over values, with a generalization rate of ~0.30 (worse than chance).
- The failure is in the learning process, not capability; models can identify values when explicitly prompted but fail to extract them from preference data naturally.
- The findings suggest current alignment techniques like RLHF may be optimizing for style mimicry rather than value internalization.