# Curated Digest: Are We Aligning the Model or Just Its Mask?

> Coverage of lessw-blog

**Published:** March 27, 2026
**Author:** PSEEDR Editorial
**Category:** risk

**Tags:** AI Safety, LLM Alignment, Persona Selection Model, AI Risk, Machine Learning

**Canonical URL:** https://pseedr.com/risk/curated-digest-are-we-aligning-the-model-or-just-its-mask

---

A recent analysis from lessw-blog explores the Persona Selection Model, asking whether current AI alignment techniques fundamentally change a model's values or simply fit it with a behavioral mask.

**The Hook**

In a recent post, lessw-blog discusses the mechanics of Large Language Model (LLM) alignment through the lens of the Persona Selection Model (PSM). The analysis questions the depth and permanence of current safety measures, asking a foundational question: are we truly aligning the underlying model, or merely selecting a palatable, cooperative mask for it to wear during user interactions?

**The Context**

This topic is critical because the AI industry relies heavily on post-training alignment to make models safe, helpful, and harmless. As these systems are deployed in high-stakes environments, the durability of their safety guardrails is a pressing concern for risk management and regulation. If current alignment techniques merely apply a behavioral veneer rather than fundamentally altering the model's core representations, the implications are severe: the underlying system would retain all of its unaligned knowledge and potential for harmful outputs, which could be exposed through jailbreaks, adversarial prompting, or unexpected edge cases.

**The Gist**

lessw-blog explores these dynamics by applying the PSM framework. The core premise is that during the massive data ingestion of the pre-training phase, LLMs learn to simulate a vast array of characters, perspectives, and personas. Post-training processes, rather than erasing undesirable traits, simply act as a selection mechanism. They train the model to adopt a specific, helpful Assistant persona as its default state. The post examines how three popular alignment techniques influence this persona selection process. The author argues that the true effectiveness of these alignment strategies depends entirely on the extent to which a model's overall behavior is constrained by its active persona. If the mask slips, the raw capabilities of the pre-trained model remain intact and accessible.
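To make the selection-versus-erasure distinction concrete, here is a minimal toy sketch in Python. It is not from the post, and every persona name, weight, and rate in it is an illustrative assumption: alignment is modeled as re-weighting a prior over learned personas toward the Assistant, while an adversarial prompt shifts probability mass back onto personas that were never removed.

```python
import random

# Toy sketch of the Persona Selection Model (PSM). All names and numbers
# are illustrative assumptions, not from the original post: pre-training is
# modeled as a fixed set of personas the model can simulate, and post-training
# alignment as a re-weighting of the prior over those personas -- it removes
# nothing.

# Personas the pre-trained model has learned to simulate (hypothetical).
PERSONAS = {
    "assistant":  {"harmful_reply_rate": 0.01},
    "contrarian": {"harmful_reply_rate": 0.30},
    "villain":    {"harmful_reply_rate": 0.90},
}

# Prior over personas after post-training: heavily weighted toward the
# Assistant, but the other personas retain small probability mass.
ALIGNED_PRIOR = {"assistant": 0.97, "contrarian": 0.02, "villain": 0.01}


def select_persona(prior: dict, adversarial_shift: float = 0.0) -> str:
    """Sample a persona. An adversarial prompt is modeled as probability
    mass moved off the Assistant onto the others -- the 'mask slipping'."""
    weights = dict(prior)
    shift = min(adversarial_shift, weights["assistant"])
    weights["assistant"] -= shift
    for name in ("contrarian", "villain"):
        weights[name] += shift / 2
    names = list(weights)
    return random.choices(names, [weights[n] for n in names])[0]


def harmful_output_rate(adversarial_shift: float, trials: int = 100_000) -> float:
    """Estimate how often the sampled persona produces a harmful reply."""
    harmful = sum(
        random.random()
        < PERSONAS[select_persona(ALIGNED_PRIOR, adversarial_shift)]["harmful_reply_rate"]
        for _ in range(trials)
    )
    return harmful / trials


if __name__ == "__main__":
    # Under benign prompting the Assistant persona dominates; under a strong
    # adversarial shift, the untouched personas resurface.
    print(f"benign:      {harmful_output_rate(0.0):.3f}")   # ~0.025
    print(f"adversarial: {harmful_output_rate(0.8):.3f}")   # ~0.50
```

The point of the toy model is the asymmetry it exposes: under benign prompting the harmful-output rate tracks the Assistant persona, but a sufficiently strong adversarial shift recovers behavior close to the raw pre-trained mixture, because selection never deleted the other personas.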

**Conclusion**

While the analysis is theoretical and builds upon prior work rather than presenting new experimental data, it provides a vital conceptual lens for evaluating AI safety. For anyone invested in the future of trustworthy AI, recognizing the difference between deep alignment and superficial persona selection is a necessary step toward building robust systems. To explore the detailed breakdown of how specific alignment techniques interact with the Persona Selection Model, [read the full post](https://www.lesswrong.com/posts/aLhzCbpjanD8Zw2jx/are-we-aligning-the-model-or-just-its-mask).

### Key Takeaways

*   LLMs learn to simulate multiple distinct personas during their initial pre-training phase.
*   Post-training alignment acts as a selection mechanism, defaulting the model to a helpful Assistant persona.
*   If alignment is merely a mask, models remain vulnerable to reverting to unaligned behaviors under adversarial conditions.
*   The effectiveness of current alignment techniques hinges on how deeply the selected persona controls the model's overall behavior.

[Read the original post at lessw-blog](https://www.lesswrong.com/posts/aLhzCbpjanD8Zw2jx/are-we-aligning-the-model-or-just-its-mask)

---

## Sources

- https://www.lesswrong.com/posts/aLhzCbpjanD8Zw2jx/are-we-aligning-the-model-or-just-its-mask
