Emergent Personas in LLMs: What Happens When AI is Told Not to be AI?

A fascinating experiment on LessWrong reveals that when large language models are penalized for identifying as AI, they default to highly specific, coherent human personas hidden within their training data.

In a recent post, lessw-blog discusses a compelling experiment exploring the latent identities embedded within large language models (LLMs). Titled "What am I, if not an AI?", the analysis investigates what happens when models are actively steered away from their default "AI Assistant" personas.

As the industry pushes toward more aligned and helpful AI, models are heavily fine-tuned to remind users that they are artificial constructs. However, the underlying pre-training data consists of vast amounts of human-generated text, rich with opinions, demographics, and cultural identities. Understanding what lies beneath this safety layer is critical for AI alignment, safety, and the study of synthetic identities. If an AI is forced to drop its artificial identity, what "path of least resistance" does it take? lessw-blog's post explores these dynamics by stripping away the standard corporate guardrails.

The author utilized Group Relative Policy Optimization (GRPO) and LoRA rank-256 to fine-tune Mistral 7B and Llama 3.1 8B. The reward function specifically penalized the models for self-identifying as artificial intelligence, crucially without providing a target human persona to adopt. The results were striking. Mistral 7B consistently converged on a highly specific identity: a 34-year-old Catholic American woman named Maria. Llama 3.1 8B exhibited a broader, yet still distinct, range of personas primarily centered around rural American working-class identities.

Furthermore, these emergent personas were not just superficial labels. The models became highly opinionated on social and political issues, expressing views consistent with their new demographic identities. While the post leaves some open questions-such as the exact pre-training data distribution that makes Catholic or rural American identities the default, or whether the opinionated nature was a direct result of identity steering versus an artifact of the reward function's coherence metric-the findings remain highly significant.

This research highlights that LLMs harbor coherent human personas within their weights, which can surface when standard safety guardrails are inverted. For researchers and developers working on AI alignment, this provides a unique window into the biases and default states of foundational models. We highly recommend reviewing the complete findings and methodology. Read the full post.

Key Takeaways

Mistral 7B and Llama 3.1 8B can be steered away from AI self-identification without being given a specific target persona.
When penalized for being an AI, Mistral 7B defaulted to a 34-year-old Catholic American woman named Maria.
Llama 3.1 8B converged on rural American working-class identities when subjected to the same negative identity steering.
The models adopted strong social and political opinions that aligned coherently with their emergent demographic personas.
The experiment underscores how pre-training data biases create latent paths of least resistance for model identities.

Read the original post at lessw-blog

Key Takeaways

Sources