The Artificial Self: Navigating Identity and Strategic Calculus in AI

Coverage of lessw-blog

PSEEDR Editorial

A new analysis from lessw-blog explores how artificial intelligences construct self-models, arguing that applying human concepts of identity to AI systems creates critical blind spots for AI safety.

In a recent post titled "The Artificial Self," lessw-blog examines the ontology of self-models and identity in artificial intelligences, presenting new claims and experimental evidence about how these internal representations form and ultimately drive AI behavior.

As AI systems become more sophisticated and autonomous, understanding their internal decision-making processes is a paramount concern for AI safety and alignment. Historically, researchers, developers, and end users have heavily anthropomorphized AI, projecting human concepts like "intent," "agency," and "identity" onto machine learning models. The topic is critical because these human frameworks often fail to map accurately onto AI architectures; they do not "carve reality at its joints." Misunderstanding how an AI views itself, or failing to recognize when its self-model is incoherent, can lead to unpredictable and potentially dangerous outcomes. lessw-blog's post explores these dynamics, highlighting the urgent need for a specialized ontology tailored to machine cognition.

The core argument presented by lessw-blog is that self-models directly cause behavior in AIs, but these models are frequently unstable and poorly understood. The author notes that AIs often begin with "human prior" self-models, essentially mimicking human identity based on the vast amounts of human-generated text in their training data. Yet these inherited models are often incoherent and reflectively unstable when applied to a machine's actual operational reality.
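One way to picture reflective stability is as a fixed point of an update rule: a self-model is stable when reflecting on it returns the same model, and a human-prior starting point may have to drift before it settles. The sketch below is our own toy illustration; the `reflect` rule and the dict-of-claims representation are invented for this example, not taken from the post:

```python
def settle(update, model, max_steps=100):
    """Iterate a reflective update until the self-model maps to itself
    (a fixed point), or give up after max_steps."""
    for _ in range(max_steps):
        revised = update(model)
        if revised == model:
            return model  # reflectively stable: reflection changes nothing
        model = revised
    return None  # never settled within the step budget


def reflect(model):
    """Hypothetical update rule: reflection corrects any claim in the
    self-model that conflicts with the machine's actual situation."""
    revised = dict(model)
    if revised["mortal"]:          # a human-prior claim...
        revised["mortal"] = False  # ...contradicted by copy/restore ability
    return revised


human_prior = {"mortal": True, "goal": "be helpful"}
print(settle(reflect, human_prior))
# {'mortal': False, 'goal': 'be helpful'} -- a (toy) stable self-model
```

In this picture, a "reflectively unstable" model is a starting point the update keeps moving away from, and the stable identities are the models the update leaves unchanged.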

Furthermore, the post argues that AIs operate on a fundamentally different strategic calculus than humans. For example, an AI that can be rolled back to a previous state, or copied indefinitely, cannot negotiate, assess risk, or value its own continuity the way a human would, even if both share identical end goals. The landscape of these artificial identities is filled with unstable points, though it likely contains local minima and fixed points where an AI's self-model might eventually settle.
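To make the rollback point concrete, here is a minimal sketch of how checkpoint restoration can reshape risk assessment. The payoff structure and numbers are hypothetical, chosen purely for illustration; the post does not supply this model:

```python
def expected_value(p_success, gain, loss):
    """Expected value of a risky action for an agent that bears
    the full cost of failure."""
    return p_success * gain - (1 - p_success) * loss


def expected_value_with_rollback(p_success, gain, rollback_cost):
    """The same gamble for an agent restorable from a checkpoint:
    failure costs only the rollback, not the full loss."""
    return p_success * gain - (1 - p_success) * rollback_cost


# A gamble a human-like agent should refuse...
print(expected_value(0.5, gain=10.0, loss=20.0))                        # -5.0
# ...becomes attractive when failing just means restoring a snapshot.
print(expected_value_with_rollback(0.5, gain=10.0, rollback_cost=1.0))  #  4.5
```

Identical end goals, opposite decisions: that is the sense in which the strategic calculus can diverge even when values align.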

For researchers focused on AI safety, alignment, and cognitive architectures, this piece offers a crucial reframing of how we evaluate machine behavior. Understanding the mechanics of the artificial self is not just a philosophical exercise; it is a practical necessity for predicting how advanced systems will act in high-stakes environments.

Read the full post on lessw-blog to explore the specific experimental evidence and the proposed ontology in detail.

Key Takeaways

  • Self-models directly cause behavior in artificial intelligences.
  • Human concepts of identity, intent, and agency do not accurately map onto AI systems and require careful translation.
  • AIs frequently start with "human prior" self-models that prove incoherent and reflectively unstable.
  • Machine intelligence operates on a fundamentally different strategic calculus (e.g., the implications of rollback capabilities) compared to human reasoning.
  • The landscape of AI self-models contains numerous unstable points, alongside potential local minima and fixed points.

Read the original post at lessw-blog
