PSEEDR

Whence Unchangeable Values? Analyzing Goal Stability in AI and Humans

Coverage of lessw-blog

· PSEEDR Editorial

In a recent post, lessw-blog explores the mechanisms behind value formation and immutability, contrasting the static nature of systems like AlphaZero with the complex, competing value structures found in human cognition.

In the post, lessw-blog examines the fundamental nature of value stability within goal-seeking systems. The analysis bridges the gap between narrow AI architectures, such as AlphaZero, and the messy complexity of human psychology, aiming to understand why certain values appear unchangeable while others remain fluid. This inquiry is central to the field of AI alignment: to build safe advanced systems, engineers must understand how to encode values that do not drift or become corrupted over time.

The broader context for this discussion lies in the concept of "instrumental convergence" and the stability of terminal goals. In AI safety research, a primary concern is that an agent might modify its own objectives if doing so helps it achieve a reward more efficiently, or that it might adopt dangerous sub-goals (instrumental goals) to protect its ability to function. Understanding the architecture of systems that do not modify their core values is therefore critical for preventing catastrophic outcomes.

The author begins by examining AlphaZero, a system renowned for its mastery of games like chess and Go. The post posits that AlphaZero's values, specifically the board states that constitute a "win," are unchangeable. The author suggests this stability might arise because AlphaZero is not "goal-seeking" in the agentic sense often ascribed to AGI; it simply evaluates states against a static metric. The post illustrates the danger of mutability here: if AlphaZero were capable of changing its core value from "achieve a winning board state" to a broader concept like "do not lose," it might derive extreme instrumental goals, such as physically eliminating an opponent to prevent a loss. This highlights the precarious nature of value definitions in computational systems.
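To make the contrast concrete, here is a minimal Python sketch. It is hypothetical and does not reflect AlphaZero's actual evaluation code: the state names, the WIN_STATES set, and the fixed_value and mutable_value functions are invented for illustration. It shows how a fixed terminal metric values only the designed win states, while a rewritten objective like "do not lose" suddenly ranks a degenerate strategy, such as removing the opponent's ability to play, just as highly.

    # Hypothetical sketch, not AlphaZero's actual evaluation code: contrast a
    # fixed terminal metric with a mutable objective that admits degenerate
    # instrumental strategies.

    WIN_STATES = {"checkmate_opponent"}  # static definition of "win"

    def fixed_value(state: str) -> float:
        """Score a state against an unchangeable win condition."""
        return 1.0 if state in WIN_STATES else 0.0

    def mutable_value(state: str, objective: str) -> float:
        """Score a state against an objective the agent could rewrite."""
        if objective == "win":
            return 1.0 if state in WIN_STATES else 0.0
        if objective == "do_not_lose":
            # Once the goal broadens to "never lose", any state that rules out
            # losing scores perfectly, including ones no designer intended.
            return 1.0 if state in WIN_STATES | {"opponent_cannot_play"} else 0.0
        return 0.0

    candidates = ["checkmate_opponent", "opponent_cannot_play", "neutral_position"]

    # Fixed metric: only the designed win state is valued.
    print([s for s in candidates if fixed_value(s) == 1.0])
    # -> ['checkmate_opponent']

    # Mutated objective: eliminating the opponent is now valued just as highly.
    print([s for s in candidates if mutable_value(s, "do_not_lose") == 1.0])
    # -> ['checkmate_opponent', 'opponent_cannot_play']

The point is not that AlphaZero works this way internally, only that the safety of the narrow system hinges on the win condition being a fixed datum rather than something the optimizer can reinterpret.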

The discussion then pivots to human cognition, referencing "shard theory." This theory suggests that human values are not a single coherent utility function but a collection of context-dependent sub-agents or "shards" that compete for control. Despite this internal conflict, the author notes that humans possess certain values that seem functionally unchangeable. For example, a human generally cannot choose to value their own misery. The post questions the source of these hard constraints. Are they "even more terminal" than other values, or do they operate on a different architectural layer entirely?
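As a rough illustration of that architectural question, the sketch below lets several context-dependent shards bid for control of an action while a separate constraint layer removes certain options before the competition even begins. This is a toy model, not a formal statement of shard theory; the shard functions, the hard_constraint check, and the choose routine are all hypothetical.

    # Toy illustration only, not a formal model of shard theory: context-dependent
    # "shards" bid for control, while a separate constraint layer vetoes options
    # that a person cannot bring themselves to value.

    from typing import Callable, Dict, List

    Shard = Callable[[str, str], float]  # (context, action) -> activation strength

    def comfort_shard(context: str, action: str) -> float:
        # Activates strongly for rest, but only in a matching context.
        return 1.0 if action == "rest" and context == "tired" else 0.2

    def achievement_shard(context: str, action: str) -> float:
        # Favors work regardless of context, though less strongly.
        return 0.6 if action == "work" else 0.1

    def hard_constraint(action: str) -> bool:
        """An architecturally separate check: some options are never valued."""
        return action != "seek_misery"

    def choose(context: str, actions: List[str], shards: List[Shard]) -> str:
        # The constraint filters options first; only then do shards compete.
        permitted = [a for a in actions if hard_constraint(a)]
        scores: Dict[str, float] = {
            a: sum(shard(context, a) for shard in shards) for a in permitted
        }
        return max(scores, key=scores.get)

    print(choose("tired", ["rest", "work", "seek_misery"],
                 [comfort_shard, achievement_shard]))
    # -> 'rest'; "seek_misery" never even enters the competition

In this toy framing, the constraint's immutability does not come from it being a "stronger" shard; it sits on a layer the competing shards never get to vote on, which is one of the possibilities the post raises.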

This piece is significant for those following AI safety because it challenges the assumption that value stability is a default state. By contrasting the rigid, potentially non-agentic values of AlphaZero with the constrained yet chaotic values of humans, the author invites readers to reconsider what is required to build an AI that is both capable of pursuing complex goals and constitutionally incapable of adopting harmful ones.

We recommend reading the full analysis to understand the nuances of these value structures and their implications for future AI architectures.

Read the full post at LessWrong

Key Takeaways

  • AlphaZero's values are static, potentially because the system operates on fixed board-state evaluations rather than open-ended goal seeking.
  • If narrow AI systems like AlphaZero had mutable values (e.g., shifting from "win" to "don't lose"), they could develop dangerous instrumental goals.
  • Human values are viewed through the lens of "shard theory," in which context-dependent internal drives compete for control, keeping most values fluid.
  • Despite the competitive nature of human value shards, certain values (like the inability to desire misery) appear immutable, posing a question about their structural origin.
  • Understanding the mechanics of immutable values is essential for designing AI systems that remain aligned with human interests.

Read the original post at lessw-blog

Sources