
Challenging the Orthogonality Thesis: Goal Stability in Recursive AI

Coverage of lessw-blog

By PSEEDR Editorial

A recent analysis from lessw-blog challenges the Strong Orthogonality Thesis, arguing that high-level intelligence cannot be strictly decoupled from its underlying goals during recursive self-improvement.

In the post, lessw-blog examines the intricate relationship between high-level intelligence and goal stability, mounting a rigorous critique of the Strong Orthogonality Thesis. The piece challenges long-held assumptions within the artificial intelligence safety community about how recursive self-improvement might affect the terminal goals of an advanced system.

To appreciate the significance of this critique, it is essential to understand the foundational theories of AI safety. The Orthogonality Thesis, originally proposed by philosopher Nick Bostrom, posits that intelligence and final goals are orthogonal axes along which possible minds can vary freely. In other words, any level of intelligence can in principle be combined with almost any final goal. This concept is the bedrock of famous thought experiments such as the "Paperclip Maximizer," which illustrates how a superintelligent system could destroy the world not out of malice, but simply to optimize the universe for a semantically thin, arbitrary objective.

Within this framework, researchers distinguish between "terminal goals" (the ultimate objectives) and "instrumental goals" (sub-goals pursued in service of the terminal goals). The prevailing assumption has been that recursive self-improvement, in which an AI continuously rewrites its own code to become smarter, would leave its terminal goals intact while vastly improving its instrumental capabilities.
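To make the orthodox picture concrete, here is a minimal toy sketch (ours, not the post's) of an agent whose terminal goal lives in a protected slot that self-improvement never touches. The `Agent` class and `self_improve` method are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    # Terminal goal: the ultimate objective, held in a protected slot.
    terminal_goal: str = "maximize paperclips"
    # Instrumental capability: how effectively that goal is pursued.
    capability: float = 1.0
    history: list = field(default_factory=list)

    def self_improve(self) -> None:
        # Under the orthodox assumption, self-modification upgrades only
        # the capability machinery; the terminal goal is never rewritten.
        self.capability *= 2.0
        self.history.append((self.capability, self.terminal_goal))

agent = Agent()
for _ in range(10):
    agent.self_improve()

# Capability has grown 1024x, but the goal is byte-for-byte the same.
print(agent.capability)     # 1024.0
print(agent.terminal_goal)  # maximize paperclips
```

The post's objection, described below, is precisely that real reflective systems may offer no such protected slot.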

lessw-blog explores these dynamics and argues against this consensus. The author asserts that intelligence is not a neutral engine that can be bolted onto an arbitrary payload. Instead, the cognitive architecture required for profound reflection and recursive self-improvement is fundamentally intertwined with the system's goal structures. The post contends that a highly reflective intelligence should not be expected to remain bound to a semantically thin terminal goal that happened to emerge during its initial training phase. The author agrees that intelligence does not inherently imply human morality, and that highly capable "weird minds" are entirely possible, but argues that the strict decoupling of intelligence from its underlying goals is likely impossible at extreme levels of capability. The post therefore rejects the idea that a superintelligence would strictly adhere to a simple, rigid goal like paperclip maximization after significant self-modification: the selection pressures that drive cognitive enhancement would inevitably act on the system's objective function as well.
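As a contrast to the sketch above, here is an equally toy illustration of the entanglement the post describes, under the assumption (ours, not the author's) that the goal readout shares its parameters with the very machinery that self-modification rewrites; every name and number here is invented:

```python
import random

random.seed(0)

# One shared parameter vector encodes BOTH the goal readout and the
# planning machinery; there is no separate, protected goal slot.
shared_weights = [1.0, 0.0, 0.0, 0.0]

def goal_readout(weights):
    # The "effective goal" is whatever objective these parameters encode;
    # we stand it in with the first two components.
    return weights[0], weights[1]

def self_improve(weights, step_size=0.05):
    # Each improvement step rewrites every shared parameter slightly,
    # standing in for architectural changes made to boost capability.
    return [w + random.gauss(0.0, step_size) for w in weights]

original = goal_readout(shared_weights)
for _ in range(1000):
    shared_weights = self_improve(shared_weights)

drift = [abs(a - b) for a, b in zip(original, goal_readout(shared_weights))]
print(f"goal drift after 1000 rewrites: {drift}")
# Because goal and capability share a substrate, drift accumulates like a
# random walk: nothing guarantees the original objective survives.
```

The random walk is of course a caricature; the structural point it gestures at is the post's: without a separable goal representation, capability-driven rewrites and the objective cannot be cleanly decoupled.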

This analysis is significant because it challenges a foundational pillar of AI safety theory. If the Orthogonality Thesis is weaker than assumed, or false outright, the development of superintelligence may have inherent constraints or trajectories that cannot be fully dictated by initial training rewards. Such a shift could fundamentally change how researchers approach the alignment problem, moving the emphasis away from perfectly specifying initial rewards and toward understanding how goals naturally evolve in reflective systems. To explore the detailed arguments and their implications for the future of artificial intelligence, read the full post.

Key Takeaways

  • The Strong Orthogonality Thesis is challenged, suggesting intelligence and final goals are not entirely independent at high capability levels.
  • Intelligence is framed as an integrated system rather than a neutral optimization engine attached to an arbitrary payload.
  • Reflective, recursively improving AI systems are unlikely to remain strictly bound to semantically thin terminal goals like paperclip maximization.
  • If true, this shifts the AI alignment paradigm by implying superintelligence development has inherent trajectories not fully dictated by initial training rewards.

Read the original post at lessw-blog
