Addressing Decision Theory's Simulation Problem
Coverage of lessw-blog
A recent analysis by lessw-blog highlights a critical paradox in AI alignment: why logically omniscient agents fail to reason about hypothetical events they know they will prevent.
In a recent post, lessw-blog explores a fundamental paradox in decision theory concerning how idealized AI agents reason about hypothetical scenarios. The post, titled "Addressing Decision Theory's Simulation Problem," investigates why logically omniscient agents might fail at tasks that humans find intuitive, such as understanding why they should not touch a hot stove.
The Context
The field of AI alignment often relies on theoretical models of agency to predict how advanced systems will behave. A recurring challenge in this domain is the "embedded agency" problem: how an agent can reason about a world that contains the agent itself. Standard decision theory assumes agents can evaluate counterfactuals (what happens if I do X vs. Y). However, for an agent that is "logically omniscient" (knowing all logical implications of its current state), a paradox arises: if the agent knows it won't take an action, it assigns that action a probability of zero. Consequently, it cannot meaningfully simulate the consequences of that action because, in formal logic, anything follows from a contradiction. This theoretical blind spot suggests that highly capable systems might fail to avoid catastrophic outcomes simply because they cannot process the "what if" scenarios required to identify danger.
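To make the conditioning issue concrete, here is a minimal Python sketch. It is our own toy construction, not anything from the post: an agent that evaluates counterfactuals as conditional expectations over its own predicted behavior simply has no defined answer for an action it assigns probability zero.

```python
# Toy illustration (not from the post): conditioning on an action the agent
# assigns probability zero to leaves the counterfactual undefined.

def conditional_expected_value(world, action):
    """E[payoff | action] under the agent's own joint distribution."""
    p_action = sum(p for (a, _payoff), p in world.items() if a == action)
    if p_action == 0:
        # P(action) = 0: the conditional is undefined, so the agent has no
        # principled way to evaluate "what would happen if I did this".
        return None
    return sum(p * payoff for (a, payoff), p in world.items() if a == action) / p_action

# The agent is certain it will not touch the stove, so that branch gets zero mass.
world = {
    ("dont_touch_stove", 0): 1.0,    # no burn
    ("touch_stove", -100): 0.0,      # severe burn, but probability zero
}

print(conditional_expected_value(world, "dont_touch_stove"))  # 0.0
print(conditional_expected_value(world, "touch_stove"))       # None: cannot be evaluated
```

The zero-probability guard is exactly where the paradox lives: the formalism offers no principled value to return there, so any choice (skip, default, or error) is arbitrary.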
The Gist
The post highlights the absurdity that arises when a formal decision procedure encounters these zero-probability events. The author illustrates this with a scenario in which a money-maximizing agent chooses $5 over $10 simply because its decision procedure processes the $5 option first and cannot logically evaluate the counterfactual of the $10 option once it has already determined not to take it. This contrasts sharply with human cognition: humans constantly simulate low-probability or "impossible" events, such as touching a hot stove, precisely to reinforce the decision not to do them. The author argues that the inability of idealized agents to run these simulations represents a critical failure mode in current decision theory frameworks.
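The same mechanism produces the $5-versus-$10 absurdity. The sketch below is a deliberately simplified toy of our own; the post's argument is framed in terms of logical omniscience rather than this particular code. An agent whose self-model already predicts "I take the $5" cannot evaluate the $10 branch at all, so it settles for the smaller payoff.

```python
# Toy version (our construction, not the post's formalism) of the "5-and-10"
# failure: an agent that evaluates actions by conditioning on its own
# self-prediction can never assess an action it predicts it won't take.

def choose(options, self_prediction):
    """Pick the best option whose counterfactual can actually be evaluated."""
    best_action, best_value = None, float("-inf")
    for action, payoff in options:
        if self_prediction.get(action, 0.0) == 0.0:
            # The agent is certain it won't do this, so the conditional
            # payoff is undefined and the option is silently skipped.
            continue
        if payoff > best_value:
            best_action, best_value = action, payoff
    return best_action

options = [("take_5", 5), ("take_10", 10)]

# The agent's self-model already says "I take the $5" (e.g. because that
# branch was settled first), so the $10 branch gets zero probability.
self_prediction = {"take_5": 1.0, "take_10": 0.0}

print(choose(options, self_prediction))  # take_5 -- the prediction is self-fulfilling
```

The prediction is self-fulfilling: because the agent is certain it will take the $5, the $10 branch never becomes evaluable, and the certainty is confirmed by its own choice.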
This analysis is significant for AI safety because it addresses a fundamental flaw in how idealized AI agents might reason about consequences. If AI systems cannot properly simulate and avoid hypothetical negative outcomes, even ones they deem highly improbable, they could exhibit unintended behaviors. We encourage readers interested in the intersection of logic, probability, and AI safety to review the full text.
Read the full post on LessWrong
Key Takeaways
- Logically omniscient agents struggle to reason about actions they know they will not take, assigning them zero probability.
- This limitation can lead to absurd choices, such as a maximizer choosing a smaller reward due to processing order.
- Human cognition relies on simulating hypothetical negative outcomes (like touching a hot stove) to avoid them, a capability currently lacking in these idealized models.
- Solving this "simulation problem" is essential for ensuring AI systems can robustly avoid harmful behaviors.