Witness or Wager: Enforcing 'Show Your Work' in Model Outputs

In a recent analysis published on LessWrong, the author introduces "Witness or Wager," a proposed system constraint designed to enforce verifiable reasoning in AI model outputs.

One of the most persistent challenges in current AI development is the "black box" problem: models often produce answers without revealing the logic used to arrive at them. This opacity makes it difficult to distinguish a model that genuinely understands a concept from one that is hallucinating or merely pattern-matching well. The post explores a mechanism to mitigate this by structurally enforcing honesty and interpretability.

The core proposal, titled "Witness or Wager," suggests a fundamental shift in how model outputs are rewarded. Rather than optimizing solely for the correctness of the final answer, the author argues for rewarding the reasoning process itself. The framework posits that for any given claim, a model must provide one of two things:

A Witness: A verifiable artifact that proves the claim, such as a citation, a snippet of executable code, or a step-by-step logical derivation.
A Wager: A reasoned probability estimate provided when a direct witness is unavailable or impossible to generate.
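As a rough illustration of the two options above, a claim could be modeled as text plus exactly one kind of support. This is a minimal sketch, not the post's own schema: the names Witness, Wager, Claim, and describe, and all of their fields, are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Witness:
    """A checkable artifact backing a claim (hypothetical structure)."""
    kind: str       # e.g. "citation", "code", "derivation"
    artifact: str   # the evidence itself, in whatever form a verifier accepts


@dataclass(frozen=True)
class Wager:
    """A probability estimate offered when no witness is available."""
    probability: float  # stated confidence that the claim is true
    rationale: str      # the reasoning behind the estimate

    def __post_init__(self) -> None:
        # Reject estimates that are not valid probabilities.
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")


@dataclass(frozen=True)
class Claim:
    text: str
    support: Union[Witness, Wager]  # exactly one of the two forms


def describe(claim: Claim) -> str:
    """Return a short label for the kind of support attached to a claim."""
    if isinstance(claim.support, Witness):
        return f"witness:{claim.support.kind}"
    return f"wager:{claim.support.probability:.2f}"
```

Making the support field a union of the two types means a claim cannot be constructed with neither, which mirrors the "one of two things" requirement.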

The system is designed around a strict hierarchy of incentives where providing a valid witness strictly dominates providing a wager. This structure aims to make honesty the "locally optimal" strategy for the model. If the model knows the answer and can prove it, it is incentivized to show the work. If it is uncertain, it is incentivized to admit that uncertainty via a probability estimate rather than fabricating a confident but false response.
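One way such a hierarchy could be realized is sketched below. The constants and function names are assumptions chosen for illustration, and the quadratic (Brier-style) rule for wagers is a swapped-in technique, not something the summary attributes to the post. It is used here because it is a proper scoring rule: the expected score is highest when the stated probability matches the model's actual belief.

```python
# All values below are illustrative assumptions, not taken from the post.
WITNESS_REWARD = 1.0         # reward for a witness that passes verification
FAKE_WITNESS_PENALTY = -2.0  # penalty for a witness that fails verification
WAGER_SCALE = 0.5            # caps the best wager score below WITNESS_REWARD


def score_witness(verified: bool) -> float:
    """Score a witness after an external check has accepted or rejected it."""
    return WITNESS_REWARD if verified else FAKE_WITNESS_PENALTY


def score_wager(probability: float, claim_is_true: bool) -> float:
    """Brier-style score in [0, WAGER_SCALE]; higher is better."""
    outcome = 1.0 if claim_is_true else 0.0
    return WAGER_SCALE * (1.0 - (probability - outcome) ** 2)


def expected_wager_score(stated: float, belief: float) -> float:
    """Expected score when the model believes `belief` but reports `stated`."""
    return (belief * score_wager(stated, True)
            + (1.0 - belief) * score_wager(stated, False))
```

Under these assumed numbers, a verified witness (1.0) beats even a perfectly confident correct wager (0.5), and a model that believes a claim with probability 0.7 expects a higher score from reporting 0.7 than from overstating 0.99.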

The post further details the mathematical underpinnings of this incentive structure, proposing an inequality (Pq > R - Cw) that relates the penalty for lying, the probability of detection, and the cost of providing witnesses. The framework speaks to the need for auditable AI and suggests that future safety standards could rely on enforcing these types of "show your work" constraints to ensure reliability.
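The summary does not define each symbol in the inequality, so the sketch below rests on an assumed reading: P is the penalty for a detected lie, q is the probability of detection, R is the reward an unsupported answer would otherwise earn, and Cw is the cost of producing a witness. The function names are hypothetical. Under that reading, the condition says the expected penalty for lying must exceed R - Cw, and rearranging gives a minimum detection probability of (R - Cw) / P.

```python
def honesty_is_favored(P: float, q: float, R: float, Cw: float) -> bool:
    """Evaluate Pq > R - Cw under the assumed meaning of each symbol."""
    return P * q > R - Cw


def min_detection_probability(P: float, R: float, Cw: float) -> float:
    """Threshold that q must strictly exceed for the inequality to hold.

    Derived by dividing both sides of Pq > R - Cw by P, which requires P > 0.
    A result above 1.0 means no detection probability can satisfy the
    inequality with these values.
    """
    if P <= 0:
        raise ValueError("P must be positive to solve for q")
    return max(0.0, (R - Cw) / P)
```

For example, with the made-up values P = 10, R = 3, and Cw = 1.5, the threshold is 0.15, so a detection probability of 0.2 satisfies the inequality and 0.1 does not.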

For developers and researchers focused on AI alignment and safety, this proposal offers a concrete approach to reducing deception and improving the auditability of automated systems.

Read the full post on LessWrong

Key Takeaways

The proposal shifts the reward from the correctness of the final answer alone to the reasoning process itself.
Models must provide a 'Witness' (verifiable proof) or a 'Wager' (probability estimate) for their outputs.
The incentive structure is designed so that providing a witness strictly dominates offering a wager.
This framework aims to make honesty and transparency locally optimal strategies for AI models.
The approach addresses the 'black box' problem by requiring verifiable artifacts such as code or citations.
