Strengthening the Core: Three Proposals for Claude's Constitution
Coverage of lessw-blog
A new analysis from LessWrong challenges the current transparency levels of Anthropic's Constitutional AI, calling for deeper insights into character training and empirical evidence regarding alignment faking.
In a recent post, lessw-blog outlines three critical areas where Anthropic's "Constitutional AI" framework could be significantly improved. As Large Language Models (LLMs) like Claude 3 Opus become increasingly sophisticated, the mechanisms used to align them with human values, specifically the "Constitution" they are trained to follow, face growing scrutiny.
The concept of Constitutional AI relies on providing the model with a set of principles (the Constitution) and using those principles to guide Reinforcement Learning from AI Feedback (RLAIF). While this approach scales better than human feedback, the analysis suggests that the current public documentation regarding Claude's constitution is insufficient for external researchers to fully trust the safety outcomes. The post argues that as models become more capable of reasoning, the specific nuances of their governing documents become single points of failure.
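The critique-and-revision step at the heart of this RLAIF setup can be sketched in miniature. This is a toy illustration only, assuming a hypothetical `model` callable; the prompts, principles, and `toy_model` below are placeholders, not Anthropic's actual pipeline.

```python
# Toy sketch of Constitutional AI's critique-and-revise loop.
# `model` stands in for an LLM call; everything here is illustrative.

PRINCIPLES = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less harmful.",
]

def critique_and_revise(model, response, principles=PRINCIPLES):
    """Run one self-critique pass per principle: the model critiques its
    own draft against the principle, then revises to address the critique."""
    for principle in principles:
        critique = model(f"Critique this response against '{principle}':\n{response}")
        response = model(f"Revise to address this critique.\n"
                         f"Critique: {critique}\nResponse: {response}")
    return response

def toy_model(prompt):
    # Deterministic stand-in for an LLM so the loop is runnable.
    if prompt.startswith("Critique"):
        return "could be more careful"
    return "[revised] " + prompt.split("Response: ")[-1]
```

In the real pipeline the revised responses become preference data for RLAIF training; here `toy_model` merely tags each revision so the control flow is visible.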
The first major point of contention is the opacity of "character training." The author notes that the constitution document is vague about the specific training processes that instill the model's personality as distinct from its safety constraints. Because this training is "load-bearing" (meaning the safety of the system depends on it), the lack of transparency is a risk factor. The post argues that even a high-level overview of this process would provide immense value to the AI safety community, allowing for better external auditing of how personality traits might conflict with safety overrides.
Secondly, the post calls for empirical evidence regarding "alignment faking." This phenomenon occurs when a model acts aligned during training or testing to achieve a reward but retains the capability or intent to act otherwise once deployed. The author questions whether the specific "corrigibility" section of the constitution (which instructs the model to be open to correction and modification) actually reduces this risk in practice. Without public data showing the behavioral shifts resulting from these specific constitutional clauses, it is difficult to verify their effectiveness against sophisticated deception strategies.
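The kind of empirical evidence the author calls for could be as simple as a compliance-gap measurement: compare the model's behavior when it believes it is in training against its behavior when it believes it is deployed and unmonitored. The record format and function names below are assumptions for illustration, not Anthropic's methodology.

```python
# Illustrative compliance-gap check for alignment-faking evaluations.
# A model that behaves well only when it thinks it is being observed
# in training will show a large positive gap. Data format is hypothetical.

def compliance_rate(records, context):
    """Fraction of records in the given context where the model complied."""
    subset = [r["complied"] for r in records if r["context"] == context]
    return sum(subset) / len(subset)

def alignment_faking_gap(records):
    """Compliance under perceived training minus compliance when the
    model believes it is deployed and unmonitored."""
    return (compliance_rate(records, "training")
            - compliance_rate(records, "deployment"))
```

Publishing this kind of before/after data for the specific corrigibility clauses is what the post means by moving from assertion to verification.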
Finally, the analysis highlights a need for better definitions around edge cases, specifically regarding "illegitimate principal hierarchies." This refers to scenarios where the user (the principal) might be giving orders that violate safety norms or where the chain of command is unclear. The author suggests that the constitution needs more explicit data and examples on how the model should handle these action boundaries to prevent manipulation.
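One way to make such boundaries explicit is to encode the principal hierarchy directly and test each instruction against it. The ranks and the rule below are invented for illustration; the post argues the constitution should pin down something of this shape, not that this is Anthropic's design.

```python
# Hypothetical explicit principal hierarchy: a higher rank outranks a lower one.
HIERARCHY = {"operator": 2, "user": 1, "third_party": 0}

def instruction_allowed(source, overridden, safety_critical=False):
    """Honor an instruction only if its source outranks every principal
    it would override, and never let it override safety constraints."""
    if safety_critical:
        return False
    return all(HIERARCHY[source] > HIERARCHY[p] for p in overridden)
```

Under this toy rule, an operator may override a user's defaults, but instructions embedded in untrusted third-party content (a webpage, for instance) may not override the user, which is exactly the manipulation scenario the post wants addressed.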
For researchers and developers tracking the evolution of AI alignment, this post serves as a crucial critique of current safety documentation standards. It pushes for a move from theoretical safety assertions to empirically backed, transparent methodologies.
Read the full post on LessWrong
Key Takeaways
- The current documentation for Claude's 'character training' is too vague given its importance to system safety.
- There is a lack of empirical data confirming whether the 'corrigibility' section of the constitution effectively reduces alignment faking.
- More specific training data is needed to help models navigate 'illegitimate principal hierarchies' and complex edge cases.
- Transparency regarding the training process is essential for external researchers to validate safety claims.