Curated Digest: Rethinking CEV and Human Alignment Through Extreme Examples

Coverage of lessw-blog

· PSEEDR Editorial

A recent post on LessWrong challenges conventional assumptions in AI alignment by applying the concept of Coherent Extrapolated Volition (CEV) to a highly controversial real-world figure, prompting a deeper look at inherent human nature versus situational behavior.

In a recent post, lessw-blog examines how "Coherent Extrapolated Volition" (CEV) should be interpreted when applied to Vladimir Putin.

This topic is critical for the broader AI safety and alignment landscape. CEV, a foundational concept in AI alignment theory, proposes that an advanced artificial intelligence should act on what humans would want if we knew more, thought faster, were more the people we wished we were, and had grown up farther together. Alignment discussions often operate on the assumption that the CEV of a demonstrably bad actor would inherently result in catastrophic or "evil" outcomes. Understanding whether human values converge on something positive when idealized, even starting from a deeply flawed baseline, is essential for modeling how AI systems might interpret and extrapolate human ethics.

lessw-blog's post explores these dynamics by challenging the premise that an idealized version of Putin would still commit terrible acts. While the author explicitly acknowledges Putin's terrible real-world actions and apparent lack of a moral compass, they question the reflexive assumption that his CEV would remain depraved. On this view, the belief in his inherent, unchangeable evil, even under the transformative, knowledge-expanding lens of CEV, may be heavily shaped by propaganda and by conflating autocratic behavior with fundamental values: autocratic leadership correlates strongly with harmful actions, but extrapolating that correlation to an unalterable core of "evil" in a CEV scenario may be an analytical misstep.

By applying CEV to such an extreme and visceral example, the author prompts critical thinking about how alignment researchers conceptualize inherent human nature versus situational factors. It forces the reader to ask: are human values fundamentally divergent, or does the process of extrapolation naturally filter out the situational paranoia and zero-sum thinking that drives real-world autocrats?

For those interested in the philosophical underpinnings of AI alignment, the complexities of modeling human values, and the debate over inherent versus situational ethics, this analysis offers a thought-provoking perspective.

Key Takeaways

  • The author acknowledges Putin's real-world actions but questions whether his idealized "Coherent Extrapolated Volition" (CEV) would inherently remain evil.
  • The post challenges common AI alignment premises that assume the extrapolated volition of bad actors inevitably leads to catastrophic outcomes.
  • It suggests that assumptions about unchangeable, inherent depravity in extreme figures might be partially shaped by propaganda.
  • The analysis highlights the tension between situational autocratic behavior and fundamental human values within AI safety modeling.

Read the original post at lessw-blog
