A Critical Look at the "Evolution Argument" in AI Safety
Coverage of lessw-blog
In a recent analysis, a LessWrong contributor dismantles a prevalent analogy used to explain AI misalignment, arguing that relying on the comparison between human evolution and machine learning weakens the broader case for AI safety.
On LessWrong, lessw-blog published a critique titled "The Evolution Argument Sucks," which challenges one of the most frequently cited intuition pumps in AI alignment. The post dissects the "evolution argument," a standard line of reasoning used to explain why advanced AI systems might pursue goals that conflict with their creators' intentions.
The Context
Within AI safety circles, the evolution analogy is often used to illustrate the concept of "inner alignment" failure. The argument generally proceeds as follows: Evolution "programmed" humans with the objective of inclusive genetic fitness (gene replication). However, humans do not explicitly optimize for gene replication; instead, we optimize for proxy goals like food, connection, and status. Consequently, in the modern environment, human behavior often diverges from the original evolutionary goal (e.g., using contraception). Proponents of this argument suggest that AI training is analogous to evolution: developers specify a reward function (the "gene replication" equivalent), but the AI will likely learn internal proxy goals that, while useful during training, will diverge from the developers' intent in novel situations.
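To make the proxy-goal structure concrete, here is a minimal toy sketch in Python. It is an illustration of the general idea only, not code from the post, and every name and number in it is invented: a fixed policy that keys on a proxy signal (sweetness) scores well on the intended objective (calories) while the two are correlated, then degrades once the correlation breaks in a shifted environment.

```python
import random

# Toy illustration (not from the post): behaviour that keys on a proxy signal
# (sweetness) which tracks the intended objective (calories) in the "training"
# environment, and diverges from it once that correlation breaks.

random.seed(0)

def ancestral_item():
    # Training distribution: sweetness closely tracks caloric value.
    calories = random.uniform(0.0, 100.0)
    return {"sweetness": calories + random.gauss(0.0, 5.0), "calories": calories}

def modern_item():
    # Shifted distribution: artificially sweetened items break the correlation.
    if random.random() < 0.5:
        return {"sweetness": random.uniform(80.0, 100.0), "calories": 0.0}
    return ancestral_item()

def proxy_policy(options):
    # Stands in for a learned behaviour: always pick the sweetest option.
    return max(options, key=lambda item: item["sweetness"])

def mean_intended_reward(environment, episodes=2000, options_per_episode=5):
    total = 0.0
    for _ in range(episodes):
        chosen = proxy_policy([environment() for _ in range(options_per_episode)])
        total += chosen["calories"]  # score against the intended objective
    return total / episodes

print("ancestral environment:", round(mean_intended_reward(ancestral_item), 1))
print("modern environment:   ", round(mean_intended_reward(modern_item), 1))
```

The sketch only shows that proxy divergence is possible in a toy setting; as the critique discussed below emphasizes, that is a far weaker claim than a prediction about how real training runs will behave.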
The Signal
The author of the post contends that while the conclusion, that advanced AI is dangerous, may well be correct, this particular argument for reaching it is "fundamentally and irreparably flawed." The critique focuses on the epistemic dangers of relying on loose analogies. The fact that humans diverged from evolution's "goals," the author argues, does not by itself establish that gradient descent will produce agents that diverge from human-specified goals in the same way. The post uses a reductio ad absurdum involving war casualty predictions to demonstrate how selecting an inappropriate reference class for an analogy can yield convincing but ultimately baseless predictions.
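As a generic illustration of that epistemic point, consider the sketch below. It is synthetic and deliberately mundane, not the post's own war casualty reductio, and every number in it is invented: the same naive procedure of "average the past cases the new thing resembles" produces estimates that differ by orders of magnitude depending on which resemblance the forecaster picks.

```python
# Synthetic illustration of reference-class sensitivity (all numbers invented;
# this is not the post's war casualty example). A naive analogical forecast,
# averaging the chosen reference class, swings by orders of magnitude with the
# choice of class, even though each choice can be narrated convincingly.

past_cases = [
    {"class": "weekend script", "effort_hours": 6},
    {"class": "weekend script", "effort_hours": 12},
    {"class": "typical app",    "effort_hours": 900},
    {"class": "typical app",    "effort_hours": 1_400},
    {"class": "large platform", "effort_hours": 45_000},
    {"class": "large platform", "effort_hours": 80_000},
]

def predict_by_analogy(cases, reference_class):
    """Naive analogical forecast: average the cases in the chosen class."""
    matched = [c["effort_hours"] for c in cases if c["class"] == reference_class]
    return sum(matched) / len(matched)

for cls in ("weekend script", "typical app", "large platform"):
    estimate = predict_by_analogy(past_cases, cls)
    print(f"reference class {cls!r:<18} -> predicted effort {estimate:>9,.0f} hours")
```

Each choice of class can be made to sound reasonable in prose, which is exactly what makes analogical arguments feel persuasive; the critique's demand is that representativeness be demonstrated rather than assumed.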
This analysis is significant because it calls for higher standards of rigor in safety arguments. By identifying and discarding weak analogies, the community can focus on more robust technical and theoretical reasons for expecting misalignment, rather than relying on a narrative that may not hold up to scrutiny.
Why It Matters
For researchers and policymakers, the post serves as a reminder to audit the mental models used to justify risk scenarios. If the "evolution argument" is indeed flawed, stakeholders must ensure their risk assessments are grounded in the technical realities of machine learning rather than in attractive but potentially misleading biological metaphors.
Read the full post on LessWrong
Key Takeaways
- The "evolution argument" suggests AI will diverge from human goals just as humans diverged from evolution's goal of gene replication.
- The author argues this analogy is structurally flawed and fails to provide a predictive basis for AI behavior.
- Critiquing this argument is intended to strengthen AI safety discourse by removing weak reasoning, not to dismiss the underlying risk.
- The post highlights the danger of arguing from analogy without establishing that the reference class is genuinely representative of the case being predicted.