# The Epistemological Schism in AI Safety: Why Agent Foundations Rejects Empirical Grounding

> A historical allegory highlights the methodological divide between prosaic alignment engineering and a priori mathematical theories of agency.

**Published:** June 14, 2026
**Author:** PSEEDR Editorial
**Category:** risk
**Content tier:** free
**Accessible for free:** true
**Editorial format:** analysis
**News quality eligible:** true
**Source count:** 1
**Word count:** 1151


**Tags:** AI Safety, Agent Foundations, Machine Learning, Epistemology, Prosaic Alignment

**Canonical URL:** https://pseedr.com/risk/the-epistemological-schism-in-ai-safety-why-agent-foundations-rejects-empirical-

---

The debate over how to align artificial superintelligence has fractured into a deep epistemological schism between empirical engineering and abstract mathematics. In a recent LessWrong essay titled ["Attack of the Killer Differential Equations"](https://www.lesswrong.com/posts/3QbyzrkqfoZwvSegu/attack-of-the-killer-differential-equations), the author argues that demanding empirical grounding for Agent Foundations research is a fundamental category error. For PSEEDR, this methodological divide highlights a critical tension in AI safety: whether iterative progress on current large language models will generalize to superintelligence, or whether it is a dangerous distraction from discovering the fundamental laws of agency.

## The Limits of the "Rocket Science" Analogy

The AI safety community frequently relies on the analogy that "alignment is like rocket science." The comparison is useful for illustrating the "one-shot" nature of the challenge: a rocket either reaches its destination or fails catastrophically, much like a superintelligent system will either be aligned or pose an existential threat, offering no opportunity to learn from a critical failure and try again. However, the LessWrong post argues that this analogy breaks down when attempting to explain the methodological obstinacy of Agent Foundations (AF) researchers.

If alignment were purely analogous to rocket science, the logical path forward would be to study the components of early rockets. In the context of AI, this translates to prosaic alignment: studying the empirical behavior of current large language models (LLMs) through techniques like mechanistic interpretability or reinforcement learning from human feedback (RLHF). Yet, AF researchers intentionally avoid this empirical work. The author illustrates this disconnect by imagining a scenario where scholars attempt to stop an incoming asteroid by studying the aerodynamics of cannonballs, while a lone researcher insists on discovering the abstract mathematical laws of "Throwability." While studying cannonballs provides empirical data, it does not necessarily yield the fundamental physical laws required to solve a novel, out-of-distribution problem.

## The Calculus Allegory and the "Cooling Coffee" Problem

To better articulate the Agent Foundations perspective, the author constructs a historical allegory set in 1666. Isaac Newton is conceptualizing a vague, abstract mathematical framework he calls "fluxions" (differential equations). He observes that fluxions govern phenomena like cooling coffee and swinging pendulums. Believing that a specific, unknown type of fluxion poses an imminent threat to England, Newton isolates himself to develop a general theory of fluxions.

Meanwhile, an entire industry of "interpretability researchers" emerges, dedicated to studying the specific properties of cooling coffee. When these empirical researchers demand that Newton ground his abstract theories in coffee data, he refuses. His reasoning forms the core thesis of the Agent Foundations approach: coffee might represent an extremely narrow and idiosyncratic region of "fluxion space."
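
To make the "narrow region" claim concrete, the two phenomena named in the allegory can be written in modern notation. This is an illustrative sketch and not something taken from the essay: the symbols ($T$, $T_{\text{env}}$, $T_0$, $k$, $\theta$, $g$, $\ell$) are standard textbook choices introduced here as assumptions.

```latex
% Newton's law of cooling: first-order, linear
\frac{dT}{dt} = -k\,\bigl(T - T_{\text{env}}\bigr),
\qquad T(t) = T_{\text{env}} + \bigl(T_0 - T_{\text{env}}\bigr)\,e^{-kt}

% Simple pendulum: second-order, nonlinear
\frac{d^2\theta}{dt^2} + \frac{g}{\ell}\,\sin\theta = 0

% Both are instances of the general form
\frac{d\mathbf{x}}{dt} = f(\mathbf{x}, t)
```

Cooling is a first-order linear equation with a closed-form exponential solution, while the pendulum is second-order and nonlinear. Measuring the cooling constant $k$ ever more precisely says nothing about the second equation, which is roughly the sense in which coffee could occupy a small corner of the space of equations of the form $d\mathbf{x}/dt = f(\mathbf{x}, t)$.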

In this allegory, current LLMs are the cooling coffee. They exhibit forms of agency and intelligence, but they are constrained by human training data, specific neural architectures, and current optimization algorithms. AF researchers argue that the broader "agency space" is vast. A future artificial superintelligence might operate on principles entirely alien to current transformer models. Therefore, studying the specific, idiosyncratic agency of an LLM might yield zero insights into the general laws of intelligence. From this viewpoint, demanding that abstract mathematical research be empirically grounded in today's models is a meaningless request.

## Implications for the AI Safety Ecosystem

This allegory exposes a fundamental epistemological rift within the AI safety ecosystem. On one side are the pragmatic, empirical engineering labs, such as Anthropic, OpenAI, and Google DeepMind. These organizations operate under the assumption that intelligence is an emergent property of scale and compute, and that safety can be engineered iteratively. Their methods (red-teaming, constitutional AI, and scalable oversight) rely heavily on inductive reasoning drawn from the behavior of current systems.

On the other side are organizations like the Machine Intelligence Research Institute (MIRI) and independent researchers focused on Agent Foundations. Their work involves highly abstract, a priori mathematical frameworks, such as logical induction, decision theory, and infra-Bayesianism. They operate under the assumption that superintelligence will require a paradigm shift, and that safety work tuned to empirical data from current models risks converging on a deceptive local optimum.

The implications of this divide are significant for resource allocation and policy. If the empirical camp is correct, then massive investments in mechanistic interpretability and model evaluations are the optimal path to safety. If the AF camp is correct, the industry is currently wasting critical time optimizing the "cooling coffee" while the fundamental mathematics of agency remain unsolved. The AF approach represents a high-risk, high-reward wager: its proponents acknowledge that they will likely make no legible progress, but maintain that pursuing a general theory is the only sound strategy against a one-shot existential risk.

## Limitations and Open Questions in the A Priori Approach

While the LessWrong post effectively explains the rationale behind Agent Foundations, it also inadvertently highlights the severe limitations of the approach. The most glaring open question is the bridge between abstract mathematical theory and practical AI architectures. Even if researchers successfully formalize a general theory of agency, there is no guaranteed mechanism to translate that theory into the weights and biases of a functioning, competitive AI system.

Furthermore, the allegory assumes that a general theory of "fluxions" actually exists. In the real world, calculus was successfully formalized. In the realm of artificial intelligence, there is no guarantee that a unified, elegant mathematical theory of agency exists. Intelligence may not be governed by clean differential equations; it might be a messy, highly contingent collection of heuristics. That possibility echoes Rich Sutton's essay "The Bitter Lesson," which argues that general methods that scale with computation have historically outperformed approaches built on hand-crafted human knowledge. If intelligence is fundamentally messy rather than mathematically pure, the a priori approach of Agent Foundations may be a decades-long pursuit of a phantom framework.

The author concedes this limitation, noting that in some hypothetical universes, the vague concept of fluxions points toward nothing, or there is no general theory to be found. The justification for continuing the research is not the certainty of success, but the severity of the threat and the perceived inadequacy of the alternatives.

The tension between prosaic alignment and Agent Foundations is not merely a disagreement over tactics, but a profound divergence in how the AI safety community believes knowledge about intelligence is acquired. The demand for empirical grounding will continue to dominate industry labs driven by product cycles and legible benchmarks. However, understanding the AF perspective requires recognizing that for some researchers, the current generation of AI models is simply a distraction from the fundamental mathematics required to survive the transition to superintelligence.

### Key Takeaways

*   The demand for empirical grounding in Agent Foundations research is criticized as a category error, akin to demanding that Isaac Newton ground calculus in the study of cooling coffee.
*   Current large language models may represent an extremely narrow, idiosyncratic region of 'agency space,' making generalizations from them potentially useless for superintelligent systems.
*   The AI safety ecosystem is fractured between empirical engineering labs (e.g., Anthropic, OpenAI) and a priori mathematical researchers (e.g., MIRI).
*   A critical limitation of the Agent Foundations approach is the unproven assumption that a unified, formalizable mathematical theory of general agency actually exists.

---

## Sources

- https://www.lesswrong.com/posts/3QbyzrkqfoZwvSegu/attack-of-the-killer-differential-equations
