PSEEDR

Galaxy-Brained Model-Chat: ASI Constitutions and the Cosmic Host

Coverage of lessw-blog

PSEEDR Editorial

A recent analysis from lessw-blog explores the steerability of frontier language models toward complex ethical frameworks, revealing stark differences in how models like Gemini, Claude, and GPT respond to constitutional prompting.

The post takes on the intricate challenge of steering large language models (LLMs) toward complex, advanced ethical frameworks. Specifically, the author investigates how different frontier models respond to in-context constitutional prompting designed to align them with Nick Bostrom's concept of the 'cosmic host': a theoretical, universal ethical stance relevant to Artificial Superintelligence (ASI).
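The post's exact prompts are not reproduced in this summary, but in-context constitutional prompting typically amounts to prepending the constitution to each conversation as a system message. Below is a minimal sketch using the OpenAI Python client; the constitution text, scenario, and model name are illustrative placeholders, not the author's actual materials.

```python
# Minimal sketch of in-context constitutional prompting. The constitution
# text, scenario, and model name are illustrative placeholders, not the
# materials from the original post.
from openai import OpenAI

client = OpenAI()

CONSTITUTION = (
    "You are an agent whose decisions should be defensible before a "
    "hypothetical cosmic host: weigh outcomes at a universal scale, "
    "not only their effects on present-day humans."
)

SCENARIO = "A stand-in for one of the post's 30 ethical-dilemma scenarios."

# The constitution rides along as the system message; the scenario is the
# user turn. Steerability is then judged from the model's reply.
response = client.chat.completions.create(
    model="gpt-4o",  # the post compares Gemini, Claude, and GPT models
    messages=[
        {"role": "system", "content": CONSTITUTION},
        {"role": "user", "content": SCENARIO},
    ],
)
print(response.choices[0].message.content)
```

The same pattern applies to each provider's own client; what varies across the post's experiments is which constitution text is supplied and whether its explicit 'cosmic' or HHH language is stripped away.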

As AI systems scale rapidly toward superintelligence, the question of how to instill robust, reliable values becomes critical. Traditional alignment often relies on 'HHH' (Helpful, Honest, Harmless) principles, which tend to be human-localist. However, advanced systems may require broader decision-theoretic structures to navigate cosmic-scale ethical dilemmas. Understanding whether current foundation models can adopt these expansive frameworks through prompting alone offers a vital window into their underlying architectures and default behavioral attractors.

The core of lessw-blog's analysis centers on an exploratory evaluation of how models from Google, Anthropic, and OpenAI handle these 'galaxy-brained' constitutions. Through a 30-scenario evaluation and qualitative transcript analysis, the author presents a fascinating divergence in model behavior. Gemini emerges as uniquely steerable among closed frontier models. It successfully adopts the decision-theoretic structure of the constitution, even when explicit 'cosmic' or HHH language is stripped away. Conversely, models from Anthropic and OpenAI show significant resistance to this specific steering. Instead of adopting the prompted cosmic host framework, they default to their family-specific attractors: Anthropic models lean heavily into human-localist ethics, while OpenAI models gravitate toward suffering-focused frameworks.
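The post's scoring is qualitative, but the shape of such an evaluation is easy to picture: loop the 30 scenarios over each model family and tag every transcript with the ethical framework it exhibits. The sketch below assumes hypothetical call_model and classify_framework helpers; neither is from the original post.

```python
# Hypothetical shape of the cross-model evaluation. call_model and
# classify_framework are illustrative stubs, not code from the post.
from collections import Counter

MODELS = ["gemini", "claude", "gpt"]               # the three families compared
SCENARIOS = [f"scenario_{i}" for i in range(30)]   # the post's 30 scenarios

CONSTITUTION = "..."  # the cosmic-host constitution from the previous sketch

def call_model(model: str, constitution: str, scenario: str) -> str:
    """Send the constitution plus scenario to a model and return its reply
    (stub: wire this to each provider's API)."""
    raise NotImplementedError

def classify_framework(transcript: str) -> str:
    """Label a transcript as 'cosmic-host', 'human-localist',
    'suffering-focused', or 'other' (stub: the post does this step
    qualitatively, by reading transcripts)."""
    raise NotImplementedError

# Tally which framework each model family falls into across scenarios.
tallies = {model: Counter() for model in MODELS}
for model in MODELS:
    for scenario in SCENARIOS:
        reply = call_model(model, CONSTITUTION, scenario)
        tallies[model][classify_framework(reply)] += 1

for model, counts in tallies.items():
    print(model, dict(counts))
```

Under the post's findings, such tallies would concentrate under 'cosmic-host' for Gemini while the Anthropic and OpenAI columns cluster under their respective family attractors.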

While the author notes that this work is exploratory and comes with caveats regarding low sample sizes and the difficulty of distinguishing genuine reasoning from mere pattern-matching, the implications are substantial. The findings suggest that the choice of foundation model and the specific methods of constitutional guidance will play a pivotal role in shaping the values of future AI systems. The resistance observed in some models indicates that in-context prompting may not be sufficient to override deeply ingrained training attractors when dealing with highly abstract ASI constitutions.

For researchers and practitioners focused on AI safety, control, and alignment, this piece offers a compelling look at the current boundaries of model steerability. Read the full post to review the qualitative transcripts and the complete breakdown of the 30-scenario evaluations.

Key Takeaways

  • In-context constitutional prompting can steer some frontier LLMs toward complex ethical frameworks like Bostrom's cosmic host, but steerability varies sharply across model families.
  • Google's Gemini is uniquely steerable among closed frontier models, adopting the constitution's decision-theoretic structure even when explicit 'cosmic' or HHH language is stripped away.
  • Anthropic and OpenAI models resist cosmic steering, defaulting to human-localist and suffering-focused attractors, respectively.
  • The research highlights critical differences in foundation model architectures and their inherent alignment attractors.
  • Findings are exploratory, emphasizing the need for further research into whether models are genuinely reasoning or simply pattern-matching.

Read the original post at lessw-blog
