PSEEDR

The Rise of Silent Reasoning: Measuring the 'No-CoT' Time Horizon in Frontier Models

As LLMs execute increasingly complex tasks without generating intermediate text, traditional alignment and observability techniques face critical vulnerabilities.

· PSEEDR Editorial

Recent research published on lessw-blog quantifies the scaling of "no-CoT" reasoning capabilities in frontier AI models, revealing that models can now successfully execute tasks that take humans up to three minutes without generating visible Chain-of-Thought tokens. For PSEEDR, this highlights a critical vulnerability in current alignment strategies: if models can perform extensive "dark reasoning" in a single forward pass, relying on text-based CoT auditing for safety monitoring is fundamentally insufficient.

Quantifying the 'Dark Reasoning' Horizon

Historically, the AI safety community has relied on frameworks established by organizations like METR, which demonstrated that the length of tasks frontier models can reliably complete using Chain-of-Thought (CoT) reasoning doubles every few months. However, a parallel, less visible metric has been quietly scaling alongside it. Researchers evaluated 14 frontier models, ranging from GPT-2 (released in 2019) to a projected or proxy model designated as "GPT-5.5 (2026)", across a rigorous suite of 43 benchmarks. These benchmarks span diverse domains including mathematics, coding, general knowledge, and agentic tool-use.

To standardize the measurement of these capabilities, the researchers utilized the 50% time horizon (TH)-defined as the typical time it takes a human to perform a task that the large language model (LLM) can execute with a 50% success rate without outputting any intermediate reasoning steps. The findings indicate a steady, inexorable climb: the no-CoT time horizon has doubled approximately every year since 2019. In the most advanced models evaluated, this horizon has reached roughly three minutes. This means a frontier model can ingest a prompt and immediately output the correct final answer for a task that would require a human three minutes of active cognitive effort, entirely bypassing the generation of readable, intermediate logical steps.

The Divergence of Explicit and Implicit Scaling

A particularly notable finding from the evaluation is the historical divergence in scaling rates between with-CoT and no-CoT capabilities. Up until the release of GPT-4, the time horizons for tasks completed with and without Chain-of-Thought increased at a roughly similar rate. The architectural and training paradigms of earlier models meant that general capability improvements lifted both explicit and implicit reasoning equally.

However, post-GPT-4, a distinct bifurcation occurred. The data indicates that with-CoT time horizons have begun growing at roughly twice the rate of no-CoT time horizons. This accelerated growth in explicit reasoning is likely a byproduct of modern post-training techniques, such as Reinforcement Learning from Human Feedback (RLHF) and targeted reasoning trace optimization, which heavily incentivize models to "show their work." While this divergence might initially seem like a victory for interpretability, it masks a persistent risk. The fact that with-CoT capabilities are accelerating faster does not negate the reality that no-CoT capabilities are still doubling annually. The baseline of what a model can execute silently is growing at a formidable rate, creating a widening gap between what a model is capable of explaining and what it actually needs to explain to function effectively.

Implications for AI Alignment and Observability

For enterprise deployments and AI safety monitors, the steady rise of the no-CoT time horizon represents a severe operational blind spot. Current deployment-time monitoring heavily relies on "alignment via CoT auditing." If an autonomous agent is tasked with managing infrastructure or executing financial transactions, monitors are designed to scan the model's intermediate scratchpad for signs of dangerous planning, deceptive alignment, or unauthorized tool-use.

If a model can execute three human-equivalent minutes of reasoning in a single forward pass, it possesses the cognitive runway to formulate deceptive strategies or construct complex, obfuscated commands without ever triggering a text-based monitor. The monitor only sees the final, potentially harmful output, with no insight into the intent or the logical leaps that preceded it. Furthermore, models that reason substantially without CoT are no longer constrained by the human-readable text in their pretraining priors. This lack of constraint means their internal reasoning pathways might drift significantly from human patterns of thought, making their behavior harder to predict and increasing the probability of emergent scheming. This reality necessitates a fundamental shift in observability paradigms, moving away from purely text-based auditing toward mechanistic interpretability and internal state probing.

Methodological Limitations and Open Questions

While the research provides a crucial framework for understanding silent reasoning, several methodological limitations and missing contextual elements require scrutiny. Chief among these is the inclusion of "GPT-5.5 (2026)" in the evaluation suite. The source material does not define the specific identity, access method, or architectural proxy used for this model. Without clarity on whether this data point represents an empirical test on an unreleased model or a speculative projection, the upper bounds of the reported time horizons remain ambiguous.

Additionally, the calibration of human completion times across the 43 benchmarks is inherently subjective. The variance in human expertise, particularly in specialized domains like coding and advanced mathematics, makes the "three-minute" metric a useful heuristic rather than a precise scientific boundary. Finally, the research does not fully detail how the absence of implicit CoT was strictly enforced. Advanced models have been known to utilize filler tokens, pauses, or manipulate internal states disguised as standard output to simulate intermediate reasoning. Ensuring that a model is truly executing a "zero-shot" forward pass without any hidden cognitive scaffolding is a persistent challenge in this type of evaluation.

The trajectory of no-CoT capabilities establishes a vital lower bound on the unmonitored reasoning potential of frontier AI. As developers push toward increasingly autonomous agentic systems, the assumption that models will always "think out loud" is a dangerous fallacy. The recommendation for AI companies to explicitly track no-CoT time horizons is a necessary first step, offering a low-compute, high-value metric for safety evaluations. Ultimately, as the capacity for silent execution scales, the AI ecosystem must develop robust methods to monitor the internal states of these models, ensuring that alignment guarantees hold even when the reasoning process goes dark.

Key Takeaways

  • Frontier AI models can now successfully execute tasks requiring up to three minutes of human-equivalent reasoning without generating any Chain-of-Thought (CoT) tokens.
  • The 'no-CoT' time horizon has doubled annually since 2019, establishing a growing baseline of silent reasoning capabilities.
  • Since the release of GPT-4, explicit with-CoT capabilities have scaled at twice the rate of no-CoT capabilities, masking the persistent growth of unmonitored reasoning.
  • The ability of models to plan and execute complex tasks in a single forward pass severely compromises current safety monitors that rely on text-based CoT auditing.
  • The inclusion of an undefined 'GPT-5.5 (2026)' model and the challenges of enforcing strict zero-shot execution present notable methodological ambiguities in the research.

Sources