METR Analysis: Claude Opus 4.5 Extends Agentic Time Horizon

Coverage of lessw-blog

· PSEEDR Editorial

A recent report highlights a significant shift in how long AI agents can operate effectively, with Claude Opus 4.5 setting a new benchmark for sustained task execution.

In a recent post, lessw-blog discusses new findings regarding the agentic capabilities of Claude Opus 4.5, specifically focusing on metrics provided by METR (Model Evaluation and Threat Research). As the industry shifts focus from simple chatbots to autonomous agents capable of executing multi-step workflows, the definition of performance is evolving. It is no longer sufficient to measure accuracy on a single prompt; evaluators must now determine how long a model can maintain coherence and effective reasoning before the probability of failure becomes too high.

This topic is critical because the utility of AI is currently bottlenecked by reliability over time. In agentic workflows, such as coding an entire application or conducting open-ended research, errors compound. A model that fails after 30 minutes is useful for assistance; a model that can persist for 5 hours is potentially autonomous. METR's rigorous approach to quantifying this "time to failure" provides a more realistic view of deployment readiness than standard academic benchmarks.
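To see why compounding matters, here is a back-of-the-envelope illustration (mine, not the post's or METR's, and assuming each step of a task succeeds independently with the same fixed probability): per-step reliability that looks excellent still erodes quickly over long action sequences.

    # Illustrative only: assumes independent steps with a fixed per-step success rate.
    per_step_success = 0.99
    for n_steps in (10, 100, 500):
        print(n_steps, "steps ->", round(per_step_success ** n_steps, 3))
    # 10 steps -> 0.904, 100 steps -> 0.366, 500 steps -> 0.007

Real agent trajectories are not independent trials, of course, but the geometric decay captures why task length, not just single-turn accuracy, has become the binding constraint.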

The analysis centers on the concept of a "time horizon": a metric estimating the duration a model can operate while maintaining a specific probability of success. According to the data presented, Claude Opus 4.5 has achieved a 50% time horizon of approximately 4 hours and 49 minutes. This figure represents the highest time horizon METR has published to date, suggesting a substantial improvement in the model's ability to navigate extended, complex tasks compared to its predecessors.
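As a rough sketch of how such a horizon is read off a fitted curve, the post describes a logistic success curve over task length: success probability falls as tasks get longer, and the horizon for a given probability is the task length at which the curve crosses it. The coefficients below are invented purely so the numbers land near the reported figures; they are not METR's actual fit.

    import math

    def success_prob(task_minutes, a, b):
        # Logistic success curve over log task length: p = sigmoid(a - b * ln(t))
        return 1.0 / (1.0 + math.exp(-(a - b * math.log(task_minutes))))

    def horizon(p_target, a, b):
        # Task length (minutes) at which the fitted success probability equals p_target
        logit = math.log(p_target / (1.0 - p_target))
        return math.exp((a - logit) / b)

    a, b = 3.31, 0.585                     # hypothetical coefficients for illustration
    print(success_prob(60, a, b))          # ~0.71: fitted success probability on a 1-hour task
    print(horizon(0.5, a, b) / 60)         # ~4.8 hours: the 50% time horizon
    print(horizon(0.8, a, b))              # ~27 minutes: the 80% time horizon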

However, the report offers critical nuance regarding reliability. While the 50% horizon is impressive, the 80% time horizon (the task length at which the model is still highly likely to succeed) sits at just 27 minutes. This is comparable to other leading models, such as the referenced GPT-5.1-Codex-Max, which holds a 32-minute benchmark in this category. The disparity between the 50% and 80% figures indicates a "flatter logistic success curve" for Opus 4.5. In practical terms, this suggests that while Opus 4.5 is not necessarily more reliable on short tasks than its peers, it degrades much more slowly as task length increases, allowing it to attempt significantly longer operations with a fighting chance of success.
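A quick way to see what the flatter curve implies, using only the two figures above (this arithmetic is mine, not the post's, and treats the curve as exactly logistic in log task length): the spacing between the 80% and 50% horizons is set by the curve's slope, so a wide gap means a shallow slope.

    import math

    def implied_slope(t50, t80):
        # Slope of a logistic curve in ln(task length) consistent with a given 50%/80% horizon pair
        return math.log(0.8 / 0.2) / math.log(t50 / t80)

    def t50_given_t80(t80, slope):
        # 50% horizon implied by an 80% horizon under a given slope
        return t80 * math.exp(math.log(0.8 / 0.2) / slope)

    print(implied_slope(289, 27))        # ~0.58: the shallow slope behind the ~4h49m / 27min pair
    print(t50_given_t80(27, 1.0) / 60)   # ~1.8 hours: what a 27-min 80% horizon would imply under a steeper, hypothetical slope of 1.0

Holding the 80% horizon fixed, a flatter curve pushes the 50% horizon out dramatically, which is exactly the trade-off between short-task reliability and long-task persistence that the post highlights.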

The post also notes limitations in the current evaluation suite. The upper bound of the confidence interval for the 50% metric extends beyond 20 hours, a figure METR suggests is likely inflated by a scarcity of sufficiently long tasks in its current testing battery. As METR works to update its task suite, these numbers serve as a preliminary but promising indicator that frontier models are beginning to bridge the gap between brief interactions and prolonged, autonomous work.

For those tracking the trajectory of AI agents, this analysis provides essential data points on the trade-offs between peak capability and sustained reliability.

Read the full post at LessWrong
