METR Time Horizons 1.1: Signals of Accelerating AI Capabilities
Coverage of lessw-blog
A recent report highlighted by lessw-blog details the release of METR's Time Horizons 1.1, which shows a notable compression in the estimated doubling time of AI task horizons.
In a recent post, lessw-blog discusses the release of Time Horizons 1.1 by METR (Model Evaluation and Threat Research). This update represents a critical data point in the ongoing effort to accurately benchmark artificial intelligence, moving beyond static question-answering tests to measure autonomous capabilities and task duration.
The Context: Why Benchmarking Matters
As Large Language Models (LLMs) continue to saturate traditional benchmarks like MMLU or GSM8K, the industry faces a measurement crisis. Distinguishing between a chatbot that can answer a trivia question and an agent that can autonomously execute complex, multi-step engineering tasks is vital for forecasting economic impact and safety risks. METR's "Time Horizons" benchmark is designed to address this by evaluating how long and how effectively a model can operate without human intervention.
The Signal: Acceleration is Increasing
The core analysis presented by lessw-blog focuses on the velocity of improvement. The data suggests that the rate of progress is itself accelerating: the doubling time is shrinking, implying faster-than-exponential growth in task horizons. According to the report, the "50% time horizon doubling time" (an estimate of how quickly the length of task a model can complete with 50% reliability doubles) has decreased significantly. The previous version (1.0) put this doubling period at 165 days; the 1.1 update compresses it to 131 days.
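To make the compression concrete, here is a minimal sketch (our illustration, not METR's methodology) of what a constant doubling time implies for task-horizon growth over a year, using the 165-day and 131-day figures from the report:

```python
def horizon_minutes(days_elapsed: float,
                    doubling_time_days: float,
                    start_horizon_min: float = 60.0) -> float:
    """Task-length horizon after days_elapsed, assuming clean exponential
    growth at a fixed doubling time (a simplification; METR fits the
    doubling time to benchmark data, and the 60-minute starting horizon
    here is an arbitrary illustrative baseline)."""
    return start_horizon_min * 2 ** (days_elapsed / doubling_time_days)

for label, td in [("1.0 estimate", 165), ("1.1 estimate", 131)]:
    # Growth multiplier over one year of trend progress.
    per_year = horizon_minutes(365, td) / horizon_minutes(0, td)
    print(f"{label}: {td}-day doubling -> ~{per_year:.1f}x longer tasks per year")

# 1.0 estimate: 165-day doubling -> ~4.6x longer tasks per year
# 1.1 estimate: 131-day doubling -> ~6.9x longer tasks per year
```

The gap compounds quickly: at a 131-day doubling time, two years of trend progress multiplies task length by roughly 48x, versus about 21x at 165 days.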
This shift is significant for strategic forecasting. A shortening doubling time implies that the window between current capabilities and highly autonomous systems is closing faster than earlier extrapolations anticipated. While hardware improvements (Moore's Law) contribute to this, the compression suggests that algorithmic efficiency and post-training enhancements are compounding those gains.
Model Performance
The update also tracks the performance of specific frontier models. The post notes substantial improvements in top-tier systems, citing performance figures for "Claude 4.5 Opus" (as the model is referenced in the source text), which reportedly improved its measured time horizon from roughly 4 hours and 49 minutes to over 5 hours and 20 minutes. Whatever the exact model nomenclature, the trend indicates that leading models are becoming capable of sustaining coherent agency over longer periods.
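As a back-of-the-envelope check (the 4:49 and 5:20 figures come from the source; the arithmetic below is our own), that jump can be expressed as a fraction of a doubling:

```python
import math

old_min = 4 * 60 + 49  # reported previous horizon: ~4 h 49 m
new_min = 5 * 60 + 20  # reported updated horizon:  ~5 h 20 m

ratio = new_min / old_min
doublings = math.log2(ratio)  # how much of one full doubling this jump represents
print(f"~{ratio:.2f}x longer tasks, i.e. ~{doublings:.2f} of one doubling")
# ~1.11x longer tasks, i.e. ~0.15 of one doubling
```

At the 1.1 trend's 131-day doubling time, ~0.15 of a doubling corresponds to roughly three weeks' worth of expected progress, though single-model comparisons are noisy and should not be read as precise trend measurements.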
Why This Was Overlooked
The post notes that this release was somewhat overshadowed by other industry news (referenced as "Moltbook"). However, for observers tracking the trajectory of Artificial General Intelligence (AGI), the slope of the capability curve provided by METR is likely a higher-fidelity signal than individual product launches. The data indicates that the ecosystem is moving deeper into the realm of long-horizon agency at an increasing pace.
We recommend reading the full post to understand the specific methodology changes in Time Horizons 1.1 and the implications for AI safety and development timelines.
Read the full post on LessWrong
Key Takeaways
- METR released Time Horizons 1.1, an update to their autonomous capability benchmark.
- The estimated doubling time for the '50% time horizon' metric shortened from 165 days to 131 days.
- This compression suggests AI agency is advancing faster than previously modeled.
- Top-tier models are showing measurable gains in their ability to handle longer, more complex tasks.