Is METR Underestimating LLM Time Horizons?
Coverage of lessw-blog
A new analysis challenges standard forecasting metrics, proposing that human-relative benchmarks reveal a much faster rate of AI progress than currently assumed.
In a recent analysis on LessWrong, the author scrutinizes the methodology used by METR (Model Evaluation and Threat Research) to forecast Large Language Model (LLM) capabilities. As the industry relies heavily on benchmarks to predict when systems might reach human-level agency, the specific metrics chosen to track progress are critical. The post argues that current fixed-threshold measurements might be masking the true velocity of improvement, potentially delaying our understanding of when high-stakes capabilities will emerge.
Forecasting the trajectory of Artificial General Intelligence (AGI) is notoriously difficult, often relying on extrapolating performance on specific evaluations. METR is widely considered a gold standard for measuring "effective compute" and task horizons. However, this analysis suggests that METR's reliance on fixed reliability targets may be conservative, understating how fast capabilities are actually growing. The author proposes an alternative lens: a "human-relative" metric. Instead of asking how long a model can operate at an arbitrary, fixed reliability percentage, this metric asks for the longest time horizon over which an LLM exceeds human baseline reliability.
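To make the distinction concrete, here is a minimal Python sketch of one plausible reading of the human-relative metric: scan tasks in order of length and report the longest duration at which the model's success rate still meets or exceeds the human baseline. Both the numbers and the prefix-scan construal are illustrative assumptions, not the post's actual data or definition.

```python
# Minimal sketch of a "human-relative" horizon metric (our construal).
# All figures below are hypothetical, not the post's data.

task_minutes  = [1, 4, 15, 60, 240, 960]                # task lengths
model_success = [0.98, 0.95, 0.85, 0.60, 0.30, 0.10]    # model reliability
human_success = [0.97, 0.93, 0.88, 0.70, 0.55, 0.45]    # human baseline

def human_relative_horizon(minutes, model, human):
    """Longest task length before the model first falls below
    the human baseline's reliability."""
    horizon = 0
    for t, m, h in zip(minutes, model, human):
        if m < h:
            break  # model drops below the human baseline here
        horizon = t
    return horizon

print(human_relative_horizon(task_minutes, model_success, human_success))
# -> 4 (minutes): beyond this length, humans are the more reliable agent
```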
The distinction is subtle but vital. Humans are imperfect agents; therefore, holding AI to a static, high-reliability standard might obscure how quickly it is becoming more reliable than a human. Through this human-relative lens, the growth trend appears significantly steeper. The analysis finds that while METR's standard trend shows the effective horizon doubling every 6.8 months, the proposed metric implies a doubling every 1.9 months. This represents a massive acceleration in the perceived rate of progress.
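A quick back-of-envelope calculation shows how large the gap between those two doubling times really is (the doubling times are the post's; the annualization is ours):

```python
# Convert the two reported doubling times into annual growth factors.
for label, months_per_doubling in [("standard METR", 6.8), ("human-relative", 1.9)]:
    annual_factor = 2 ** (12 / months_per_doubling)
    print(f"{label}: horizon grows ~{annual_factor:.0f}x per year")
# standard METR:  horizon grows ~3x per year
# human-relative: horizon grows ~80x per year
```

An order-of-magnitude difference in annual growth is what drives the divergent timelines discussed next.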
This quantitative discrepancy has profound implications for forecasting. If the faster rate holds true, the timeline for achieving human-level horizons could converge around 2026 or 2027. The author notes that while "super-exponential" growth models (often associated with aggressive timelines like the AI-2027 thesis) are not strongly supported by standard METR data, they align much more closely with this proposed human-relative metric. This suggests that comparing AI performance directly to imperfect human baselines may offer a stronger signal of functional agency than abstract reliability scores.
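The timeline arithmetic follows directly from the doubling time. As a hedged illustration only: the starting horizon (one hour) and the "human-level" target (a 167-hour work-month) below are placeholder assumptions of ours, not figures from the post.

```python
import math

def months_to_target(current_hours, target_hours, doubling_months):
    """Months until an exponentially growing horizon reaches the target."""
    return doubling_months * math.log2(target_hours / current_hours)

# Placeholder assumptions: 1-hour horizon today, 167-hour work-month target.
print(months_to_target(1, 167, 1.9))  # ~14 months at the human-relative rate
print(months_to_target(1, 167, 6.8))  # ~50 months at the standard METR rate
```

Under these placeholder numbers, the faster trend closes the gap in roughly a year, which is how a 2026-2027 window can emerge, while the slower trend pushes the same milestone several years further out.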
While the author acknowledges substantial uncertainty and noise in the data, particularly regarding the definition of baselines, the argument highlights the sensitivity of our forecasts to methodology. For researchers and strategists, this serves as a reminder that the definition of "success" in benchmarking can radically alter the predicted arrival time of transformative capabilities.
We recommend reading the full post to examine the graphs and the specific mathematical arguments regarding logistic versus super-exponential curves.
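For readers who want the shape of that comparison before clicking through, these are the standard textbook forms of the curve families at issue; the post's exact parameterizations may differ.

```latex
% Standard forms (our notation; the post's parameterization may differ):
\begin{align*}
  \text{exponential:}       \quad & h(t) = h_0 \, 2^{t/d}
      && \text{constant doubling time } d \\
  \text{super-exponential:} \quad & h(t) = h_0 \, 2^{t/d(t)}, \quad d(t) \searrow
      && \text{doubling time shrinks over time} \\
  \text{logistic:}          \quad & h(t) = \frac{K}{1 + e^{-k(t - t_0)}}
      && \text{growth saturates at capacity } K
\end{align*}
```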
Read the full post on LessWrong
Key Takeaways
- The post proposes a new "human-relative" metric for measuring LLM progress, contrasting it with METR's fixed reliability targets.
- Under this new metric, LLM time horizons appear to double every 1.9 months, compared to the 6.8 months suggested by standard METR trends.
- This accelerated trend supports the possibility of reaching human-level horizons around 2026-2027.
- The analysis suggests that "super-exponential" growth models fit the human-relative data better than they fit standard benchmarks.
- Comparing AI against imperfect human baselines may provide a more realistic view of functional agency than absolute reliability thresholds.