Beyond Chain-of-Thought: How Filler Tokens and Repetition Boost LLM Math Scores
Coverage of lessw-blog
In a compelling new analysis published on LessWrong, the author investigates a nuanced capability emerging in the latest generation of Large Language Models (LLMs): the ability to utilize "filler tokens" to improve mathematical reasoning without explicit Chain-of-Thought (CoT) prompting.
For years, a prevailing assumption in the field of generative AI has been that Large Language Models cannot "think" silently. Because these models generate output one token at a time, each produced by a single forward pass, the consensus was that they need explicit tokens, such as a written-out Chain-of-Thought (CoT), to hold intermediate computations. If a model answers immediately, it is presumed to be relying on rote memorization or heuristic pattern matching rather than calculation. A new post on lessw-blog challenges this assumption, presenting evidence that recent frontier models can leverage "filler tokens" or simple problem repetition to perform latent computation.
The Context: The Cost of Reasoning
Traditionally, improving an LLM's performance on complex tasks like mathematics required prompt-engineering strategies that force the model to verbalize its steps. While effective, CoT increases latency and token costs, so researchers have long sought ways to induce "thinking" without the verbosity. Previous studies indicated that adding meaningless filler (such as a string of dots) did not help older, pre-2024 models; they simply treated the filler as noise, suggesting they could not use the extra computation those tokens provide for background processing.
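To make the contrast concrete, here is a minimal sketch of how the three prompting styles might be constructed. The example problem, the exact wording, and the number of filler dots are illustrative assumptions, not details taken from the post.

```python
# Minimal sketch of the three prompting styles discussed above:
# direct answering, explicit Chain-of-Thought, and meaningless filler.
# The wording and the example problem are illustrative assumptions.

PROBLEM = "What is the remainder when 7**100 is divided by 13?"

def direct_prompt(problem: str) -> str:
    # Baseline: the model must commit to an answer immediately,
    # with no intermediate tokens of any kind.
    return f"{problem}\nRespond with only the final number."

def cot_prompt(problem: str) -> str:
    # Chain-of-Thought: effective, but adds latency and token cost
    # because the model spells out every step.
    return f"{problem}\nThink step by step, then give the final number."

def filler_prompt(problem: str, n_fillers: int = 50) -> str:
    # Filler: the dots carry no semantic content, but each one the
    # model emits is another forward pass before it must answer.
    filler = ". " * n_fillers
    return (f"{problem}\nFirst output this exact filler: {filler.strip()}\n"
            f"Then respond with only the final number.")
```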
The Signal: Latent Computation in Newer Models
The analysis from lessw-blog suggests a shift in this paradigm. The author demonstrates that recent high-capability models (referred to in the post by designations such as "Opus 4.5") show a measurable performance improvement when provided with filler tokens or when the problem statement is repeated.
The post highlights that simply allowing the model to emit filler tokens before giving a final answer, without generating any actual reasoning text, boosted performance on a competition math dataset from 45% to 51%. Even more notably, the author found that repeating the problem statement was an even more reliable source of gains. This implies that the model uses the computation spent on those extra tokens to refine its internal state, effectively "double-checking" its understanding before committing to an answer.
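A simple way to picture the comparison is an accuracy loop over the different prompting conditions. The sketch below reuses the prompt builders from the earlier example and assumes hypothetical `query_model` and `extract_answer` helpers supplied by the caller; it is not the author's actual evaluation harness.

```python
# Sketch of an accuracy comparison across the conditions in the post:
# direct answering, filler tokens, and problem repetition.
# `query_model` and `extract_answer` are hypothetical helpers supplied
# by the caller; the other prompt builders come from the earlier sketch.

def repetition_prompt(problem: str, repeats: int = 2) -> str:
    # Repetition: the extra tokens are not reasoning steps, but they
    # are at least semantically tied to the task being solved.
    return "\n".join([problem] * repeats) + "\nRespond with only the final number."

def evaluate(problems, answers, make_prompt, query_model, extract_answer):
    # Fraction of problems answered correctly under one prompting style.
    correct = 0
    for problem, answer in zip(problems, answers):
        reply = query_model(make_prompt(problem))
        if extract_answer(reply) == answer:
            correct += 1
    return correct / len(problems)

# Example usage with a real dataset and model client:
# acc_direct = evaluate(problems, answers, direct_prompt, query_model, extract_answer)
# acc_filler = evaluate(problems, answers, filler_prompt, query_model, extract_answer)
# acc_repeat = evaluate(problems, answers, repetition_prompt, query_model, extract_answer)
```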
This finding is significant for developers and prompt engineers. It suggests that the computational capacity of modern LLMs is becoming more flexible. The ability to utilize non-reasoning tokens for computation indicates that future prompt strategies might achieve high accuracy without the overhead of long-winded CoT explanations.
Conclusion
This research offers a glimpse into the evolving internal mechanics of LLMs, suggesting that the line between "fast" (intuitive) and "slow" (deliberative) thinking in AI is becoming blurred. For a detailed breakdown of the experiments and the specific performance metrics observed, see the link below.
Read the full post on LessWrong
Key Takeaways
- Evolution of Capability: Unlike models released prior to 2024, recent frontier LLMs can utilize meaningless "filler tokens" to perform background computation.
- Performance Boost: In the tested scenarios, using filler tokens improved math performance on a competition dataset from 45% to 51% without explicit Chain-of-Thought.
- The Repetition Hack: Repeating the problem statement proved to be a highly effective and reliable method for improving accuracy, particularly for weaker models.
- Latent Computation: The findings suggest that modern models possess a latent capacity to compute during token generation, even if the tokens themselves carry no semantic weight.