Claude Plays Pokemon: A Benchmark for LLM Agency
Coverage of a LessWrong blog post
In a recent update on LessWrong, the "Claude Plays Pokemon" project provides a fascinating look at the capabilities of Claude Opus 4.5, using the classic game as a rigorous test for sequential decision-making and spatial reasoning.
The post updates the community on the "Claude Plays Pokemon" initiative, focusing specifically on the performance of Claude Opus 4.5. While the premise involves a classic Game Boy title, the implications extend far beyond gaming: the project serves as a stress test for Large Language Model (LLM) agency, evaluating how well a model can navigate a complex, state-dependent environment over hundreds of thousands of steps.
The broader context here is the difficulty of evaluating "general" intelligence. Standard benchmarks often consist of static questions that test knowledge retrieval or code generation in isolation. They rarely test an AI's ability to maintain a coherent "thread" of agency over extended periods of operation. Games like Pokemon Red require the agent to remember where it has been, manage limited resources (inventory), and solve spatial puzzles that change the state of the world. This makes them an ideal sandbox for observing how models handle frustration, backtracking, and multi-step logic.
The author argues that this specific project is a superior test of underlying LLM progress compared to other agentic frameworks. Many "AI plays game" projects rely heavily on "scaffolding"—external code that hard-codes strategies or heavily guides the AI's decision-making. In contrast, this project aims to test the raw intelligence of the model itself with minimal external assistance. The goal is to see if the model can handle spatial reasoning and long-term objectives purely through its own context processing.
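To make the "minimal scaffolding" idea concrete, a harness of this kind can be reduced to little more than an emulator interface and a model call, with no strategy hard-coded. The sketch below is purely illustrative: `StubEmulator` and `call_model` are hypothetical stand-ins, not the project's actual code.

```python
VALID_BUTTONS = {"up", "down", "left", "right", "a", "b", "start", "select"}

class StubEmulator:
    """Hypothetical stand-in for a Game Boy emulator interface."""
    def __init__(self):
        self.pressed = []

    def screenshot(self):
        return b"<frame bytes>"  # a real harness would return the current frame

    def press(self, button):
        self.pressed.append(button)

def call_model(screenshot, history):
    # Stand-in for the LLM call; a real harness would send the frame and
    # recent history to the model and parse a button press from its reply.
    return "a"

def agent_step(emulator, history):
    frame = emulator.screenshot()
    action = call_model(frame, history)
    if action in VALID_BUTTONS:  # the harness only validates; it never strategizes
        emulator.press(action)
    history.append(action)
    return action
```

The point of the design is that every decision, from pathfinding to battle tactics, must come out of the model's own context processing rather than from helper code.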
According to the update, Claude Opus 4.5 has made substantial strides. Since the previous report, the AI has successfully navigated notorious in-game challenges such as the teleportation mazes of Silph Co., the step-limited Safari Zone, and the Pokemon Mansion. Having acquired all eight gym badges, the model has demonstrated improved capabilities in vision, spatial awareness, and context window utilization. It is also better at detecting when it is stuck in a loop, a common failure mode for LLMs where they repeat actions indefinitely.
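Loop detection of this kind can be approximated by checking whether the tail of the agent's action history repeats a short cycle several times in a row. The helper below is a hypothetical illustration of the general idea, not the project's actual mechanism:

```python
def detect_loop(actions, max_period=8, repeats=3):
    """Return the cycle length if the last `period * repeats` actions
    repeat a pattern of length `period`, else None. Illustrative only."""
    for period in range(1, max_period + 1):
        window = actions[-period * repeats:]
        if len(window) < period * repeats:
            continue  # not enough history to confirm a cycle of this length
        pattern = window[:period]
        if all(window[i] == pattern[i % period] for i in range(len(window))):
            return period
    return None
```

For example, an agent bouncing between two tiles produces an alternating `up`/`down` history, which this check flags as a cycle of length 2.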
However, the experiment also highlights the current "ceiling" of these capabilities. The model is currently stalled at Victory Road, specifically struggling with boulder puzzles. These puzzles require pushing objects into specific locations to open paths, a task that demands rigorous forward planning and spatial consistency—areas where the model still falters. Despite logging 230,000 steps, the model has yet to master the combination of long-term planning and precise interaction these puzzles demand. The author notes that while the model's vision and context handling have improved, it still requires external notes to maintain focus on long-term goals, suggesting that "raw" agency still has significant gaps to close before it can operate fully autonomously in complex environments.
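As a toy illustration of why boulder puzzles demand forward planning, the problem can be framed as search over (player, boulder) states, where one wrong push can leave the goal unreachable. The grid and solver below are hypothetical and simplified, not taken from the game:

```python
from collections import deque

GRID = [
    "#####",
    "#. G#",
    "#.B.#",
    "#P..#",
    "#####",
]  # P = player, B = boulder, G = goal, # = wall (toy layout)

def solve(grid):
    """Breadth-first search over (player, boulder) states; returns the
    shortest move sequence that pushes the boulder onto the goal."""
    walls, player, boulder, goal = set(), None, None, None
    for y, row in enumerate(grid):
        for x, c in enumerate(row):
            if c == "#": walls.add((x, y))
            elif c == "P": player = (x, y)
            elif c == "B": boulder = (x, y)
            elif c == "G": goal = (x, y)
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    start = (player, boulder)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (p, b), path = queue.popleft()
        if b == goal:
            return path
        for name, (dx, dy) in moves.items():
            np = (p[0] + dx, p[1] + dy)
            if np in walls:
                continue
            nb = b
            if np == b:                      # walking into the boulder pushes it
                nb = (b[0] + dx, b[1] + dy)
                if nb in walls:              # blocked pushes are illegal
                    continue
            state = (np, nb)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [name]))
    return None  # no sequence of pushes can reach the goal
```

Even this tiny 5x5 instance requires the solver to reposition around the boulder between pushes; an LLM choosing one button at a time must hold that same plan stable across many turns of context, which is precisely where the post reports failures.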
For those interested in the intersection of gaming and AI evaluation, this post offers a clear view of where current models excel and where they hit hard limits.
Key Takeaways
- The "Claude Plays Pokemon" project serves as a high-fidelity benchmark for unassisted LLM agency, minimizing external scaffolding.
- Claude Opus 4.5 has successfully navigated complex game areas like Silph Co. and the Safari Zone, acquiring all eight badges.
- The model demonstrates improved vision, spatial awareness, and loop detection compared to previous iterations.
- A current bottleneck exists at Victory Road's boulder puzzles, indicating persistent struggles with multi-step physical puzzle solving.
- Long-term planning remains a challenge, necessitating external notes for the model to maintain focus over thousands of steps.