Designing the Unsolvable: The Window for AI-Resistant Puzzles
Coverage of lessw-blog
In a recent post, lessw-blog discusses a personal strategic dilemma that highlights a broader issue in artificial intelligence evaluation: the shrinking window for creating human-generated tasks that can stump modern models. The author is currently deciding between two significant creative projects for the upcoming year: developing "Epistemic Roguelikes" (video games focused on knowledge acquisition) or writing new "D&D.Sci" scenarios. While framed as a personal choice, the analysis offers a compelling look at the current state of AI reasoning capabilities.
The core of the discussion revolves around D&D.Sci, a niche genre of puzzle, inspired by tabletop roleplaying games, in which players act as data scientists in a fantasy world. Instead of standard combat, the challenge involves analyzing raw data, such as biological surveys or magical combat logs, to infer the underlying mathematical rules of the universe. This requires deep, multi-step epistemic reasoning, a domain where human cognition often still outperforms machine learning.
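To make the format concrete, here is a minimal sketch of how such a scenario might work. The potion-brewing setting, the hidden rule, and every name below are hypothetical illustrations, not taken from any actual D&D.Sci scenario: a designer generates observations from a secret rule, and players must recover that rule from the data alone.

```python
import random
from collections import defaultdict

# Hidden rule (the "physics" of the scenario, unknown to players):
# a potion brews successfully iff the herb count is odd XOR the moon is full.
def hidden_rule(herbs: int, moon: str) -> bool:
    return (herbs % 2 == 1) != (moon == "full")

# The designer publishes only the raw observations, never the rule.
random.seed(0)
MOONS = ["new", "half", "full"]
observations = []
for _ in range(500):
    herbs = random.randint(1, 6)
    moon = random.choice(MOONS)
    observations.append({"herbs": herbs, "moon": moon,
                         "success": hidden_rule(herbs, moon)})

# A player's first probe: condition the success rate on candidate features.
rates = defaultdict(lambda: [0, 0])  # (herb parity, moon) -> [successes, trials]
for obs in observations:
    key = (obs["herbs"] % 2, obs["moon"])
    rates[key][0] += obs["success"]
    rates[key][1] += 1

for (parity, moon), (wins, n) in sorted(rates.items()):
    print(f"herb parity={parity}, moon={moon:>4}: "
          f"success rate {wins / n:.2f} over {n} trials")
```

A deterministic XOR like this one falls to simple conditioning; actual scenarios presumably layer noise, interactions, and red herrings so that rote pattern-matching fails and genuine hypothesis generation is required.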
The author argues that we are in a "now-or-never" window. Modern AI models, despite their rapid advancement, are still "flummoxed" by the specific demands of D&D.Sci scenarios: the puzzles require synthesizing novel rules from small datasets without leaning on memorized patterns from training data. The author anticipates, however, that this capability gap is temporary; as models improve, the ability of a human to design a puzzle that is solvable by humans but unsolvable by AI will likely vanish.
This creates a unique urgency. If these scenarios are written now, they serve as valid benchmarks: proof of a specific type of reasoning that distinguishes current human intelligence from artificial intelligence. If the author waits, the opportunity to demonstrate this "AI-proof" design space may be lost as models improve to the point where such inductive tasks become trivial for them. The post suggests that publicly demonstrating the ability to create these tests is valuable precisely because the window to do so is closing.
For developers and researchers interested in the limits of LLM reasoning, the post offers a distinctive perspective on benchmarking: it moves beyond coding tests and standardized exams to treat the gamification of the scientific method itself as the ultimate test of intelligence.
To understand the specific mechanics of these scenarios and the author's reasoning on project prioritization, we recommend reading the full entry.
Read the full post at LessWrong
Key Takeaways
- The author is prioritizing projects based on their ability to challenge current AI capabilities.
- D&D.Sci scenarios act as benchmarks for inductive reasoning, requiring players to deduce rules from novel data.
- There is a perceived "now-or-never" window to create puzzles that humans can solve but current AIs cannot.
- Publicly demonstrating these AI-resistant tasks is valuable for documenting the current boundary between human and machine reasoning.