New Prompting Method Doubles GPT-5.2's LongCoT Reasoning Performance to 65.6% Without Retraining


Alex Zhang, working with Zhening Li and Omar Khattab, reported a large gain for large language models (LLMs) on the challenging LongCoT benchmark. Their approach raised the accuracy of GPT-5.2 run as a Recursive Language Model (RLM) from 38.7% to 65.6% on LongCoT-mini, without any additional model training. Zhang presented the result as support for the team's recently proposed "Mismanaged Geniuses Hypothesis," writing in a recent social media post:

"we boost performance of RLM(GPT-5.2) to double the best performing number (38.7% --> 65.6%) on LongCoT-mini without any training!"
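As a quick check on the quoted figures, the short Python sketch below works out the absolute and relative gain, taking 38.7% as the baseline and 65.6% as the new score. The variable names are illustrative only, and the baseline choice is an assumption: against 38.7% the improvement comes to roughly 1.7 times, so the "double" in the post may refer to a different earlier number.

```python
# Figures quoted in the post (LongCoT-mini accuracy, in percent).
baseline = 38.7   # RLM(GPT-5.2) before the new prompting
boosted = 65.6    # RLM(GPT-5.2) after the new prompting

# Absolute gain in percentage points.
absolute_gain = boosted - baseline

# Ratio of new to old accuracy, and the relative gain as a percentage.
ratio = boosted / baseline
relative_gain_pct = (ratio - 1.0) * 100.0

print(f"absolute gain: {absolute_gain:.1f} points")   # 26.9 points
print(f"ratio: {ratio:.2f}x")                         # 1.70x
print(f"relative gain: {relative_gain_pct:.1f}%")     # 69.5%
```

In other words, the reported jump is about 27 percentage points, or close to a 70% relative improvement over the 38.7% figure.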

The LongCoT benchmark, introduced by researchers including Sumeet Ramesh Motwani and Charles London, evaluates frontier LLMs on long-horizon compositional reasoning across domains such as mathematics, computer science, and chemistry. Initial findings from the benchmark showed that models, including GPT-5.2, struggled, often achieving less than 10% accuracy in the primary reasoning-only setting. For RLMs, the benchmark paper generally attributed the shortfall to an inability to perform task decomposition.

Zhang and his team, however, argue that the issue lies not in the models' inherent capabilities but in the methods used to interact with them. Their "Mismanaged Geniuses Hypothesis" posits that existing frontier LLMs are severely underutilized due to suboptimal prompting and management of individual language model calls. According to Zhang's post:

"The paper generally attributes this to the RLMs inability to perform task decomposition, but we argue this is more our fault in how we prompt them; this capability is fully available to GPT-5.2 with an RLM harness!"

The gain came from "different, rather specific prompting" combined with an RLM harness; the base model itself was not retrained.

The result, which builds on previous work by @raw_works and the LongCoT benchmark creators, suggests that prompting strategies and frameworks such as Recursive Language Models can unlock latent reasoning abilities in LLMs. The researchers emphasized that their numbers are not meant for direct comparison on the LongCoT leaderboard, but to demonstrate that existing models possess the underlying capabilities when effectively managed. If the finding holds, it points toward optimizing interaction and decomposition strategies rather than solely pursuing further model scaling.