TOPReward Achieves 0.947 Value-Order Correlation in Zero-Shot Robotic Task Progress Estimation

A new method called TOPReward targets a long-standing obstacle in real-world reinforcement learning (RL): designing effective reward functions for robots. Developed by researchers Shirui Chen and Cole Schafroth, TOPReward uses Vision-Language Models (VLMs) to generate reward signals that generalize across tasks, achieving a mean Value-Order Correlation (VOC) of 0.947 with the Qwen3-VL model across more than 130 distinct real-world tasks. The approach was recently featured on RoboPapers Episode #75.

Reinforcement learning for robotics has long been hampered by the difficulty of crafting robust, generalizable reward functions. Traditional methods often require extensive human effort and domain-specific expertise, which limits scalability and adaptability to new tasks. A tweet from RoboPapers underscored the challenge: "Reinforcement on robots is highly limited by our ability to design good reward functions; this means that designing strong, generalizable reward functions is a key enabler to progress on real-world reinforcement learning."

TOPReward tackles this by drawing on capabilities that large VLMs already have. Rather than asking a VLM to output an explicit numerical progress value, which the model can misreport, TOPReward reads task progress directly from the model's token logits. Specifically, it derives the reward from the probability of the "True" token in the VLM's question-answering response, in effect asking the VLM to judge whether the task is complete. The RoboPapers tweet explained: "TOPReward directly generates rewards from the probability of the 'True' token of a VLM question-answering response; this makes it easy to implement, incredibly general, and surprisingly powerful."
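
To make the idea concrete, here is a minimal sketch of turning answer-position logits into a per-frame reward. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy vocabulary, and the `true_token_id` index are hypothetical, and the sketch assumes the reward is the plain softmax probability of the "True" token over the full vocabulary (the paper may normalize or post-process differently).

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of vocabulary logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def true_token_reward(logits: Sequence[float], true_token_id: int) -> float:
    """Probability the model assigns to the 'True' token at the answer position.

    Assumption: `logits` are the VLM's next-token logits right after a prompt
    such as "Is the task complete? Answer True or False:", and
    `true_token_id` is the tokenizer's id for "True" (hypothetical here).
    """
    return math.exp(log_softmax(logits)[true_token_id])


def progress_curve(per_frame_logits: Sequence[Sequence[float]],
                   true_token_id: int) -> List[float]:
    """One reward per video prefix/frame, in chronological order."""
    return [true_token_reward(l, true_token_id) for l in per_frame_logits]


# Toy example: a 4-token vocabulary in which index 2 stands in for "True".
# The logit for "True" rises as the (imaginary) episode progresses.
toy_logits = [
    [0.0, 0.0, -2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
rewards = progress_curve(toy_logits, true_token_id=2)
print(rewards)
```

In a real pipeline the logits would come from a forward pass of the VLM on the video frames and the question prompt; only the final softmax step is shown here.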

The resulting probabilistically grounded temporal value function performs well in zero-shot evaluations, meaning it operates without any additional training or fine-tuning for specific tasks. The method has shown consistent improvements in success rates for real-world manipulation tasks across several robot platforms, including Franka, YAM, and SO-100/101. Shirui Chen, a PhD student at UC Berkeley, and Cole Schafroth, a Research Scientist at Google DeepMind, are credited as co-creators of the work.
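
The article does not define the Value-Order Correlation behind the headline 0.947 figure. The sketch below assumes the definition used in earlier VLM value-estimation work: the Spearman rank correlation between the predicted per-frame values and the chronological frame order, so that 1.0 means the values rise perfectly with time. The function names are hypothetical and this is not the authors' evaluation code.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def value_order_correlation(values: Sequence[float]) -> float:
    """Assumed VOC: Spearman correlation of values against frame order."""
    if len(values) < 2:
        raise ValueError("need at least two frames")
    time_order = list(range(1, len(values) + 1))
    return pearson(average_ranks(values), time_order)


# Toy value curves for a 4-frame episode (made-up numbers, not paper data).
voc_monotone = value_order_correlation([0.1, 0.3, 0.5, 0.9])
voc_reversed = value_order_correlation([0.9, 0.5, 0.3, 0.1])
voc_one_swap = value_order_correlation([0.1, 0.5, 0.3, 0.9])
```

Under this assumed definition, a mean VOC near 1 across tasks would indicate that the predicted values almost always increase in step with actual progress through an episode.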

The researchers discussed their paper, "TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics" (arXiv:2602.19313), with hosts Chris Paxton and Jiafei Duan on RoboPapers Episode #75. The discussion covered how TOPReward's logit-based approach taps into the implicit progress-estimation capabilities of open-source VLMs, making it usable for downstream applications such as success detection and reward-aligned behavior cloning.