A recent technical deep dive has found that a common precision mismatch in Reinforcement Learning from Human Feedback (RLHF) training can cause an issue dubbed "phantom clipping," which affects approximately 18% of tokens early in training and causes training to stall. The findings, detailed in a paper by Penghui Qi, Zichen Liu, and others, and highlighted by Thomas Wolf, identify the root cause as the discrepancy between the FP32 forward pass used for training and the BF16 forward pass of inference engines such as vLLM.
The problem surfaced during a sanity check of AsyncGRPO, newly integrated into the TRL library and designed to decouple inference from training for faster scaling. When a trivial setup failed to converge, the researchers investigated what was a known but poorly understood issue. According to Thomas Wolf, a prominent figure in the AI community, "Nobody had pinpointed the actual mechanism. We did in this deep dive by @DirhousssiAmine."
The core mechanism involves the importance sampling ratio r in Proximal Policy Optimization (PPO), whose logarithm is decomposed as log r = α + β. Here, α represents the true policy change, and β is the precision gap between the FP32 training forward pass and a BF16 forward pass on the same weights. Although β is numerically small at the token level, it is not random noise: it has a consistent negative bias and correlates with the advantage.
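The decomposition can be made concrete with a minimal sketch. It assumes one particular reading of the article: the trainer compares its FP32 log-probability under the current weights against the BF16 log-probability recorded by the inference engine at sampling time. The function name, argument names, and numbers below are hypothetical and are not taken from the write-up or from TRL.

```python
def decompose_log_ratio(logp_new_fp32, logp_old_fp32, logp_old_bf16):
    """Split a token's log importance ratio into policy change and precision gap.

    Hypothetical inputs (per-token log-probabilities):
      logp_new_fp32: current weights, FP32 training forward pass
      logp_old_fp32: sampling-time weights, FP32 forward pass
      logp_old_bf16: the same sampling-time weights, BF16 forward pass
    """
    alpha = logp_new_fp32 - logp_old_fp32  # true policy change
    beta = logp_old_fp32 - logp_old_bf16   # precision gap on identical weights
    log_r = logp_new_fp32 - logp_old_bf16  # the ratio the trainer actually sees
    return alpha, beta, log_r


# Made-up values for a single token.
alpha, beta, log_r = decompose_log_ratio(-1.20, -1.25, -1.24)
print(alpha, beta, log_r)  # log_r equals alpha + beta up to float rounding
```

The point of the split is that α vanishes when the weights have not changed, while β does not, so any nonzero log r at that moment is purely numerical.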
The critical discovery is that training fails when this β component enters PPO's clipped objective. Because PPO clips the ratio, a numerical perturbation from β can push the ratio outside the trust region even when the policy has barely moved; the clipped branch of the objective is then selected, and the token receives zero gradient. This "phantom clipping" means tokens are treated as if they had exceeded the trust region because of purely numerical differences, not actual policy changes.
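To see why the clipped branch yields zero gradient, here is a small sketch of the standard per-token PPO surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), differentiated with respect to log r. The function name, the value ε = 0.2, and the β value are illustrative assumptions chosen only so that the ratio crosses the clip boundary; they are not measurements from the write-up.

```python
import math


def ppo_token_grad(log_ratio, advantage, eps=0.2):
    """Gradient of min(r*A, clip(r, 1-eps, 1+eps)*A) with respect to log r."""
    r = math.exp(log_ratio)
    clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
    unclipped_term = r * advantage
    clipped_term = clipped_r * advantage
    if unclipped_term <= clipped_term:
        # Unclipped branch is the minimum: d(r*A)/d(log r) = r*A.
        return r * advantage
    # Clipped branch is the minimum: it is constant in r, so the gradient is zero.
    return 0.0


# Policy has not moved (alpha = 0), negative advantage.
g_no_gap = ppo_token_grad(0.0, advantage=-1.0)     # beta = 0: gradient flows
g_with_gap = ppo_token_grad(-0.25, advantage=-1.0)  # beta = -0.25: ratio < 1 - eps
print(g_no_gap, g_with_gap)
```

In this toy case the only difference between the two calls is the numerical offset, yet the second token contributes nothing to the update.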
The impact is substantial: at early training stages, when the policy has barely moved, roughly 18% of tokens are phantom-clipped. This compounds over time, preventing the deployed policy from improving and locking the system into a permanent stall. Researchers confirmed this causality with targeted interventions, showing that removing β from the ratio or disabling clipping restored convergence.
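One way to picture the measurement and the "remove β" intervention is to count tokens that are clipped with the full log ratio α + β but would not be clipped with α alone. This is a hedged sketch of such a metric, not the researchers' code: the definition of "phantom-clipped" used here, the function names, and the four-token data are all assumptions.

```python
import math


def is_clipped(log_ratio, advantage, eps):
    """True when the PPO clipped branch is active for this token."""
    r = math.exp(log_ratio)
    return (advantage > 0 and r > 1.0 + eps) or (advantage < 0 and r < 1.0 - eps)


def phantom_clip_fraction(alphas, betas, advantages, eps=0.2):
    """Fraction of tokens clipped only because of the precision gap beta."""
    phantom = 0
    for a, b, adv in zip(alphas, betas, advantages):
        if is_clipped(a + b, adv, eps) and not is_clipped(a, adv, eps):
            phantom += 1
    return phantom / len(alphas)


# Made-up batch at the start of training, where alpha = 0 for every token.
alphas = [0.0, 0.0, 0.0, 0.0]
betas = [-0.25, -0.25, 0.01, -0.01]
advantages = [-1.0, 1.0, 1.0, -1.0]

frac_with_beta = phantom_clip_fraction(alphas, betas, advantages)
frac_without_beta = phantom_clip_fraction(alphas, [0.0] * 4, advantages)
print(frac_with_beta, frac_without_beta)
```

Setting every β to zero mirrors the intervention described above: with the gap removed, no token is clipped while the policy is unchanged.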
The study concludes that the issue is not general numerical noise but a specific interaction between precision mismatch and PPO's clipping. Recommended fixes include matching precisions (e.g., FP16 everywhere), computing the ratio from a BF16 shadow forward pass, or widening the clipping range ε. The full write-up includes extensive experiments and interactive explanations.
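The shadow-pass idea can be illustrated with a toy model of reduced precision. The sketch below emulates BF16 rounding in software and applies it to every intermediate of a small log-softmax. This is an assumption made for illustration: real kernels in vLLM or a training framework do not necessarily round this way, Python's float64 stands in for FP32, and all names and logits are hypothetical.

```python
import math
import struct


def to_bf16(x):
    """Round x to the nearest bfloat16 value (ties to even); NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000  # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def identity(v):
    return v


def logprob(logits, token, rnd=identity):
    """Toy log-softmax log-probability, rounding each intermediate with rnd."""
    logits = [rnd(v) for v in logits]
    m = max(logits)
    total = 0.0
    for v in logits:
        total = rnd(total + rnd(math.exp(rnd(v - m))))
    return rnd(rnd(logits[token] - m) - rnd(math.log(total)))


logits = [2.3, -0.7, 0.4, 1.1]  # made-up logits; the weights are "unchanged"
high = logprob(logits, 0)           # stand-in for the FP32 training pass
low = logprob(logits, 0, to_bf16)   # stand-in for the BF16 inference pass

mixed_log_r = high - low                          # mixed precision: this is beta
shadow_log_r = logprob(logits, 0, to_bf16) - low  # BF16 against BF16
print(mixed_log_r, shadow_log_r)
```

With identical weights, the mixed-precision log ratio is nonzero while the same-precision one is exactly zero, which is the property the shadow forward pass relies on. Matching precisions everywhere achieves the same cancellation, whereas widening ε leaves β in place and only makes the trust region more tolerant of it.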