Group Relative Policy Optimization

6 sources - 26 claims

GXPO reached GRPO's peak accuracy threshold faster than GRPO on Llama3.2-3B in steps, time, and backward passes.
Prefix Sampling can convert more rollout compute into usable GRPO updates by steering skewed groups toward 3/8-5/8.
GRPO avoids using a learned value baseline.
GRPO required one backward pass per training step in the reported comparison.
Flat GRPO had substantially lower measured RIFB than InfoTree at training step 150.
The matched baseline used a DeepSWE GRPO++ recipe with specific clipping, loss, filtering, and optimization settings.
In the reported setup, rollout sampling consumed about 95% of wall-clock time.
The paper's theoretical analysis is specific to the GRPO objective.
GRPO has emerged as the dominant RL algorithm for VLA post-training because it eliminates the need for a separate learned critic.
GRPO assigns a single trajectory-level advantage to every chunk in a rollout, causing actor-update compute to be spent uniformly across all trajectory phases including those the policy already handles correctly.
GRPO's learning signal vanishes when rollout rewards are uniform, because the algorithm is driven by advantage variance across rollouts.
GRPO trains a langua…
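
Several of the claims above (no learned value baseline, one trajectory-level advantage per rollout, a vanishing signal when rewards are uniform) follow from how the group-relative advantage is computed. The following is a minimal sketch, not any cited paper's implementation. It assumes the commonly used mean/std normalization within a group of rollouts; the function names, the `eps` constant, the group size of 8, and the binary rewards are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against its own group instead of a learned critic.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_chunks(advantage, num_chunks):
    """Give every chunk of one rollout the same trajectory-level advantage."""
    return [advantage] * num_chunks


# Mixed group: 4 of 8 rollouts succeed, so advantages are nonzero.
mixed = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

# Uniform group: every rollout succeeds, so every advantage is zero
# and the group contributes no learning signal.
uniform = group_relative_advantages([1, 1, 1, 1, 1, 1, 1, 1])

# Every chunk of the first rollout receives the identical advantage.
chunk_advantages = broadcast_to_chunks(mixed[0], num_chunks=5)
```

In this sketch a group whose rewards are all equal yields only zeros, which is why steering groups toward a balanced mix of outcomes turns more rollout compute into usable updates.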