Gradient Extrapolation-Based Policy Optimization

1 sources - 6 claims

GXPO keeps the same rollout batch, rewards, advantages, KL regularization, and GRPO loss while changing only the parameter update. GXPO was configured with alpha_0 of 0.5, delta of 1e-8, tau of 0.5, and trajectory-aware shutoff in the Qwen2.5-7B setup. GXPO is introduced as a change to the policy update rule rather than a replacement for the broader reinforcement learning pipeline. GXPO replaces a single GRPO update with a three-backward-pass active-phase update. GXPO computes a real corrective gradient after repositioning partway toward the extrapolated point. GXPO estimates a virtual K-step policy point by scaling an observed two-step displacement with a geometric-sum ratio.