Slow-Fast Policy Optimization

SFPO is treated as an optimizer-side lookahead method close to the setting the paper studies. Each SFPO update runs K fast inner steps and then applies a slow correction, so one update costs K + 1 backward passes. GXPO generally matched or exceeded SFPO while using fewer active-phase backward passes at larger K. In the model-group comparisons, GXPO improved over the strongest SFPO setting by 0.14 to 1.28 points of average pass@1.
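
The K + 1 cost can be made concrete with a toy sketch of a slow-fast update. This is not SFPO's actual update rule, which the summary does not spell out. It assumes a scalar parameter, a quadratic objective, and a hypothetical slow correction that interpolates back toward the starting point before taking one more gradient step. The names `grad`, `slow_fast_update`, `fast_lr`, `alpha`, and `slow_lr` are all illustrative.

```python
def grad(theta: float) -> float:
    """Gradient of the toy objective f(theta) = 0.5 * theta**2.

    Each call stands in for one backward pass.
    """
    return theta


def slow_fast_update(theta: float, k: int, fast_lr: float = 0.1,
                     alpha: float = 0.5, slow_lr: float = 0.1):
    """One slow-fast update: k fast inner steps, then one slow correction.

    The correction used here (interpolate toward the start, then one
    gradient step) is an assumption made for illustration only.
    Returns the new parameter and the number of backward passes used.
    """
    backward_passes = 0
    start = theta

    # Fast phase: k inner gradient steps, one backward pass each.
    for _ in range(k):
        theta = theta - fast_lr * grad(theta)
        backward_passes += 1

    # Slow phase (hypothetical): pull the fast iterate back toward the
    # starting point, then apply one more gradient step from there.
    theta = start + alpha * (theta - start)
    theta = theta - slow_lr * grad(theta)
    backward_passes += 1

    return theta, backward_passes


if __name__ == "__main__":
    new_theta, passes = slow_fast_update(1.0, k=3)
    print(f"theta={new_theta:.5f}, backward passes={passes}")
```

Under these assumptions the pass counter always ends at k + 1, which is why the per-update cost of a method with this shape grows linearly with K.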