Importance-Corrected GRPO

The unbiasedness result holds formally under an action-independent baseline, whereas the practical GRPO implementation retains residual coupling introduced by group-normalized advantages. Empirically, the pessimistic importance-sampling surcharge stayed near 1x and below 1.5x in the reported runs. For aborted rollouts, the advantages, returns, and response masks are zeroed. DUET applies importance-corrected, gradient-masked GRPO during the update phase.
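
The pieces named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not DUET's actual implementation: the function names, the per-token ratio exp(logp_new - logp_old), the mean over unmasked tokens, and the choice to normalize before zeroing aborted rollouts are all hypothetical. Returns are left out for brevity; they would be zeroed in the same way as advantages.

```python
import math


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style baseline).

    The mean and std include each rollout's own reward, so the baseline is
    not strictly action-independent.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def zero_aborted(advantages, response_masks, aborted):
    """Zero the advantage and the response mask of every aborted rollout."""
    adv = [0.0 if a else x for x, a in zip(advantages, aborted)]
    masks = [[0] * len(m) if a else list(m)
             for m, a in zip(response_masks, aborted)]
    return adv, masks


def importance_corrected_loss(logp_new, logp_old, advantages, response_masks):
    """Masked surrogate loss (assumed form).

    Averages ratio * advantage over unmasked tokens, where
    ratio = exp(logp_new - logp_old) is the per-token importance weight,
    and negates the result so that minimizing it raises expected advantage.
    """
    total, count = 0.0, 0
    for lp_new, lp_old, adv, mask in zip(
        logp_new, logp_old, advantages, response_masks
    ):
        for new, old, m in zip(lp_new, lp_old, mask):
            if m:
                total += math.exp(new - old) * adv
                count += 1
    # Guard against a batch in which every token is masked out.
    return -total / max(count, 1)


# Toy group: four rollouts of two tokens each; the last one was aborted.
rewards = [1.0, 0.0, 1.0, 0.0]
aborted = [False, False, False, True]
masks = [[1, 1] for _ in rewards]
logp_old = [[-1.0, -1.0] for _ in rewards]
logp_new = [[-1.0, -1.0] for _ in rewards]  # on-policy, so every ratio is 1

adv = group_normalized_advantages(rewards)
adv, masks = zero_aborted(adv, masks, aborted)
loss = importance_corrected_loss(logp_new, logp_old, adv, masks)
```

Because the group mean and standard deviation are computed from rewards that include the rollout being scored, each advantage depends on that rollout's own outcome, which is the residual coupling the summary refers to. Zeroing both the advantage and the mask means an aborted rollout adds nothing to the numerator or to the token count.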