Zero-Variance Groups

A zero-variance group is one in which every trajectory receives the same terminal reward. In the Qwen2.5-7B ALFWorld setting, 39% of an offline sample of 100 groups were zero-variance, and the on-policy run averaged around 40%. When advantages are computed relative to the group's mean reward, every trajectory in such a group has zero advantage, so the group adds no policy-gradient signal while still consuming rollout and training compute. Removing these zero-advantage groups is proposed as a way to reduce batch dilution and increase the effective update magnitude.
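A minimal sketch of how zero-variance groups could be detected and dropped before an update. It assumes mean-centered, group-relative advantages, which the source does not spell out; the function names, the tolerance `tol`, and the toy rewards are hypothetical and are not taken from the ALFWorld runs.

```python
def is_zero_variance(rewards, tol=1e-8):
    """True when all terminal rewards in a non-empty group are (numerically) equal."""
    return max(rewards) - min(rewards) <= tol


def group_relative_advantages(rewards):
    """Mean-centered advantages (assumed form): reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def drop_zero_variance_groups(groups, tol=1e-8):
    """Return the groups that carry signal and the fraction that was dropped."""
    kept = [g for g in groups if not is_zero_variance(g, tol)]
    dropped_fraction = (len(groups) - len(kept)) / len(groups)
    return kept, dropped_fraction


# Toy data: each inner list holds the terminal rewards of one group.
groups = [
    [1.0, 1.0, 1.0, 1.0],  # all succeed: zero variance
    [0.0, 0.0, 0.0, 0.0],  # all fail: zero variance
    [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: carries signal
    [0.0, 1.0, 0.0, 0.0],  # mixed outcomes: carries signal
    [1.0, 1.0, 1.0, 1.0],  # all succeed: zero variance
]

kept, dropped_fraction = drop_zero_variance_groups(groups)
print(len(kept), dropped_fraction)
print(group_relative_advantages(groups[0]))  # every advantage is 0.0
print(group_relative_advantages(groups[2]))
```

In this toy batch, three of the five groups contribute only zero advantages, so keeping them would spend most of the batch on trajectories that do not move the policy. This is the dilution that the proposed filtering is meant to avoid.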