Online Data Selection

RAFT underperforms even the full RL baseline because retaining only positively rewarded rollouts for supervised fine-tuning sacrifices the policy-gradient-driven exploration essential for generalization. Uniform sampling at the same 40% retention rate consistently underperforms the full-data baseline, confirming that LZE's gains come from principled selection rather than reduced data volume. Performance peaks at a selection ratio κ=0.4; lower ratios starve the policy gradient, while higher ratios allow all-correct and all-incorrect prompts to dilute the gradient signal. Reinforce-Rej applies binary keep-or-drop decisions when prompts become all-correct or all-incorrect, discarding the continuous gradient signal and potentially destabilizing the training distribution. Offline filtering methods are inherently static and become off-policy as the model's capabilities evolve and the effective difficulty of each prompt shifts. The DPS method predicts prompt solvability via a hidden Markov model but requires maintaining a learned generative state model with non-trivial computational overhead.
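
The contrast between principled selection and uniform sampling at the same retention rate can be illustrated with a minimal sketch. The scoring rule here is an assumption, not LZE's actual criterion, which the summary does not specify: each prompt is scored by the variance of its binary rollout rewards, p(1 - p), so all-correct and all-incorrect prompts score zero. The function names (`signal_score`, `select_top_kappa`, `select_uniform`) and the toy rollout data are hypothetical.

```python
import random
from typing import Dict, List


def pass_rate(rewards: List[int]) -> float:
    """Fraction of rollouts for one prompt that received reward 1."""
    return sum(rewards) / len(rewards)


def signal_score(rewards: List[int]) -> float:
    """Assumed proxy for gradient signal: Bernoulli variance p * (1 - p).

    This is zero for all-correct (p = 1) and all-incorrect (p = 0) prompts
    and largest at p = 0.5. It is NOT the scoring rule from the source.
    """
    p = pass_rate(rewards)
    return p * (1.0 - p)


def select_top_kappa(rollout_rewards: Dict[str, List[int]], kappa: float = 0.4) -> List[str]:
    """Keep the top kappa fraction of prompts, ranked by the assumed score."""
    n_keep = max(1, round(kappa * len(rollout_rewards)))
    ranked = sorted(
        rollout_rewards,
        key=lambda pid: signal_score(rollout_rewards[pid]),
        reverse=True,
    )
    return ranked[:n_keep]


def select_uniform(rollout_rewards: Dict[str, List[int]], kappa: float = 0.4, seed: int = 0) -> List[str]:
    """Baseline: keep the same number of prompts, chosen uniformly at random."""
    n_keep = max(1, round(kappa * len(rollout_rewards)))
    rng = random.Random(seed)
    return rng.sample(sorted(rollout_rewards), n_keep)


# Toy data: binary rewards for four rollouts per prompt (hypothetical).
rollouts = {
    "p0": [1, 1, 1, 1],  # all-correct: score 0
    "p1": [0, 0, 0, 0],  # all-incorrect: score 0
    "p2": [1, 0, 1, 0],  # mixed: highest score
    "p3": [1, 0, 0, 0],
    "p4": [1, 1, 1, 0],
}

selected = select_top_kappa(rollouts, kappa=0.4)
baseline = select_uniform(rollouts, kappa=0.4)
```

Both selectors retain the same number of prompts, so any difference in downstream performance would come from which prompts are kept rather than how many. Because the scores are recomputed from the current policy's rollouts at each step, a scheme of this kind stays on-policy, unlike a filter computed once offline.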