Reinforcement Learning Post-Training

1 source - 5 claims

Reinforcement learning post-training with verifiable rewards has become the dominant approach for improving mathematical reasoning in large language models. GRPO and DAPO generate multiple rollouts per prompt, aggregate rewards within each group, and update the policy, but they allocate compute nearly uniformly across all prompts. When a prompt's pass rate approaches 1 or 0, nearly every rollout in its group receives the same reward, so the group-relative advantages collapse to zero and the prompt contributes no gradient; the compute spent on prompts at both extremes is wasted. All LZE experiments use the Verl framework with GRPO/DAPO training, 8 rollouts per prompt, and the KL-divergence penalty removed. All experiments use binary verifiable rewards from a rule-based answer-matching checker; whether the design transfers to continuous or noisy reward settings is an open question.
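
The advantage collapse described above can be illustrated with a minimal sketch. It is not the Verl implementation: the function names, the `eps` constant, the use of the population standard deviation, and the assumption that rollouts succeed independently are all illustrative choices made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization of rewards within one prompt's rollout group.

    Hypothetical helper: subtracts the group mean and divides by the group
    standard deviation (population form, an assumption) plus a small eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_uniform_group(pass_rate, group_size=8):
    """Probability that all rollouts in a group get the same binary reward.

    Assumes each rollout is correct independently with probability pass_rate.
    """
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size


# Binary rewards for 8 rollouts of a single prompt.
mixed = [1, 0, 1, 0, 0, 1, 1, 0]   # some rollouts correct, some not
all_correct = [1] * 8              # pass rate 1
all_wrong = [0] * 8                # pass rate 0

adv_mixed = group_relative_advantages(mixed)
adv_correct = group_relative_advantages(all_correct)
adv_wrong = group_relative_advantages(all_wrong)

# Mixed groups carry a learning signal; uniform groups do not.
print(any(a != 0.0 for a in adv_mixed))     # True
print(all(a == 0.0 for a in adv_correct))   # True
print(all(a == 0.0 for a in adv_wrong))     # True

# Under the independence assumption, a prompt at pass rate 0.5 rarely
# yields a uniform group, while one near 0 or 1 almost always does.
print(prob_uniform_group(0.5))              # 0.0078125
```

Under these assumptions, a prompt whose rollouts are all correct or all incorrect yields a zero advantage for every rollout, so its rollouts add nothing to the policy gradient even though they cost the same to generate as those of an informative prompt.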