Binary-Reward RLVR

2 sources - 9 claims

Prefix Sampling increases the number of update-bearing groups, reward entropy, contrastive structure, and RLOO advantage energy under the same rollout budget. These claims are limited to binary-reward RLVR with grouped rollouts. RLVR trains reasoning-focused language models by sampling candidate rollouts, scoring them with a verifier, and updating the policy from within-group advantages. With binary rewards, the learning signal is strongest when a rollout group contains substantial success-failure contrast; a group whose rollouts all succeed or all fail has zero within-group advantages and contributes no update. Rollout generation dominates the cost of GRPO-style RLVR because a single step can generate hundreds of thousands of tokens. The paper addresses this inefficiency in reinforcement learning with verifiable binary rewards and positions pass-rate control as a practical efficiency objective for binary-reward RLVR. The core efficiency problem in RLVR is reducing generated-token cost without degrading the learning signal. The sources support using generated tokens as the operational budget unit for RL post-training because decoding dominates wall-clock cost when the architecture is fixed.
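The link between success-failure contrast and learning signal can be made concrete with a small sketch. It computes leave-one-out (RLOO) advantages for one group of binary rewards, along with the entropy of the group's empirical pass rate. The function names are hypothetical, and "advantage energy" is assumed here to mean the mean squared RLOO advantage within a group; the source does not define the term in this section, so treat that definition as illustrative only.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean of the others."""
    g = len(rewards)
    if g < 2:
        raise ValueError("RLOO needs at least two rollouts per group")
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]


def reward_entropy(rewards):
    """Binary entropy (in bits) of the group's empirical pass rate."""
    p = sum(rewards) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def advantage_energy(rewards):
    """Assumed definition: mean squared RLOO advantage in the group."""
    adv = rloo_advantages(rewards)
    return sum(a * a for a in adv) / len(adv)


def is_update_bearing(rewards):
    """A binary-reward group carries signal only if it mixes 0s and 1s."""
    return 0 < sum(rewards) < len(rewards)
```

Under this assumed definition, a group of G rollouts with k successes has energy k(G - k) / (G - 1)^2, which is zero for all-success or all-failure groups and largest when the pass rate is near one half. That is consistent with treating pass-rate control as a way to get more signal from the same number of generated tokens.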