Selective Rollout

Selective rollout uses a one-parameter mid-rollout gate to decide whether to stop a group early. When the gate fires, running trajectories are stopped at step K and the group is excluded from the GRPO loss. For each group it removes, the gate saves both the rollout generation after step K and the training compute. Its own overhead is negligible because it needs only a small number of short Levenshtein computations. In the main online experiments, the gate used K = 10 and dL = 0.12.
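
A minimal sketch of how such a gate could be wired up, under stated assumptions. The text above does not give the exact firing criterion, so this sketch assumes the gate fires when the mean pairwise normalized Levenshtein distance between the first K steps of a group's trajectories falls below dL. The aggregation, the direction of the comparison, and all names (levenshtein, normalized_distance, gate_fires, select_groups) are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence, b: Sequence) -> float:
    """Levenshtein distance divided by the longer length, in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def gate_fires(prefixes: List[Sequence], d_l: float = 0.12) -> bool:
    """ASSUMPTION: fire when mean pairwise normalized distance is below d_l."""
    pairs = list(combinations(prefixes, 2))
    if not pairs:
        return False  # a single trajectory gives nothing to compare
    mean = sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)
    return mean < d_l


def select_groups(groups: Dict[str, List[Sequence]],
                  k: int = 10, d_l: float = 0.12) -> List[str]:
    """Return ids of groups that continue past step k and enter the loss."""
    kept = []
    for group_id, trajectories in groups.items():
        prefixes = [t[:k] for t in trajectories]  # only the first k steps
        if not gate_fires(prefixes, d_l):
            kept.append(group_id)
    return kept
```

Because each comparison covers at most K steps, the cost per group is a handful of small dynamic-programming tables, which is consistent with the claim that the gate's overhead is negligible. Groups that are not returned by select_groups would be stopped at step K and left out of the GRPO loss.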