Prefix Sampling
Prefix Sampling is a bidirectional controller for rollout groups of size eight. It filters degenerate groups, trains directly on balanced groups, and replays saved successful or failing prefixes for skewed groups. Prefix length is controlled separately for the 1/8, 2/8, 6/8, and 7/8 buckets, and the adaptive controller maintains an exponential moving average of the rerollout pass rate for each bucket. Hard prefixes are intended to raise rerollout pass rates, while easy prefixes are intended to lower them. Prefix tokens are masked out of the reinforcement-learning loss, so the replayed off-policy actions receive no credit from the new rollout.
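The routing, per-bucket moving average, and loss masking described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the source's implementation: it assumes each bucket label k/8 means that k of the eight rollouts in a group passed, that "degenerate" means 0/8 or 8/8, and that "balanced" covers the remaining 3/8 to 5/8. The names route_group, BucketEma, and loss_mask, the smoothing factor alpha, and the initial average of 0.5 are all hypothetical. How the moving average is turned into a prefix length is not specified here, so it is left out.

```python
from dataclasses import dataclass, field

GROUP_SIZE = 8
SKEWED_BUCKETS = (1, 2, 6, 7)  # assumed: number of passing rollouts out of 8


def route_group(num_passed: int) -> str:
    """Decide what to do with a rollout group given its pass count."""
    if not 0 <= num_passed <= GROUP_SIZE:
        raise ValueError("num_passed must be between 0 and 8")
    if num_passed in (0, GROUP_SIZE):
        return "filter"         # assumed degenerate case: all fail or all pass
    if num_passed in SKEWED_BUCKETS:
        return "replay_prefix"  # skewed: rerun from a saved prefix
    return "train"              # assumed balanced range: 3/8 to 5/8


@dataclass
class BucketEma:
    """Exponential moving average of rerollout pass rate, one per bucket."""
    alpha: float = 0.1  # hypothetical smoothing factor
    ema: dict = field(
        default_factory=lambda: {b: 0.5 for b in SKEWED_BUCKETS}  # hypothetical start
    )

    def update(self, bucket: int, rerollout_pass_rate: float) -> float:
        """Blend a new rerollout pass rate into the bucket's running average."""
        if bucket not in self.ema:
            raise KeyError(f"no controller state for bucket {bucket}/8")
        old = self.ema[bucket]
        self.ema[bucket] = (1.0 - self.alpha) * old + self.alpha * rerollout_pass_rate
        return self.ema[bucket]


def loss_mask(prefix_len: int, total_len: int) -> list:
    """Return 0 for replayed prefix tokens and 1 for newly sampled tokens."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return [0] * prefix_len + [1] * (total_len - prefix_len)
```

Multiplying per-token losses by the mask before averaging keeps the replayed prefix out of the gradient, while the newly sampled continuation is trained on as usual.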