Sample Reuse

1 source - 4 claims

Sample reuse means applying multiple gradient updates to each newly generated rollout batch. Naive sample reuse in LLM RLVR (reinforcement learning with verifiable rewards) is said to cause catastrophic training collapse. Under naive sample reuse, training initially converges faster than single-use training and then degrades severely. Fixed reuse has a stability-efficiency trade-off: a low reuse count is stable but less efficient, while a high reuse count can collapse.
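
As an illustration of the definition only, the loop below sketches fixed sample reuse: each freshly generated rollout batch receives `reuse_count` gradient updates before the next batch is generated. All names here (`train_with_fixed_reuse`, `ToyLearner`, `reuse_count`) are assumptions made for this sketch and do not come from the source. The toy learner fits a single scalar to noisy samples and is not an RLVR setup.

```python
import random
from typing import Callable, List


def train_with_fixed_reuse(
    generate_rollouts: Callable[[], List[float]],
    gradient_step: Callable[[List[float]], float],
    num_iterations: int,
    reuse_count: int,
) -> List[float]:
    """Apply `reuse_count` gradient updates to each new rollout batch.

    Hypothetical helper: reuse_count=1 corresponds to single-use training.
    """
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    losses = []
    for _ in range(num_iterations):
        batch = generate_rollouts()          # one fresh batch per iteration
        for _ in range(reuse_count):         # reuse the same batch several times
            losses.append(gradient_step(batch))
    return losses


class ToyLearner:
    """Toy stand-in (assumption): fits a scalar `theta` to noisy samples near 1.0."""

    def __init__(self, lr: float = 0.1, seed: int = 0) -> None:
        self.theta = 0.0
        self.lr = lr
        self.rng = random.Random(seed)
        self.rollouts_generated = 0
        self.updates_applied = 0

    def generate_rollouts(self) -> List[float]:
        self.rollouts_generated += 1
        return [1.0 + self.rng.gauss(0.0, 0.1) for _ in range(8)]

    def gradient_step(self, batch: List[float]) -> float:
        # Squared-error loss between theta and the batch mean, one SGD step.
        mean = sum(batch) / len(batch)
        loss = 0.5 * (self.theta - mean) ** 2
        self.theta -= self.lr * (self.theta - mean)
        self.updates_applied += 1
        return loss


learner = ToyLearner()
losses = train_with_fixed_reuse(
    learner.generate_rollouts,
    learner.gradient_step,
    num_iterations=5,
    reuse_count=4,
)
print(learner.rollouts_generated, learner.updates_applied)
```

With these settings the learner generates 5 batches and applies 20 updates, which is the efficiency side of the trade-off: more updates per generated batch. The toy problem is too simple to reproduce the collapse described above, so it should be read as a sketch of the loop structure rather than of the failure mode.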