Baselines

The evaluated baselines included Random Search, CMA-ES, sliding-window GP-UCB Bayesian Optimization, and NoMemory-RASP. Contextual GP-UCB can model objectives jointly over parameters and context, but exact GP inference scales cubically in the number of observations. The paper's comparison with contextual GP-UCB accordingly emphasizes its higher measured per-step latency, driven by GP inference and acquisition optimization. CMA-ES is characterized as a strong local search with low per-step cost, but it keeps no explicit context-solution associations: the paper's comparison emphasizes that CMA-ES state is not explicitly indexed by context.

Prior efficient-RLVR methods usually control either prompt or rollout selection, or within-rollout pruning, rather than jointly controlling rollout count and rollout length. DAPO post-filters after rollouts are generated, so it does not save the main generation cost. Fixed length caps are treated as suboptimal because they can penalize unfinished but coherent reasoning and remove useful long reasoning chains. VIP adapts rollout allocation but treats rollout length as exogenous. ARRoL saved only a small fraction of generated tokens in the math setting because few rollouts reached its inspection point.
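To make the GP scaling concern concrete, the following is a minimal sketch of one sliding-window GP-UCB selection step using NumPy. It assumes an RBF kernel with fixed hyperparameters, a discrete candidate set, and hypothetical defaults (`window=50`, `beta=2.0`, `noise=1e-3`, `lengthscale=1.0`); none of these values come from the paper. The Cholesky factorization of the n-by-n kernel matrix is the O(n^3) step, and the sliding window keeps n bounded. A contextual variant would typically append context features to each input row so that the kernel spans parameters and context jointly.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel between rows of a (n, d) and b (m, d).
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_ucb_select(X, y, candidates, window=50, beta=2.0, noise=1e-3):
    """Return the index of the candidate maximizing mean + sqrt(beta) * std.

    Only the last `window` observations are used (hypothetical default), which
    caps the size of the kernel matrix that must be factorized.
    """
    X, y = X[-window:], y[-window:]
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)  # O(n^3): the dominant cost of exact GP inference
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    Ks = rbf_kernel(candidates, X)  # (m, n) cross-covariances
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)  # (n, m)
    var = np.maximum(1.0 - np.sum(v**2, 0), 0.0)  # RBF prior variance is 1
    ucb = mean + np.sqrt(beta) * np.sqrt(var)
    return int(np.argmax(ucb)), mean, var


# Usage on a toy 1-D objective with its maximum at x = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
y = -(X[:, 0] - 0.5) ** 2
cands = np.linspace(-2.0, 2.0, 81)[:, None]
idx, mean, var = gp_ucb_select(X, y, cands)
print(f"next query: {cands[idx, 0]:.2f}")
```

Without the window, every new observation grows the matrix passed to `cholesky`, which is why per-step latency climbs over a long run; the window trades that growth for forgetting older data.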
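The claim about DAPO's post-filtering can be illustrated with a small accounting sketch. DAPO's dynamic sampling drops prompts whose sampled group is all correct or all incorrect, since such groups carry zero group-relative advantage. The point is that the filter runs only after every rollout in the group has been generated. The `Rollout` type, the rewards, and the token counts below are hypothetical, and the refill/over-sampling step of the full method is omitted.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    reward: float    # e.g. 1.0 if the final answer verifies, else 0.0
    num_tokens: int  # tokens generated for this rollout


def dynamic_sampling_filter(groups):
    """Keep prompt groups whose rollouts do not all share the same reward.

    Returns the kept groups, the tokens generated for all groups, and the
    tokens that actually reach the training batch.
    """
    kept = [g for g in groups if len({r.reward for r in g}) > 1]
    generated = sum(r.num_tokens for g in groups for r in g)  # cost already paid
    trained = sum(r.num_tokens for g in kept for r in g)
    return kept, generated, trained


groups = [
    [Rollout(1.0, 300), Rollout(1.0, 280), Rollout(1.0, 310), Rollout(1.0, 290)],     # all correct
    [Rollout(0.0, 900), Rollout(1.0, 400), Rollout(0.0, 850), Rollout(1.0, 420)],     # mixed
    [Rollout(0.0, 1200), Rollout(0.0, 1100), Rollout(0.0, 1150), Rollout(0.0, 1300)], # all wrong
]
kept, generated, trained = dynamic_sampling_filter(groups)
print(len(kept), generated, trained)  # prints: 1 8500 2570
```

In this toy batch, roughly 70% of the generated tokens are discarded after generation, so filtering improves the training signal without reducing the dominant generation cost.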
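The objection to fixed length caps can be seen with a hypothetical hard-cap reward rule: a rollout cut off at the cap never reaches a final answer, so it scores zero even when its reasoning was on track. The rule `capped_reward`, the cap values, and the rollout lengths below are illustrative assumptions, not the reward used by any cited method.

```python
def capped_reward(num_tokens, is_correct, cap):
    """Hypothetical hard-cap rule: a rollout longer than `cap` is truncated and scores 0."""
    if num_tokens > cap:
        return 0.0  # unfinished reasoning is penalized regardless of its coherence
    return 1.0 if is_correct else 0.0


# (tokens the rollout needs to finish, whether the finished answer is correct)
rollouts = [(800, True), (1500, True), (3000, True), (600, False)]
for cap in (1024, 4096):
    print(cap, [capped_reward(n, ok, cap) for n, ok in rollouts])
# prints:
# 1024 [1.0, 0.0, 0.0, 0.0]
# 4096 [1.0, 1.0, 1.0, 0.0]
```

With the tighter cap, the two long but correct chains receive the same reward as the wrong answer, so the learning signal pushes away from exactly the long reasoning that was succeeding.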
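Why ARRoL's savings depend on how many rollouts reach its inspection point can be shown with a simplified accounting model. This is not ARRoL's actual pruning rule. It assumes that a pruned rollout stops exactly at a fixed token position `inspect_at`, so the only tokens saved are those a rollout would have generated beyond that point. The function name, lengths, and inspection positions are hypothetical.

```python
def pruning_savings(lengths, inspect_at, pruned):
    """Fraction of generated tokens saved by stopping pruned rollouts at `inspect_at`.

    lengths: full length each rollout would reach without pruning.
    pruned:  per-rollout flag for whether the (hypothetical) inspector stops it.
    Rollouts that finish before `inspect_at` are never inspected and save nothing.
    """
    total = sum(lengths)
    saved = sum(n - inspect_at for n, p in zip(lengths, pruned) if p and n > inspect_at)
    return saved / total


lengths = [400, 500, 600, 700, 900, 2500]  # most rollouts finish early
pruned = [True] * len(lengths)             # best case: every inspected rollout is pruned
for t in (512, 2048):
    print(t, round(pruning_savings(lengths, t, pruned), 3))
# prints:
# 512 0.474
# 2048 0.081
```

Even in the best case where every inspected rollout is pruned, a late inspection point that only one long rollout reaches saves about 8% of tokens in this toy batch, which mirrors the pattern reported for the math setting.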