Empirical Evaluation

8 sources - 43 claims

For frozen n=500 timing, the full mean was 15,502.8 ms and the sampled mean was 1,553.6 ms. The sampled metric tracked full SND during nonstationary training and achieved the predicted 10x metric-call speedup.

RASP-Tuner achieved the lowest mean terminal regret on all three real-world tabular streams. Across nine non-adversarial synthetic tasks, terminal paired tests favored RASP-Tuner over the stronger of GP-UCB and CMA-ES in seven tasks at p < 0.05.

GXPO consistently improved average sampled pass@1 over GRPO in the reported experiments. Across reported configurations, GXPO improved average results over GRPO by 1.65 to 5.00 points depending on the model group.

In controlled high-variability queue experiments, FedQueue had the fastest time-to-target and the lowest data movement ratio and local-step count.

At 50% budget, DUET exceeded all full-budget baselines on MATH-500 and ran 2.51 times faster than full-budget GRPO. On Qwen3-1.7B, full-budget DUET outperformed full-budget GRPO and budget-aware baselines on most benchmarks while running 1.62 times faster than GRPO. DUET’s MATH-500 advantage over GRPO increased as the budget tightened from full to quarter budget. DUET transferred to…