Gumbel Straight-Through Estimation

RCO samples Gumbel noise, perturbs the logits with it, and averages multiple samples per step to reduce variance. In the described training procedure, the temperature is annealed exponentially from 1.0 to 0.01. In Qwen3-8B ablations, the Gumbel sample count was the most important hyperparameter. Compute-matched comparisons favored more samples per step over more steps. The full convergence behavior with Adam, Gumbel noise, and temperature annealing is supported empirically rather than by proof.
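The sampling step and temperature schedule described above can be sketched in plain Python. This is a minimal forward-pass illustration under stated assumptions, not RCO's actual implementation: the function names (`temperature`, `sample_gumbel`, `gumbel_softmax_mean`, `hard_one_hot`) are hypothetical, the schedule is assumed to be a geometric interpolation between the two endpoints, and the averaging is assumed to be taken over the relaxed (softmax) samples. The straight-through gradient itself requires an autograd framework and is not shown.

```python
import math
import random


def temperature(step, total_steps, t_start=1.0, t_end=0.01):
    """Exponential (geometric) interpolation from t_start to t_end (assumed form)."""
    if total_steps <= 1:
        return t_end
    frac = step / (total_steps - 1)
    return t_start * (t_end / t_start) ** frac


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate by inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_mean(logits, tau, num_samples, rng):
    """Average num_samples relaxed samples softmax((logits + g) / tau)."""
    acc = [0.0] * len(logits)
    for _ in range(num_samples):
        # Perturb every logit with independent Gumbel noise, then scale by 1/tau.
        perturbed = [(l + sample_gumbel(rng)) / tau for l in logits]
        probs = softmax(perturbed)
        acc = [a + p for a, p in zip(acc, probs)]
    return [a / num_samples for a in acc]


def hard_one_hot(soft):
    """Forward value of a straight-through estimator: one-hot of the argmax."""
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, -1.0]  # illustrative values only
    total_steps = 5
    for step in range(total_steps):
        tau = temperature(step, total_steps)
        soft = gumbel_softmax_mean(logits, tau, num_samples=8, rng=rng)
        print(step, round(tau, 4), [round(p, 3) for p in soft], hard_one_hot(soft))
```

In an autograd framework, the straight-through output is commonly written as `hard + soft - stop_gradient(soft)`, so the forward pass uses the one-hot value while gradients flow through the relaxed sample. Raising `num_samples` lowers the variance of the averaged sample at the cost of more forward computation per step.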