Large-Batch Optimizers
Increasing the global batch size can improve accelerator utilization, but past some point it yields diminishing optimization returns: a larger batch can raise throughput (tokens per second) while lowering token efficiency, so more tokens are needed to reach the same loss. Large-batch optimizers address this tension between optimization behavior and hardware throughput. LAMB, for example, combines Adam-style adaptivity with layer-wise trust ratios for very large-batch training. Such methods are most relevant when training is communication-limited or when throughput improves meaningfully with larger batches.
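To make the "Adam-style adaptivity with layer-wise trust ratios" idea concrete, here is a minimal NumPy sketch of one LAMB-like update for a single layer. It is an illustration under stated assumptions, not a reference implementation: the function name `lamb_step`, the default hyperparameters, the identity scaling of the weight norm, and the fallback to a trust ratio of 1.0 when either norm is zero are all choices made for this example.

```python
import numpy as np


def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (illustrative sketch).

    w: layer weights, g: gradient, m/v: running first/second moments,
    t: 1-based step count. All defaults are assumptions for this example.
    """
    # Adam-style moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Per-element adaptive direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio: weight norm divided by update norm.
    # Falling back to 1.0 on a zero norm is an assumption of this sketch.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    w_new = w - lr * trust * update
    return w_new, m, v, trust
```

In a full optimizer this step would run once per layer (or per parameter tensor), so each layer gets its own trust ratio and the effective step size scales with that layer's weight norm rather than with a single global learning rate.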