Large-Batch Optimizers

Increasing the global batch size can improve accelerator utilization, but beyond some point it yields diminishing optimization returns: a larger batch can raise throughput (tokens per second) while lowering token efficiency, so each token contributes less training progress. Large-batch optimizers address this coupling between optimization behavior and hardware throughput. LAMB, for example, combines Adam-style adaptivity with layer-wise trust ratios for very large-batch training. Such methods are most relevant when training is communication-limited or when throughput improves meaningfully with larger batches.
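To make the phrase "Adam-style adaptivity with layer-wise trust ratios" concrete, here is a minimal NumPy sketch of one LAMB-style update for a single layer. The function name `lamb_step`, the hyperparameter defaults, and the choice to use the raw weight norm in the trust ratio (rather than a clipped version) are illustrative assumptions, not details taken from the text above.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (hypothetical helper).

    w, g : weights and gradient of one layer (same shape)
    m, v : running first and second moment estimates
    t    : 1-based step count, used for bias correction
    """
    # Adam-style moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Adaptive direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio: ||w|| / ||update||.
    # Assumption: fall back to 1.0 when either norm is zero.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    w_new = w - lr * trust * update
    return w_new, m, v, trust

# Example: one step on a small layer.
w = np.ones(4)
g = 0.5 * np.ones(4)
w_new, m, v, trust = lamb_step(w, g, np.zeros(4), np.zeros(4), t=1)
```

In this sketch, whenever both norms are nonzero the step length equals `lr * ||w||` regardless of how large the gradient is. That per-layer normalization is what the trust ratio contributes on top of the Adam-style direction.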