Memory-Efficient Optimizers

1 source - 5 claims

Memory-efficient optimizers reduce optimizer-state costs through factorization, quantization, grouping, projection, or fused updates. Adafactor, for example, reduces second-moment memory for matrix parameters by replacing the full matrix with row and column accumulators, so an n x m parameter needs n + m second-moment values rather than n * m. The main benefit may lie less in lowering loss on an unchanged model than in enabling larger models, longer contexts, larger microbatches, or full-parameter fine-tuning: reducing optimizer state can change which training regimes are feasible under a given hardware budget. The measured memory advantage is not fixed, however; it can change under sharding, tensor parallelism, and custom kernels.
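The row and column accumulator idea can be illustrated with a small sketch. It is a simplified, single-step illustration in plain Python, not Adafactor as implemented in any library: it omits the exponential moving averages, the epsilon term, and update clipping, and the function names (full_second_moment_elems, factored_second_moment_elems, factored_estimate) are hypothetical names chosen for this example.

```python
def full_second_moment_elems(rows, cols):
    """Values stored when the second moment is kept as a full matrix."""
    return rows * cols


def factored_second_moment_elems(rows, cols):
    """Values stored when only a row and a column accumulator are kept."""
    return rows + cols


def factored_estimate(sq_grads):
    """Rebuild a second-moment estimate from row sums and column sums.

    sq_grads is a list of rows holding squared gradients (illustrative input).
    The estimate is the outer product of the row sums and column sums divided
    by the grand total, which is exact when sq_grads has rank 1.
    """
    row_acc = [sum(row) for row in sq_grads]        # one value per row
    col_acc = [sum(col) for col in zip(*sq_grads)]  # one value per column
    total = sum(row_acc)
    return [[r * c / total for c in col_acc] for r in row_acc]


# Storage for a single 4096 x 4096 weight matrix (sizes are an assumption).
full = full_second_moment_elems(4096, 4096)          # 16,777,216 values
factored = factored_second_moment_elems(4096, 4096)  # 8,192 values

# A rank-1 matrix of squared gradients is recovered exactly.
estimate = factored_estimate([[1.0, 2.0], [2.0, 4.0]])
```

For matrices that are not rank 1, the reconstruction is only an approximation, which is the accuracy cost paid for the smaller state. The element counts also describe a single unsharded matrix; as noted above, the saving actually observed can differ once state is sharded or tensor-parallel.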