AdamW

1 source - 5 claims

AdamW remains the dominant baseline for contemporary LLM pretraining and fine-tuning because it combines adaptive moment estimates with decoupled weight decay. The decay is applied separately from the adaptive denominator, so regularization is not coupled to the adaptive preconditioning. AdamW is not a single fixed baseline: its performance depends on tuning choices and implementation details, and it remains hard to beat when it is strongly tuned and supported by mature implementations. Its main limitation is the memory required to store optimizer state for every trainable parameter.
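
To make the decoupling and the memory cost concrete, here is a minimal sketch of one AdamW update in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names (adamw_step, optimizer_state_bytes), the flat-list representation, and the hyperparameter defaults are all hypothetical, and real libraries differ in details such as epsilon placement and state precision.

```python
import math


def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to flat lists of floats.

    t is the 1-based step count. Returns new (params, m, v) lists.
    Hypothetical helper for illustration only.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for the zero-initialized moments.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Decoupled weight decay: shrink the weight directly, so the decay
        # term never passes through the adaptive denominator below.
        p = p - lr * weight_decay * p
        # Adaptive step using the preconditioned gradient.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def optimizer_state_bytes(n_params, bytes_per_value=4):
    """Rough size of the two moment buffers (m and v), assuming both are
    stored at bytes_per_value bytes per entry (4 = fp32, an assumption)."""
    return 2 * n_params * bytes_per_value
```

With a zero gradient the adaptive step vanishes and only the decay acts, so adamw_step([1.0], [0.0], [0.0], [0.0], t=1, lr=0.1, weight_decay=0.5) shrinks the weight by the factor 1 - lr * weight_decay. The second helper shows where the memory cost comes from: two extra values are kept per trainable parameter, which under the fp32 assumption is 8 bytes per parameter on top of the weights and gradients.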