Sign-Based Optimizers
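Lion avoids AdamW's second-moment vector: it keeps a single momentum state, an exponential moving average of gradients, and applies decoupled weight decay as AdamW does. Each Lion update is the sign of an interpolation between the current gradient and that momentum, so every coordinate moves by the same magnitude. signSGD is simpler still: it updates using only the sign of the gradient, which can support one-bit-per-coordinate communication and majority-vote aggregation across workers. Together, sign-based methods challenge the assumption that LLM optimizers need full gradient magnitudes. Because Lion's updates are sign-normalized, the step size is set by the learning rate rather than by the gradient scale, which makes the choice of learning rate and weight decay especially important.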
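The two update rules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names lion_step and majority_vote_sign, the default hyperparameters, and the toy arrays are assumptions chosen for the example, and the form of the Lion step follows the commonly published rule (sign of an interpolation, then a separate momentum update).

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.1):
    """One Lion-style step (hypothetical helper; defaults are illustrative)."""
    # Direction: sign of an interpolation between momentum and current gradient.
    direction = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # Decoupled weight decay: shrink the weights directly, outside the sign.
    new_theta = theta - lr * (direction + weight_decay * theta)
    # The momentum EMA is the only optimizer state; no second-moment vector.
    new_m = beta2 * m + (1.0 - beta2) * grad
    return new_theta, new_m

def majority_vote_sign(worker_grads):
    """signSGD-style aggregation: each worker sends signs, the server votes."""
    signs = np.sign(np.stack(worker_grads))  # one sign per coordinate per worker
    return np.sign(signs.sum(axis=0))        # coordinate-wise majority (0 on ties)

# Toy usage: gradient magnitudes differ by orders of magnitude,
# yet every coordinate moves by exactly lr.
theta = np.array([1.0, -2.0, 0.5])
m = np.zeros_like(theta)
grad = np.array([0.3, -1e-4, 20.0])
theta1, m1 = lion_step(theta, m, grad, lr=0.01, weight_decay=0.0)

# Three workers disagree on some coordinates; the majority sign wins.
vote = majority_vote_sign([
    np.array([0.5, -1.0, 2.0]),
    np.array([-0.1, -3.0, 0.2]),
    np.array([0.7, 0.4, -5.0]),
])
```

In the toy call, the three coordinates each change by 0.01 even though their gradients range from 1e-4 to 20, which is why the learning rate and weight decay carry so much of the tuning burden in this family of methods.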