Matrix-Based Optimizers

Matrix-based optimizers treat Transformer weights as matrices rather than as collections of independent scalar coordinates. A matrix-based update transforms the update matrix as a whole, using operations such as normalization, whitening, or orthogonalization. Muon is one example: it forms a momentum matrix and applies approximate orthogonalization to it before updating the weights.

This focus on matrix-level update geometry is what separates Muon from AdamW (coordinate scaling), Lion (sign normalization), GaLore (low-rank memory compression), and Shampoo (curvature preconditioning).

Matrix-based optimizers usually require hybrid parameter grouping, because biases, normalization parameters, scalar parameters, and small tensors are less natural candidates for a matrix-level update; these are typically left to a coordinate-wise optimizer. The theoretical basis for matrix-based optimizers also remains incomplete, with open questions about the geometry of orthogonalized updates and about shard-level orthogonalization.
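As a rough illustration of the momentum-then-orthogonalize step and of hybrid parameter grouping, here is a minimal NumPy sketch. It makes several assumptions that are not in the text above: the approximate orthogonalization is done with a plain cubic Newton-Schulz iteration (one common choice, not necessarily the one any given Muon implementation uses), the hyperparameters `lr`, `beta`, and `steps` are placeholder values, and the helper `uses_matrix_update` with its name-based rule and `min_dim` threshold is purely hypothetical.

```python
import numpy as np


def orthogonalize(m, steps=10, eps=1e-7):
    """Approximately map m to a semi-orthogonal matrix with the same singular vectors.

    Assumption: uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X,
    which pushes every nonzero singular value of X toward 1.
    """
    # Dividing by the Frobenius norm puts all singular values in [0, 1],
    # inside the range where the iteration converges.
    x = m / (np.linalg.norm(m) + eps)
    # Work on the wide orientation so that x @ x.T is the smaller Gram matrix.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95, steps=10):
    """One Muon-style update for a single 2-D weight (illustrative values only)."""
    momentum = beta * momentum + grad          # form the momentum matrix
    update = orthogonalize(momentum, steps)    # approximate orthogonalization
    return weight - lr * update, momentum


def uses_matrix_update(name, param, min_dim=8):
    """Hypothetical grouping rule: route only reasonably large 2-D weights
    to the matrix-based update; everything else goes to a coordinate-wise
    optimizer."""
    if param.ndim != 2 or min(param.shape) < min_dim:
        return False  # biases, scalars, and small tensors
    lowered = name.lower()
    return not ("norm" in lowered or "ln" in lowered or "bias" in lowered)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((16, 32))
    g = rng.standard_normal((16, 32))
    w_new, mom = muon_step(w, g, np.zeros_like(w))
    q = orthogonalize(mom, steps=30)
    # Distance of q q^T from the identity; shrinks as steps increases.
    print(np.abs(q @ q.T - np.eye(16)).max())
```

With only a few iterations the result is approximately, not exactly, orthogonal, which matches the "approximate orthogonalization" described above. In a real training loop, the parameters for which `uses_matrix_update` returns false would be handed to a separate coordinate-wise optimizer.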