Curvature-Aware Optimizers
Curvature-aware optimizers try to improve update geometry by approximating second-order information with structure that is cheap enough to use in practice. Shampoo maintains matrix-valued statistics and applies their inverse matrix roots, which gives richer preconditioning than diagonal adaptivity. Sophia uses lightweight diagonal Hessian-like curvature estimates and targets language-model pretraining. These methods add computation and hyperparameters through inverse roots, Hessian estimates, damping, clipping, and schedules. Because each step can cost more than a first-order step, they must be evaluated on wall-clock efficiency as well as token efficiency.
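
To make the diagonal case concrete, here is a minimal pure-Python sketch of a Sophia-style step: a momentum average divided by a diagonal curvature average, then clipped. It is an illustration under stated assumptions, not the published algorithm or any library's API. The names sophia_style_step and run_toy, all hyperparameter values, and the toy quadratic are hypothetical, and the exact Hessian diagonal of the quadratic stands in for a stochastic curvature estimator. The curvature estimate is refreshed only every hess_every steps, which is one way the extra per-step cost might be amortized.

```python
def sophia_style_step(params, grads, state, hess_diag=None, lr=0.05,
                      beta1=0.9, beta2=0.99, gamma=1.0, eps=1e-12):
    """One clipped, diagonally preconditioned update (hypothetical helper).

    Mutates the moving averages in `state` and returns the new parameters.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Momentum: exponential moving average of gradients.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        # Curvature: update the average only when a fresh estimate is supplied.
        if hess_diag is not None:
            state["h"][i] = beta2 * state["h"][i] + (1.0 - beta2) * hess_diag[i]
        # Precondition; eps is a floor that keeps the division safe.
        ratio = state["m"][i] / max(gamma * state["h"][i], eps)
        # Clip so that no coordinate moves by more than lr in one step.
        new_params.append(p - lr * max(-1.0, min(1.0, ratio)))
    return new_params


def run_toy(steps=50, hess_every=10):
    """Run on f(x) = 0.5 * sum(a_i * x_i**2) and return the parameter history."""
    curvature = [100.0, 1.0]   # assumed toy problem: 100x curvature gap
    params = [1.0, 1.0]
    state = {"m": [0.0, 0.0], "h": [0.0, 0.0]}
    history = [params]
    for t in range(steps):
        grads = [a * p for a, p in zip(curvature, params)]
        # Exact diagonal used as a stand-in for a Hessian estimator.
        hess = curvature if t % hess_every == 0 else None
        params = sophia_style_step(params, grads, state, hess_diag=hess)
        history.append(params)
    return history


if __name__ == "__main__":
    print(run_toy(steps=50)[1])
```

On the first step both coordinates move by the same amount, lr, even though their curvatures differ by a factor of 100. Plain gradient descent with the same lr would move them by 5.0 and 0.05 respectively. The sketch also shows where the extra hyperparameters named above come from: gamma scales the curvature term, eps acts as a floor on it, the clip bounds each step, and hess_every sets the refresh schedule.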