Zeroth-Order Optimization
1 sources - 6 claims
Zeroth-order (ZO) optimization avoids backpropagation by estimating gradients from finite differences of forward passes. The basic symmetric ZO estimator evaluates the loss at positively and negatively perturbed parameters, then multiplies the resulting finite difference by the perturbation vector. Because the perturbations are dense Gaussian vectors, standard ZO methods must touch every parameter at each step. On OPT-6.7B, this perturbation and update work accounts for nearly half of total MeZO step time. Perturbing the full parameter set is also not necessarily more accurate, because the variance of the estimate scales with dimension. As a result, standard zeroth-order optimization is slow and noisy for billion-parameter models.
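The symmetric estimator described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the MeZO implementation: the names `zo_gradient` and `quadratic_loss`, the step size `eps`, and the toy loss are all assumptions made for the example, and the division by `2 * eps` is the usual finite-difference normalization rather than something stated in the text.

```python
import numpy as np


def zo_gradient(loss_fn, theta, eps=1e-3, rng=None):
    """Two-point (symmetric) zeroth-order gradient estimate.

    Hypothetical helper for illustration: it uses only two forward
    evaluations of loss_fn and no backpropagation.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Dense Gaussian perturbation: one random entry per parameter,
    # so every parameter is touched on every step.
    z = rng.standard_normal(theta.shape)
    loss_plus = loss_fn(theta + eps * z)
    loss_minus = loss_fn(theta - eps * z)
    # Scalar finite difference along z, multiplied by the perturbation vector.
    return (loss_plus - loss_minus) / (2.0 * eps) * z


def quadratic_loss(theta):
    # Toy loss (an assumption for this sketch); its true gradient is theta.
    return 0.5 * float(np.sum(theta ** 2))


rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 3.0])

# A single estimate is noisy; averaging many estimates approaches the
# true gradient, which for this toy loss equals theta.
estimates = np.stack(
    [zo_gradient(quadratic_loss, theta, rng=rng) for _ in range(20000)]
)
mean_estimate = estimates.mean(axis=0)
```

Each call returns a vector that points along the random direction `z`, so an individual estimate can be far from the true gradient. Only the average over many draws recovers it, which is one way to see why the method is noisy when the number of parameters is large.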