AdaLeZO

1 source - 6 claims

AdaLeZO adaptively selects which layers receive zeroth-order perturbations by treating layers as arms in a non-stationary multi-armed bandit problem, concentrating a limited perturbation budget on layers estimated to be sensitive. At each step it samples only a subset of layers and generates Gaussian perturbations only for those active layers, which reduces perturbation and update work from the dense, full-parameter cost to a cost approximately proportional to the sampling ratio. On OPT-6.7B, AdaLeZO improves throughput relative to MeZO while preserving peak memory. Because it changes the spatial allocation of perturbations rather than the underlying optimizer family, AdaLeZO can wrap several other zeroth-order optimizers.
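
To make the layer-sampling idea concrete, the following is a minimal NumPy sketch of one layer-sparse zeroth-order step. It is not the AdaLeZO implementation: the function names (`select_layers`, `sparse_zo_step`), the softmax sampling rule, the exponentially decayed score update, and the use of the magnitude of the shared projected gradient as a reward are all assumptions chosen for illustration. It also stores the perturbations in memory for readability, whereas MeZO-style methods regenerate them from a random seed.

```python
import numpy as np

def select_layers(scores, k, rng, temperature=1.0):
    """Sample k distinct layer indices; higher-scored layers are more likely.
    (Softmax sampling is an assumed stand-in for the bandit policy.)"""
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

def sparse_zo_step(params, loss_fn, scores, k, rng,
                   eps=1e-3, lr=1e-2, decay=0.9):
    """One two-point (SPSA-style) step that touches only k sampled layers.
    `params` is a list of per-layer arrays and is updated in place."""
    active = select_layers(scores, k, rng)

    # Gaussian perturbations are generated for the active layers only.
    z = {i: rng.standard_normal(params[i].shape) for i in active}

    for i in active:
        params[i] += eps * z[i]
    loss_plus = loss_fn(params)

    for i in active:
        params[i] -= 2 * eps * z[i]
    loss_minus = loss_fn(params)

    for i in active:
        params[i] += eps * z[i]  # restore the unperturbed weights

    # Scalar projected-gradient estimate shared by all active layers.
    g = (loss_plus - loss_minus) / (2 * eps)

    for i in active:
        params[i] -= lr * g * z[i]
        # Decayed average so old evidence fades (non-stationary setting).
        # Using |g| as the per-layer reward is an illustrative assumption.
        scores[i] = decay * scores[i] + (1 - decay) * abs(g)

    return active, g

# Toy usage: 5 "layers", perturb 2 per step (sampling ratio 0.4).
rng = np.random.default_rng(0)
params = [np.ones(4) for _ in range(5)]
scores = np.zeros(5)
loss_fn = lambda p: float(sum((x ** 2).sum() for x in p))
for _ in range(3):
    active, g = sparse_zo_step(params, loss_fn, scores, k=2, rng=rng)
```

In this sketch the perturbation, restore, and update loops each visit only the `k` active layers, which is where the cost roughly proportional to the sampling ratio comes from; the two forward passes through `loss_fn` are unchanged. Swapping the final update line for a different zeroth-order update rule would leave the layer selection untouched, which mirrors the claim that the method can wrap other optimizers.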