Theoretical Analysis

A diagonal-quadratic sanity check shows that idealized GXPO can land at the same point as K + 1 plain gradient-descent (GD) steps using only three backward passes. The theoretical analysis uses a plain-GD surrogate to explain extrapolation geometry, not AdamW state dynamics. The surrogate theory explains exactness for diagonal quadratic losses and identifies failure modes from coupling, ratio error, inactive coordinates, and Taylor remainder. Under a fixed local quadratic Hessian H, plain GD with step size eta produces gradients that follow repeated multiplication by (I - eta H). For diagonal Hessians, each gradient coordinate therefore follows exact geometric decay.

The theoretical analysis assumes smoothness, bounded gradients, and a lower-bounded objective. Without clipping, AdaLeZO is unbiased for the full smoothed gradient under sampling with replacement. The smoothed gradient approximates the true gradient with an error bound depending on the smoothing scale, dimension, and smoothness constant. The convergence result gives an O(1/sqrt(T)) rate for the smoothed objective and adds a smoothing error term for the original objective. The variance is minimized when sampling probabil…
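
The geometric-decay claim for diagonal Hessians can be checked numerically. The sketch below is only an illustration of that geometry, not the GXPO procedure itself: it assumes a toy loss f(theta) = 0.5 * sum(h * theta^2), estimates the per-coordinate decay ratio from two consecutive gradients, and sums the resulting geometric series in closed form. The names `plain_gd` and `extrapolate`, the dimension, and the values of `eta` and `K` are hypothetical choices made for the example; how GXPO spends its three backward passes is not specified here.

```python
import numpy as np

def plain_gd(theta0, h, eta, steps):
    """Run `steps` plain GD updates on f(theta) = 0.5 * sum(h * theta**2)."""
    theta = theta0.copy()
    for _ in range(steps):
        # The gradient of the diagonal quadratic is h * theta.
        theta = theta - eta * (h * theta)
    return theta

def extrapolate(theta0, g0, g1, eta, K):
    """Closed-form endpoint of K + 1 GD steps from two consecutive gradients.

    Assumes every coordinate of g0 is nonzero and no ratio equals 1;
    otherwise the divisions below are undefined.
    """
    r = g1 / g0                               # per-coordinate ratio, 1 - eta * h
    geom = (1.0 - r ** (K + 1)) / (1.0 - r)   # sum of r**k for k = 0..K
    return theta0 - eta * g0 * geom

# Toy problem (all values are illustrative assumptions).
rng = np.random.default_rng(0)
h = rng.uniform(0.5, 2.0, size=5)   # diagonal Hessian entries
theta0 = rng.normal(size=5)
eta, K = 0.1, 7

# Two gradient evaluations: at theta0 and after one GD step.
g0 = h * theta0
theta1 = theta0 - eta * g0
g1 = h * theta1

reference = plain_gd(theta0, h, eta, K + 1)      # K + 1 sequential steps
shortcut = extrapolate(theta0, g0, g1, eta, K)   # closed-form extrapolation
max_gap = float(np.max(np.abs(reference - shortcut)))
```

On this toy problem the two endpoints agree to floating-point precision. The divisions by `g0` and by `1 - r` mark where the shortcut becomes fragile, which is consistent with the inactive-coordinate and ratio-error failure modes named above. A non-diagonal Hessian would couple coordinates and break the per-coordinate ratio.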