Layer Sensitivity

Accumulated AdaLeZO sampling probabilities increasingly align with Adam gradient norms over training. The raw ZO magnitude proxy has positive empirical correlation with Adam gradient norms. AdaLeZO tracks selected-layer rewards using an ex…

1 sources - 6 claims

Accumulated AdaLeZO sampling probabilities increasingly align with Adam gradient norms over training. The raw ZO magnitude proxy has positive empirical correlation with Adam gradient norms. AdaLeZO tracks selected-layer rewards using an exponential moving average. AdaLeZO uses the absolute scalar finite-difference magnitude as a proxy for layer sensitivity. AdaLeZO learns layer sensitivity over time from noisy scalar feedback. The multi-armed bandit acts as a temporal denoiser by aggregating noisy rewards into layer value estimates.