Dynamic Gradient Gating
DGG achieved rollout and wall-clock speedups while matching the single-use baseline's converged performance, and it often improved final performance modestly beyond that baseline. The recommended defaults are maximum reuse K = 4 and a threshold tau between 0.1 and 0.5. DGG monitors the gradient energy of lm_head and its step-wise increment. Gating activates only once the sliding window is populated and the current reuse index is greater than one. When DGG detects an excessive Z-score, it zeros the gradients and ends reuse before Adam updates the optimizer state.
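The gating condition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `GradientGate`, the window size, the `eps` term, and the choice of a one-sided Z-score over past energy increments are all assumptions made for illustration. The source states only that gradient energy and its increment are monitored, that gating needs a populated window and a reuse index above one, and that an excessive Z-score triggers it.

```python
from collections import deque
from statistics import mean, pstdev


class GradientGate:
    """Hypothetical Z-score gate on step-wise gradient-energy increments."""

    def __init__(self, window_size=8, tau=0.3, eps=1e-8):
        self.tau = tau                  # threshold; the text recommends 0.1 to 0.5
        self.eps = eps                  # assumed guard against a zero std-dev
        self.increments = deque(maxlen=window_size)  # sliding window (size assumed)
        self.prev_energy = None

    def should_gate(self, energy, reuse_index):
        """Return True if this step's gradients should be zeroed and reuse ended."""
        if self.prev_energy is None:
            # First observation: no increment can be formed yet.
            self.prev_energy = energy
            return False

        increment = energy - self.prev_energy
        window_full = len(self.increments) == self.increments.maxlen

        # Gating is considered only with a populated window and reuse index > 1.
        if window_full and reuse_index > 1:
            mu = mean(self.increments)
            sigma = pstdev(self.increments)
            z = (increment - mu) / (sigma + self.eps)
            if z > self.tau:            # one-sided test: an assumption
                return True             # gated step is not recorded in the window

        self.increments.append(increment)
        self.prev_energy = energy
        return False
```

In a training loop, a check of this kind would sit after the backward pass and before the optimizer step: when `should_gate` returns `True`, the caller would zero the gradients and break out of the reuse loop, so that Adam's state is never updated from the rejected step. Here `energy` would be a scalar such as the squared norm of the `lm_head` gradient, which is also an assumption about how "gradient energy" is measured.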