Linear Attention
1 sources - 5 claims
Replacing softmax attention with a linear recurrence yields a speedup that grows with video length, because softmax attention costs quadratic time in sequence length while the recurrence costs linear time. Among the gating granularities tested, headwise gating, with 18K parameters per layer, achieves the best quality-efficiency tradeoff. Elementwise gating performs worse than headwise gating because its 130× larger parameterization converges poorly with limited training steps. The learned headwise gate values remain near 0.5, indicating that the softmax intra-frame branch and the linear inter-frame branch both stay active and are combined as a content-dependent mixture. More broadly, linear attention can be reinterpreted as a form of associative memory, and modern variants add gating and error-correction to keep long-context updates stable.
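
The associative-memory reading and the headwise gate can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the source's implementation: the function names, the identity feature map, the missing normalization, the optional `decay` factor standing in for state gating, the choice of which branch the gate weights, and the model dimensions (`d_model = 1536`, `n_heads = 12`), which were picked only because they give a headwise gate of roughly 18K weights.

```python
import math


def linear_attention_recurrent(qs, ks, vs, decay=1.0):
    """Causal linear attention viewed as an associative memory.

    The state S (d_k x d_v) is written with outer products k_t v_t^T and
    read with the query: o_t = S_t^T q_t. Cost is linear in sequence length.
    `decay` is a hypothetical scalar gate on the state (1.0 means no gating).
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = decay * S[i][j] + k[i] * v[j]  # write step
        # read step: project the memory onto the query
        outputs.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def linear_attention_quadratic(qs, ks, vs):
    """Same output computed attention-style: o_t = sum_{s<=t} (q_t . k_s) v_s.

    Cost is quadratic in sequence length; matches the recurrence when decay=1.
    """
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, ks[s]))
            out = [o + score * x for o, x in zip(out, vs[s])]
        outputs.append(out)
    return outputs


def headwise_gate_mix(softmax_out, linear_out, gate_logit):
    """Blend the two branch outputs of one head with a single scalar gate.

    Assumption: g weights the softmax branch and (1 - g) the linear branch.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * a + (1.0 - g) * b for a, b in zip(softmax_out, linear_out)]


# Hypothetical dimensions, not stated in the source.
d_model, n_heads = 1536, 12
d_head = d_model // n_heads
headwise_params = d_model * n_heads              # one gate value per head
elementwise_params = d_model * n_heads * d_head  # one gate value per channel

if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ks = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
    vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(linear_attention_recurrent(qs, ks, vs))
    print(headwise_params, elementwise_params // headwise_params)
```

Under these assumed dimensions the headwise gate has 18,432 weights and the elementwise gate is 128 times larger, which is in the neighborhood of the 18K and 130× figures quoted above. A gate logit of 0 gives g = 0.5, the equal-weight mixture that the learned gate values stay close to.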