Attention Architectures

1 source - 5 claims

The study compares GQA, MLA, GDN, and Mamba2 as distinct attention or attention-replacement paradigms. In the controlled Minitron-4B comparison, MLA stores a compressed latent cache in place of the larger GQA key-value cache. MLA and Mamba2 need higher optimal clocks at larger batch sizes, because their additional per-step work matters more as the batch grows. GQA and GQA-ctrl remain memory-bound across batch sizes, so a single low decode clock works well across the whole range. GDN tolerates aggressive underclocking because its decode path is extremely compute-light.
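The link between memory-boundness and tolerable clock speed can be illustrated with a simple roofline-style estimate. The sketch below is not the study's method. It assumes that a decode step takes the larger of its compute time and its memory-transfer time, and that memory bandwidth does not depend on the core clock. The function names, parameters, and all numbers are hypothetical and chosen only for illustration.

```python
def step_time(flops, bytes_moved, clock_ghz, flops_per_cycle, mem_bw_gbs):
    """Estimate one decode step's time (seconds) under a roofline-style model.

    Assumption: compute and memory traffic overlap perfectly, so the step
    takes as long as the slower of the two.
    """
    compute_s = flops / (flops_per_cycle * clock_ghz * 1e9)
    memory_s = bytes_moved / (mem_bw_gbs * 1e9)
    return max(compute_s, memory_s)


def lowest_free_clock(flops, bytes_moved, flops_per_cycle, mem_bw_gbs, clocks_ghz):
    """Return the lowest candidate clock at which the step is still memory-bound.

    Below that clock, compute time would exceed memory time and the step
    would slow down. If no candidate clock makes the step memory-bound,
    the highest candidate is returned.
    """
    memory_s = bytes_moved / (mem_bw_gbs * 1e9)
    for clock in sorted(clocks_ghz):
        compute_s = flops / (flops_per_cycle * clock * 1e9)
        if compute_s <= memory_s:
            return clock
    return max(clocks_ghz)


# Hypothetical hardware and workload numbers, for illustration only.
CLOCKS_GHZ = [0.5, 1.0, 1.5, 2.0]
FLOPS_PER_CYCLE = 1000
MEM_BW_GBS = 1000

# Compute-light step: memory-bound even at the lowest clock.
light = lowest_free_clock(2e8, 1e9, FLOPS_PER_CYCLE, MEM_BW_GBS, CLOCKS_GHZ)

# More per-step work for the same bytes moved: a higher clock is needed.
heavy = lowest_free_clock(1.2e9, 1e9, FLOPS_PER_CYCLE, MEM_BW_GBS, CLOCKS_GHZ)

print(light, heavy)
```

Under these assumptions, a step that moves many bytes per unit of compute can run at a low clock with no slowdown, while extra per-step compute pushes the lowest penalty-free clock upward. This matches the qualitative pattern described above.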