Request-Level Energy

1 sources - 5 claims

Mamba2 achieved constant per-step decode latency regardless of context length and a large energy advantage over GQA at large batch and long context. At production-like batch size 32, MLA became the cheapest architecture from nearly the first output token due to decode efficiency. Novel attention replacements can spend more energy in prefill but later recover it through more efficient decode. At short context, MLA was worse than GQA-ctrl because weight loading dominated HBM traffic and decompression added overhead. GDN and Mamba2 used about an order of magnitude more prefill energy per token than transformer baselines in the tested vLLM setup.