LLM Decode

1 source - 4 claims

LLM decode is dominated by sequential matrix-vector work and by memory traffic from weights, the KV cache, or state. Roofline analysis accordingly placed the decode kernels of every tested architecture deep in the memory-bound region. Tensor cores sat mostly idle during decode because most of the time went to loading data from HBM. Batching improves decode energy efficiency by amortizing weight loads across the requests in a batch, but it does not remove decode's low arithmetic intensity.
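
The roofline argument can be illustrated with a rough back-of-the-envelope sketch. Everything concrete here is an assumption for illustration, not a figure from the source: the layer size (4096 x 4096), 2-byte elements, and the accelerator's peak throughput and HBM bandwidth are hypothetical, and the helper names are made up. The model counts only weight and activation traffic for a single linear layer.

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W, with x of shape (batch, d_in)."""
    # One multiply and one add per weight, per row of the batch.
    flops = 2 * batch * d_in * d_out
    # Weights are read once per step, regardless of batch size.
    weight_bytes = d_in * d_out * bytes_per_elem
    # Input and output activations scale with the batch.
    activation_bytes = batch * (d_in + d_out) * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """A kernel is memory-bound when it sits left of the roofline ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs per byte
    return intensity < ridge_point


# Hypothetical accelerator, chosen only to make the arithmetic concrete.
PEAK_FLOPS = 1.0e15      # assumed peak FLOP/s
MEM_BANDWIDTH = 3.0e12   # assumed HBM bytes/s

for batch in (1, 8, 64):
    ai = arithmetic_intensity(batch, d_in=4096, d_out=4096)
    bound = is_memory_bound(ai, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"batch={batch:3d}  intensity={ai:7.2f} FLOP/byte  memory_bound={bound}")
```

Under these assumptions, a batch of one performs roughly one FLOP per byte loaded, far below the ridge point, and intensity grows with batch size because the same weights serve every row. The sketch leaves out KV cache or state traffic, which grows with the number of sequences rather than being shared, so it overstates how much batching alone can help.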