FlashAttention-2 Varlen

1 source - 5 claims

FlashAttention-2 varlen is optimized for long-context language-model workloads, where fixed overheads are amortized over many tokens. Its latency stays nearly flat, at about 0.062 to 0.063 ms, across many pruning ratios and batch sizes, so pruning tokens barely changes its cost. At ViT-scale short sequence lengths, FlashAttention-2 varlen is described as essentially all overhead at batch size 32. FlashAttention-2 does outperform the Triton kernel on large unpruned workloads where arithmetic dominates, and it remains preferable for long contexts, causal masking, KV-cache usage, and other compute-dominated cases.
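For context on what "varlen" means here: the variable-length entry point in the flash-attn package (`flash_attn_varlen_func`) takes sequences packed back to back, plus cumulative sequence-length offsets, instead of a padded batch. The sketch below shows one way those offsets could be built after token pruning leaves each sequence with a different length. The helper name `build_cu_seqlens` and the example lengths are hypothetical and do not come from the source, and the kernel call itself is left out because it needs a CUDA GPU.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return (cu_seqlens, max_seqlen) for a packed variable-length batch.

    cu_seqlens has len(seq_lens) + 1 entries; sequence i occupies rows
    cu_seqlens[i]:cu_seqlens[i + 1] of the packed token tensor.
    """
    if any(n < 0 for n in seq_lens):
        raise ValueError("sequence lengths must be non-negative")
    cu_seqlens = [0, *accumulate(seq_lens)]
    max_seqlen = max(seq_lens, default=0)
    return cu_seqlens, max_seqlen


# Hypothetical example: four sequences pruned to different lengths.
kept_tokens = [197, 148, 99, 50]
cu_seqlens, max_seqlen = build_cu_seqlens(kept_tokens)
print(cu_seqlens)  # [0, 197, 345, 444, 494]
print(max_seqlen)  # 197

# Before calling the kernel, cu_seqlens would be converted to an int32
# tensor on the same device as the packed q/k/v tensors.
```

Building these offsets is cheap bookkeeping, but it is work that does not shrink when tokens are pruned, which fits the observation above that short-sequence latency is dominated by fixed overhead.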