Ragged Attention
1 source - 5 claims
Variable-length execution is presented as necessary for making pruning meaningful in practice, but only if dispatch overhead is low enough. The proposed ragged attention kernel is written in Triton and implements the FlashAttention-2 online softmax algorithm, specialized for bidirectional ViT inference. It represents a variable-length pruned batch as packed surviving tokens plus cumulative sequence lengths. At 39 tokens per image, a single 64-by-64 tile pair covers the full attention computation for one image and head. The kernel is one part of a three-component system, alongside token packing and end-to-end DeiT integration.
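The packed layout and the online softmax recurrence can be illustrated with a small pure-Python sketch. This is not the Triton kernel described above: it is a single-head, CPU-only illustration, and the names `pack_tokens`, `online_softmax_attention`, and `ragged_attention` are hypothetical. It assumes each token is a plain list of floats and that queries attend to every key of the same image, with no causal mask.

```python
import math


def pack_tokens(batch):
    """Concatenate per-image token lists into one packed list.

    Returns (packed, cu_seqlens), where cu_seqlens[b]:cu_seqlens[b + 1]
    is the slice of `packed` that belongs to image b.
    """
    packed, cu_seqlens = [], [0]
    for tokens in batch:
        packed.extend(tokens)
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return packed, cu_seqlens


def online_softmax_attention(q, keys, values, tile=64):
    """Attention output for one query, streaming keys in tiles.

    Keeps a running max `m`, a running normalizer `l`, and an
    unnormalized accumulator `acc`, rescaling both whenever the
    running max changes (the online softmax recurrence).
    """
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")
    l = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # 0.0 on the first tile
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]


def ragged_attention(packed_q, packed_k, packed_v, cu_seqlens, tile=64):
    """Bidirectional attention over a packed, variable-length batch."""
    out = []
    for b in range(len(cu_seqlens) - 1):
        lo, hi = cu_seqlens[b], cu_seqlens[b + 1]
        for i in range(lo, hi):
            # Each query sees only the keys of its own image.
            out.append(online_softmax_attention(
                packed_q[i], packed_k[lo:hi], packed_v[lo:hi], tile))
    return out


# Two images that kept 3 and 5 tokens after pruning (toy 2-d features).
batch = [
    [[0.1 * i, 0.2 * i - 0.3] for i in range(3)],
    [[0.05 * i, 1.0 - 0.1 * i] for i in range(5)],
]
packed, cu_seqlens = pack_tokens(batch)
outputs = ragged_attention(packed, packed, packed, cu_seqlens)
```

In this sketch, an image with 39 surviving tokens and `tile=64` passes through the key loop exactly once, which mirrors the single-tile case noted above. Smaller `tile` values exercise the rescaling step without changing the result.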