Triton Kernel

1 source - 5 claims

The throughput of the Triton ragged pipeline improves monotonically as the pruning ratio increases. At 50% pruning on DeiT-Base, the pipeline is 2.04x to 2.24x faster than padded SDPA, depending on batch size. The Triton kernel has a lower dispatch floor, roughly 0.040 ms in the isolated attention benchmarks, because its JIT launch path is lighter than the FlashAttention-2 varlen API path; this is the main reason it benefits short, pruned ViT attention. The advantage does not carry over to compute-dominated workloads: at batch size 64 without pruning, the Triton kernel loses to FlashAttention-2.
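The trade-off between dispatch floor and compute cost can be illustrated with a toy latency model. This is a minimal sketch, not a reproduction of the benchmarks: it assumes latency is a fixed per-call overhead plus a cost proportional to the amount of attention work. Only the 0.040 ms Triton floor comes from the text above. The FlashAttention-2 floor, both per-unit costs, the abstract "work units", and all names (`KernelModel`, `faster`, `crossover_work`) are hypothetical and chosen only to show the shape of the argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelModel:
    """Toy additive latency model: fixed dispatch overhead + linear compute cost."""
    name: str
    dispatch_floor_ms: float   # fixed per-call overhead
    ms_per_unit_work: float    # HYPOTHETICAL compute cost per abstract unit of work

    def latency_ms(self, work_units: float) -> float:
        return self.dispatch_floor_ms + self.ms_per_unit_work * work_units


# 0.040 ms is the floor quoted in the text; every other number is an
# illustrative assumption, not a measurement.
triton = KernelModel("triton", dispatch_floor_ms=0.040, ms_per_unit_work=0.0012)
fa2 = KernelModel("fa2_varlen", dispatch_floor_ms=0.080, ms_per_unit_work=0.0010)


def faster(work_units: float) -> str:
    """Name of the kernel with the lower modeled latency at this workload."""
    return min((triton, fa2), key=lambda k: k.latency_ms(work_units)).name


def crossover_work(a: KernelModel, b: KernelModel) -> float:
    """Workload at which both kernels have equal modeled latency."""
    return (b.dispatch_floor_ms - a.dispatch_floor_ms) / (
        a.ms_per_unit_work - b.ms_per_unit_work
    )
```

Under these assumed numbers, the kernel with the lower floor wins for small workloads (short, pruned sequences) and the kernel with the lower per-unit cost wins once the workload passes the crossover point, which mirrors the qualitative pattern described above.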