Dispatch Overhead

1 sources - 5 claims

At DeiT sequence lengths, matrix arithmetic can finish in single-digit microseconds while high-level variable-length APIs can take tens of microseconds to launch. The paper estimates Triton's lower launch path reduces the dispatch floor from about 62 microseconds to about 40 microseconds. The main hypothesis is that pruned ViT attention is short enough for host-side dispatch to dominate latency instead of GPU arithmetic. Fixed costs such as Python argument validation, pybind11 binding, allocations, and CUDA launch dispatch can outweigh actual attention computation at ViT-scale sequence lengths. The discussion attributes the observed results to dispatch overhead rather than a different attention algorithm.