Benchmarking Methodology
The evaluation combines isolated attention-kernel benchmarks with complete pruned-ViT inference pipelines. The primary isolated-kernel setting uses DeiT-Base with 12 attention heads and head dimension 64. All timings are measured on an NVIDIA A100-SXM4-40GB using PyTorch 2.8, Triton 3.4, and flash-attn v2.7. Input tokens are real ImageNet features extracted through the first four DeiT layers, after which Threshold-L2 pruning is applied. The study recommends benchmarking token pruning against ragged variable-length execution and reporting attention latency and MLP latency separately.
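To make "ragged variable-length execution" concrete, the sketch below prunes each image's tokens independently and packs the survivors back to back without padding, recording cumulative offsets in the style that variable-length attention kernels typically consume. It is an illustration only, not the study's code. It assumes that Threshold-L2 keeps tokens whose L2 norm is at least a fixed threshold, which the text does not spell out, and the names `threshold_l2_prune`, `pack_ragged`, and `cu_seqlens` are hypothetical. Plain Python lists stand in for GPU tensors.

```python
import math
from itertools import accumulate


def l2_norm(token):
    """Euclidean norm of one token's feature vector."""
    return math.sqrt(sum(x * x for x in token))


def threshold_l2_prune(tokens, threshold):
    """Keep tokens whose L2 norm is at least `threshold`.

    ASSUMPTION: this is one plausible reading of "Threshold-L2";
    the source does not define the exact rule.
    """
    return [t for t in tokens if l2_norm(t) >= threshold]


def pack_ragged(batch, threshold):
    """Prune each image independently, then pack survivors without padding.

    Returns (packed, cu_seqlens), where image i owns
    packed[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    kept = [threshold_l2_prune(tokens, threshold) for tokens in batch]
    packed = [t for tokens in kept for t in tokens]
    cu_seqlens = [0] + list(accumulate(len(tokens) for tokens in kept))
    return packed, cu_seqlens


# Toy batch of two "images" with 2-dimensional token features.
batch = [
    [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0]],  # norms: 5.0, about 0.14, 1.0
    [[0.0, 0.2], [6.0, 8.0]],              # norms: 0.2, 10.0
]
packed, cu_seqlens = pack_ragged(batch, threshold=1.0)
print(cu_seqlens)  # [0, 2, 3]
```

Because each image keeps a different number of tokens, a padded baseline would still pay for the longest sequence in the batch. That is presumably why the study recommends comparing pruning against the ragged layout rather than against padded execution alone.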