Vision Transformer Token Pruning

2 sources - 10 claims

Token pruning methods reduce theoretical attention cost by removing less informative image patch tokens after the early transformer layers. Because attention cost scales quadratically with token count, pruning 80% of the tokens on DeiT-Base leaves 20% of them and roughly 0.2^2 = 4% of the attention FLOPs, a reduction of about 96%.

These theoretical savings do not automatically become latency savings. Padding variable-length pruned batches can prevent FLOP reductions from translating into lower latency on GPUs, and padded PyTorch execution is reported to be slower than unpruned inference across pruning ratios. The study argues that pruning speedups in current ViT pipelines may come more from reduced MLP work than from reduced attention work.

How tokens are selected also matters. The article argues that attention or token-merging heuristics may discard low-attention tokens that remain important, and Fisher sensitivity is described as better aligned with semantic and structural reliability than attention-only pruning. Fed-FSTQ forms a Top-K token mask from EMA (exponential moving average) sensitivity scores. It does not require explicit token-to-parameter mappings because it uses a token-filtered training objective: token-filtered training couples adapter-coordinate Fisher estimates to the selected token losses through gradients.
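The arithmetic and the masking step in these claims can be illustrated with a minimal, dependency-free Python sketch. It is not the Fed-FSTQ implementation: the function names, the EMA decay of 0.9, and the example batch lengths are all assumptions made for illustration, and the FLOP estimate covers only the quadratic part of attention.

```python
from typing import List


def attention_flop_fraction(keep_ratio: float) -> float:
    """Fraction of quadratic attention cost left when keep_ratio of tokens survive."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    return keep_ratio ** 2


def update_ema(ema: List[float], scores: List[float], decay: float = 0.9) -> List[float]:
    """One EMA step over per-token sensitivity scores (the decay value is an assumption)."""
    return [decay * e + (1.0 - decay) * s for e, s in zip(ema, scores)]


def topk_mask(scores: List[float], k: int) -> List[bool]:
    """Boolean mask that keeps the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


def padding_fraction(lengths: List[int]) -> float:
    """Share of slots that are padding when a batch is padded to its longest sequence."""
    longest = max(lengths)
    return 1.0 - sum(lengths) / (longest * len(lengths))


if __name__ == "__main__":
    # Pruning 80% of tokens keeps 20%, so about 4% of the quadratic cost remains.
    reduction = 1.0 - attention_flop_fraction(0.2)
    print(f"attention FLOP reduction: {reduction:.0%}")  # 96%

    # Hypothetical sensitivity scores for four tokens; keep the top two.
    ema = update_ema([0.0, 0.0, 0.0, 0.0], [0.1, 0.9, 0.5, 0.3])
    print(topk_mask(ema, k=2))

    # Hypothetical per-image token counts after pruning; padding restores wasted slots.
    print(padding_fraction([40, 120, 60, 197]))
```

The last helper hints at why padded execution can erase the savings: every sequence in the batch is processed at the length of the longest one, so the padded slots still cost compute even though they carry no tokens.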