DeiT Inference

Model-scale experiments show throughput gains across DeiT-Tiny, DeiT-Small, and DeiT-Base. The end-to-end pipeline leaves layers 1 through 4 unpruned, applies pruning, packs the surviving tokens, and runs layers 5 through 12 with ragged attention and MLP computation. At 90% pruning, the measured 2.13x speedup remains below the 2.66x theoretical ceiling because not all model components benefit equally from token reduction: layers 1 through 4 still process every token, and MLP computation does not fully benefit from the reduced token count. The system is pruning-method agnostic as long as the pruning method outputs a per-token keep mask.
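
The keep-mask interface and the packing step can be illustrated with a minimal sketch. This is not the system's implementation. The names pack_tokens and offsets are hypothetical, and plain Python lists stand in for the GPU tensors a real pipeline would use. It assumes that each image may keep a different number of tokens, so the survivors are concatenated into one flat buffer with per-image offsets marking the ragged boundaries.

```python
from typing import List, Sequence, Tuple

Token = List[float]  # one token embedding (illustrative stand-in for a tensor row)


def pack_tokens(
    batch: Sequence[Sequence[Token]],
    keep_mask: Sequence[Sequence[bool]],
) -> Tuple[List[Token], List[int]]:
    """Pack the kept tokens of every sequence into one flat list.

    Returns (packed, offsets), where sequence i owns
    packed[offsets[i]:offsets[i + 1]]. Hypothetical helper, not the
    source system's API.
    """
    if len(batch) != len(keep_mask):
        raise ValueError("batch and keep_mask must have the same length")
    packed: List[Token] = []
    offsets: List[int] = [0]
    for tokens, mask in zip(batch, keep_mask):
        if len(tokens) != len(mask):
            raise ValueError("each mask must match its sequence length")
        # Keep only tokens whose mask entry is True, preserving order.
        packed.extend(tok for tok, keep in zip(tokens, mask) if keep)
        offsets.append(len(packed))
    return packed, offsets


# Two images with four tokens each; any pruning method could supply the mask.
batch = [
    [[1.0], [2.0], [3.0], [4.0]],
    [[5.0], [6.0], [7.0], [8.0]],
]
keep_mask = [
    [True, False, False, True],
    [False, True, False, False],
]
packed, offsets = pack_tokens(batch, keep_mask)
```

Because the sketch depends only on the boolean mask, any pruning criterion that produces one could be swapped in without changing the packing step. The ragged layers would then read each image's tokens from its own slice of the packed buffer rather than from a padded batch.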