Latency

On an NVIDIA L4 at batch size 1, early exit reduced latency from 8.46 ms to 5.25 ms. Measured wall-clock speedup was lower than theoretical layer reduction, especially at larger batch sizes. Throughput-oriented offline workloads may prefer…

1 sources - 5 claims

On an NVIDIA L4 at batch size 1, early exit reduced latency from 8.46 ms to 5.25 ms. Measured wall-clock speedup was lower than theoretical layer reduction, especially at larger batch sizes. Throughput-oriented offline workloads may prefer a shallow student model over LEAP early exit. The speedup diminished at batch size 32 because GPU parallelism already amortized layer cost. LEAP is positioned as most useful for latency-sensitive embedding services that plan to use early exit.