Decoding Speed
The 1.5B Quadrivium model achieved a 57.9% wall-clock speedup while slightly improving perplexity relative to a dense baseline. Across width scaling at 24 layers, reported speedups ranged from 23.8% to 53.8% without meaningful perplexity d…
1 sources - 4 claims
The 1.5B Quadrivium model achieved a 57.9% wall-clock speedup while slightly improving perplexity relative to a dense baseline. Across width scaling at 24 layers, reported speedups ranged from 23.8% to 53.8% without meaningful perplexity degradation. Across fixed-width depth scaling, speedup generally increased with model depth, except the 6-layer model was slower. N-vium improves decoding latency by shifting work off the sequential critical path rather than reducing total per-token computation.