Speculative Completion

2 sources - 11 claims

Speculative completion improved validation throughput only when the prefix threshold was suitable. Speculative stage completion allows workers to release partial output after a configured fraction of requests finishes. In evaluation, candidates can enter the pool speculatively if their partial score exceeds the current pool score. Speculative artifacts are confirmed only after full evaluation satisfies the acceptance condition; otherwise, they are removed. Validation-set reordering moves samples that pass for three consecutive rounds out of the speculative prefix.

Speculative decoding improves throughput by drafting multiple candidate tokens and verifying them with a single target-model pass. CATS can combine with EAGLE-style tree branching without requiring extra target-model forward passes. Classical speculative decoding assumes enough memory for both the target and draft models, an assumption that fails on edge devices. Self-speculative methods reduce auxiliary-model pressure but may require adapters and have limited shallow drafting capacity. Auxiliary speculative decoding methods are penalized under memory limits because the draft model adds capacity pressure and transfer traffic. A higher par…
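
The draft-and-verify idea behind speculative decoding can be illustrated with a minimal sketch. It assumes greedy decoding and uses toy stand-in "models" (plain functions from a token prefix to the next token); the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical and do not come from the sources. In a real transformer the verification step would be one batched target-model forward pass over all drafted positions; the list comprehension below only simulates that pass.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[Token]], Token]


def speculative_step(prefix: Sequence[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(list(prefix) + drafted))

    # 2. Verify: collect the target's greedy choice at each of the k + 1
    #    positions. A real system obtains these from a single forward pass;
    #    this loop is only a simulation of that pass.
    target_choice = [target(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept the longest run of drafted tokens that matches the target,
    #    then append the target's own token at the first mismatch (or the
    #    extra token after the last draft position if everything matched).
    accepted: List[Token] = []
    for i in range(k):
        if drafted[i] != target_choice[i]:
            break
        accepted.append(drafted[i])
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate(prompt: Sequence[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[Token]:
    """Generate max_new tokens by repeating speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prompt) + max_new]


# Toy models (assumptions for illustration only).
def toy_target(prefix: Sequence[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: Sequence[Token]) -> Token:
    # Agrees with the target except right after token 4.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=3, max_new=8))
```

Under greedy decoding, every round emits at least one target-approved token, so the output matches what the target alone would produce; the speedup depends on how often the draft agrees with the target. Note that this sketch keeps both models available at once, which is the memory assumption the claims above flag as a problem on edge devices.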