Reduced KL Loss
Reduced KL Loss is the central training objective for CATS adapters. Instead of distilling over the full vocabulary, it selects the top-K tokens by target-model probability and computes cross-entropy only on that support. Full-vocabulary distillation spends supervision on low-probability tokens that rarely affect speculative acceptance; restricting the loss to the top-K tokens concentrates adapter capacity on the tokens likely to be accepted during speculation. This focus improves compact draft adapters and shallow-verifier adapters under memory constraints.
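The selection-then-cross-entropy step can be illustrated with a minimal sketch in plain Python. The function names (`log_softmax`, `reduced_kl_loss`) and the per-position list-of-logits interface are hypothetical, not taken from the CATS source. The sketch also assumes that both distributions are renormalised over the top-K support, which the text above does not specify.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reduced_kl_loss(target_logits, draft_logits, k):
    """Cross-entropy between target and draft, restricted to the top-k target tokens.

    Hypothetical helper for a single token position.
    Assumption: both distributions are renormalised over the selected support.
    """
    if len(target_logits) != len(draft_logits):
        raise ValueError("target and draft must share one vocabulary")
    if not 1 <= k <= len(target_logits):
        raise ValueError("k must be between 1 and the vocabulary size")

    # Ranking by logit gives the same order as ranking by probability.
    support = sorted(
        range(len(target_logits)),
        key=lambda i: target_logits[i],
        reverse=True,
    )[:k]

    # Restrict both models to the support, then renormalise (assumption).
    target_logp = log_softmax([target_logits[i] for i in support])
    draft_logp = log_softmax([draft_logits[i] for i in support])

    # Cross-entropy H(p, q) = -sum_i p_i * log q_i over the support only.
    return -sum(math.exp(t) * d for t, d in zip(target_logp, draft_logp))


if __name__ == "__main__":
    target = [2.0, 1.0, 0.0, -3.0]
    draft = [0.0, 3.0, 0.5, -1.0]
    print(reduced_kl_loss(target, draft, k=2))
```

Under these assumptions, draft logits outside the top-K support have no effect on the loss, which is the sense in which supervision is not spent on low-probability tokens. On a fixed support the cross-entropy differs from the KL divergence only by the target entropy, which does not depend on the draft, so minimising either gives the same gradients for the adapter.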