Layer-Aligned Distillation

1 source - 5 claims

Layer-aligned distillation trains each student layer to match a mapped teacher layer. Standard layer-aligned distilled models therefore tend to keep transforming representations from layer to layer rather than stabilizing before the final layer, which suppresses the redundancy that convergence-based early exit depends on. The article attributes early-exit incompatibility specifically to this intermediate-layer alignment, not to distillation in general. In the controlled comparison, the MiniLM-L12 baseline produced no early exits by layer 7 even though it scored higher on STS-B than LEAP.
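
The two mechanisms in tension here can be illustrated with a minimal sketch. It is not the article's implementation: the evenly spaced layer map, the MSE matching loss, the cosine-similarity exit test, the 0.99 threshold, and all function names are assumptions chosen for illustration. The first half shows a per-layer matching objective; the second shows an exit rule that fires only when consecutive layers stop changing the representation.

```python
import math

def uniform_layer_map(num_student_layers, num_teacher_layers):
    """Map each student layer (1-indexed) to an evenly spaced teacher layer.
    Even spacing is an assumed mapping scheme, used here for illustration."""
    return {i: (i * num_teacher_layers) // num_student_layers
            for i in range(1, num_student_layers + 1)}

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def layer_aligned_loss(student_states, teacher_states, layer_map):
    """Sum of per-layer MSE between each student layer and its mapped teacher layer."""
    return sum(mse(student_states[s - 1], teacher_states[t - 1])
               for s, t in layer_map.items())

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def first_exit_layer(states, threshold=0.99):
    """Return the first layer (1-indexed) whose output is nearly identical to
    the previous layer's output, or None if no layer converges.
    The cosine criterion and threshold are hypothetical."""
    for i in range(1, len(states)):
        if cosine(states[i - 1], states[i]) >= threshold:
            return i + 1
    return None

# Toy per-layer representations (2-dimensional, made up for illustration).
stabilizing = [[1.0, 0.0], [0.6, 0.8], [0.6, 0.8], [0.6, 0.8]]
transforming = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(first_exit_layer(stabilizing))   # representation settles, so an exit is found
print(first_exit_layer(transforming))  # every layer changes the output, so no exit
```

In this toy setup, a model whose layers each have to hit a distinct teacher target behaves like `transforming`: no pair of consecutive layers agrees, so a convergence-based rule never triggers, regardless of how good the final-layer output is.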