Layer-Aligned Distillation
Layer-aligned distillation trains each student layer to match a mapped teacher layer. Because every layer is given its own target, standard layer-aligned distilled models tend to keep transforming representations from layer to layer rather than stabilizing before the final layer. This suppresses the cross-layer redundancy that convergence-based early exit depends on. In the controlled comparison, the MiniLM-L12 baseline produced no early exits by layer 7, even though it scored higher than LEAP on STS-B. The article attributes this early-exit incompatibility specifically to intermediate-layer alignment rather than to distillation in general.
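To make the two mechanisms concrete, here is a minimal sketch in plain Python. It is illustrative only: the article does not specify the layer mapping, the per-layer loss, or the exit criterion, so the uniform mapping, the mean-squared-error objective, the cosine-similarity convergence test, and the 0.99 threshold are all assumptions, as are the function names. The sketch also assumes student and teacher hidden states have the same width.

```python
import math


def uniform_layer_map(num_student_layers, num_teacher_layers):
    """Map each student layer (1-based) to a teacher layer at a fixed stride.

    Assumed mapping for illustration; the article does not state which one is used.
    """
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    stride = num_teacher_layers // num_student_layers
    return {s: s * stride for s in range(1, num_student_layers + 1)}


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layer_aligned_loss(student_states, teacher_states, layer_map):
    """Sum of per-layer MSE between each student layer and its mapped teacher layer.

    Both state arguments are dicts of layer index -> hidden vector.
    Assumes matching hidden sizes (no projection layer).
    """
    return sum(
        mse(student_states[s], teacher_states[t]) for s, t in layer_map.items()
    )


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def first_exit_layer(states, threshold=0.99):
    """Return the first layer (1-based) whose representation has converged.

    A layer counts as converged when its cosine similarity to the previous
    layer reaches the threshold (an assumed criterion). Returns None when
    the representation never stabilizes, i.e. no early exit is possible.
    """
    for i in range(1, len(states)):
        if cosine(states[i - 1], states[i]) >= threshold:
            return i + 1
    return None
```

Under these assumptions, a model whose representation keeps changing at every layer never triggers the exit test, while one that settles does: `first_exit_layer([[1, 0], [0, 1], [1, 0]])` returns `None`, whereas `first_exit_layer([[1, 0], [1, 1], [1, 1]])` returns `3`. Driving every student layer toward a different teacher target, as `layer_aligned_loss` does, rewards the first kind of behavior.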