Distillation
The Attention Recovery Rate metric provides a principled, data-driven mechanism for layer selection without exhaustive combinatorial search. ARL2's distillation approach requires far less compute than retraining from scratch, unlike system…
1 sources - 5 claims
The Attention Recovery Rate metric provides a principled, data-driven mechanism for layer selection without exhaustive combinatorial search. ARL2's distillation approach requires far less compute than retraining from scratch, unlike systems such as SANA-Video that require large-scale retraining. The full two-stage distillation pipeline consumes approximately 156 H100 GPU-hours. The sensitivity-guided layer selection framework protects Hybrid-Sensitive layers from replacement to avoid irreversible imaging quality degradation, while allowing Hybrid-Recoverable gaps to close in joint distillation. ARL2 is converted from a pretrained softmax model through a progressive two-stage distillation pipeline that trains fewer than 2% of backbone parameters.