Evaluation Scope

1 source - 5 claims

The evaluation covered STS-B, QQP, and five BEIR retrieval datasets. Validation is therefore limited mainly to English sentence similarity, BEIR retrieval, and 12-layer backbones. Specialized domains may need held-out validation and threshold recalibration before deployment. The method is designed for embedding-level tasks rather than token-level ones; token-level tasks would need a different exit statistic or task-specific exit heads.
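As an illustration of what threshold recalibration on held-out data could look like, here is a minimal sketch. It is not the evaluated method's procedure. Everything in it is an assumption made for the example: the function name `recalibrate_threshold`, the convention that a higher exit score means a more confident early exit, and the use of a per-example fidelity value (for instance, cosine similarity between the early-exit embedding and the full-depth embedding) as the quality criterion.

```python
from typing import Optional, Sequence


def recalibrate_threshold(
    scores: Sequence[float],
    fidelity: Sequence[float],
    candidates: Sequence[float],
    min_fidelity: float,
) -> Optional[float]:
    """Pick the lowest candidate threshold that keeps early exits faithful.

    Hypothetical helper, not part of any library or of the source method.
    scores:   per-example exit statistic on held-out in-domain data
              (assumed: higher means more confident).
    fidelity: per-example agreement between the early-exit embedding and
              the full-depth embedding (assumed: higher is better).
    """
    if len(scores) != len(fidelity):
        raise ValueError("scores and fidelity must have the same length")

    # Try thresholds from most permissive to most conservative.
    for threshold in sorted(candidates):
        # Examples whose score clears the threshold would exit early.
        exited = [f for s, f in zip(scores, fidelity) if s >= threshold]
        if not exited:
            continue
        # Accept the first threshold whose exited set is faithful on average.
        if sum(exited) / len(exited) >= min_fidelity:
            return threshold

    # No candidate meets the target; the caller may fall back to full depth.
    return None
```

A call such as `recalibrate_threshold(scores, fidelity, [0.1, 0.4, 0.6, 0.8], min_fidelity=0.95)` would be run once per target domain on data not used for tuning. Returning `None` signals that no candidate meets the fidelity target, in which case running the full backbone is the conservative choice.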