Out-of-Distribution Generalization
The largest gains are out of distribution (OOD): the biggest single-benchmark improvement is +45.9% on AIME25, for Qwen2.5-Math-1.5B trained on MATH. LZE improvements are consistent across model scales from 1.5B to 8B parameters and across both the Qwen2.5 and Qwen3 architecture families, which suggests broad applicability. The smaller models (1.5B and 1.7B) show especially large OOD gains; the explanation offered is that their learning zone is narrower, so the Energy Score's discriminative power is highest there. The OOD improvements are attributed to avoiding overfitting to the in-distribution difficulty profile: suppressing trivially solved and completely failed prompts prevents the policy from memorizing the training difficulty distribution. Learning-zone targeting may therefore be an important inductive bias for generalization in mathematical reasoning, distinct from and complementary to architectural choices.
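
The idea of suppressing trivially solved and completely failed prompts can be illustrated with a minimal sketch. It assumes binary per-rollout rewards and uses the empirical pass rate as a stand-in for difficulty. This is not the Energy Score, whose definition is not given here, and the function names and thresholds are hypothetical.

```python
from typing import Dict, List


def pass_rate(rewards: List[float]) -> float:
    """Fraction of sampled rollouts that earned a reward of 1."""
    return sum(rewards) / len(rewards)


def select_learning_zone(
    rollout_rewards: Dict[str, List[float]],
    low: float = 0.0,
    high: float = 1.0,
) -> List[str]:
    """Keep prompts whose pass rate lies strictly between `low` and `high`.

    Assumption: pass rate is a crude proxy for the learning zone.
    With the defaults, prompts that are always solved (rate 1.0) or
    never solved (rate 0.0) are dropped.
    """
    kept = []
    for prompt_id, rewards in rollout_rewards.items():
        if not rewards:
            continue  # no rollouts sampled, so nothing to estimate
        if low < pass_rate(rewards) < high:
            kept.append(prompt_id)
    return kept


# Hypothetical rollouts: four sampled answers per prompt, 1 = correct.
rollouts = {
    "easy": [1, 1, 1, 1],
    "mid": [1, 0, 0, 1],
    "hard": [0, 0, 0, 0],
}
selected = select_learning_zone(rollouts)
```

Here only the `mid` prompt survives the filter. A graded score would weight prompts rather than drop them outright, but the hard filter makes the selection effect easy to see.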