Mathematical Reasoning

The mathematical-reasoning experiments trained Qwen3-4B-Instruct-2507 and Qwen3-8B, both without thinking mode, and evaluated them on AIME 2025. Math results are limited to the locally curated AceReason-Math-Subset rather than the unfiltered AceReason-Math. The subset was constructed by excluding every problem with an empirical pass rate of at least 75%.

On AIME 2025, both models reached the baseline's peak level earlier than the baseline did. At matched training steps, the 8B model gained 10.8 percentage points and the 4B model gained 7.3 percentage points. The most detailed mechanism audits came from the 4B mathematical-reasoning run.
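The subset filter and the two comparison metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline. The `Problem` record, the idea that a pass rate is the fraction of correct sampled attempts, and the function names `curate_subset`, `steps_to_baseline_peak`, and `same_step_gain` are all hypothetical. The source also does not say at which step the same-step gain is measured, so the step is left as a parameter. The accuracy curves in the demo are toy numbers, not reported results.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Hypothetical record: assumes each problem's pass rate is estimated from
# k sampled attempts scored correct/incorrect (the source does not specify).
@dataclass
class Problem:
    problem_id: str
    attempts_correct: List[bool]

    @property
    def pass_rate(self) -> float:
        # Empirical pass rate = fraction of sampled attempts that were correct.
        return sum(self.attempts_correct) / len(self.attempts_correct)


def curate_subset(problems: List[Problem], threshold: float = 0.75) -> List[Problem]:
    """Exclude problems whose empirical pass rate is at least `threshold`."""
    return [p for p in problems if p.pass_rate < threshold]


def steps_to_baseline_peak(
    baseline: Dict[int, float], treated: Dict[int, float]
) -> Optional[int]:
    """First checkpoint step at which the treated run matches the baseline's peak accuracy."""
    peak = max(baseline.values())
    for step in sorted(treated):
        if treated[step] >= peak:
            return step
    return None  # the treated run never reached the baseline peak


def same_step_gain(
    baseline: Dict[int, float], treated: Dict[int, float], step: int
) -> float:
    """Accuracy difference, in percentage points, at one shared checkpoint step."""
    return 100.0 * (treated[step] - baseline[step])


if __name__ == "__main__":
    problems = [
        Problem("p1", [True, True, True, False]),    # pass rate 0.75 -> excluded
        Problem("p2", [True, False, False, False]),  # pass rate 0.25 -> kept
    ]
    print([p.problem_id for p in curate_subset(problems)])  # ['p2']

    # Toy accuracy curves keyed by training step (illustrative only).
    baseline = {100: 0.20, 200: 0.30, 300: 0.35}
    treated = {100: 0.28, 200: 0.36, 300: 0.40}
    print(steps_to_baseline_peak(baseline, treated))          # 200
    print(round(same_step_gain(baseline, treated, 200), 1))   # 6.0
```

Note that the filter uses a strict `<` comparison, so a problem at exactly 75% is excluded, matching "at least 75%".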