Pretraining Performance
EMO reduced GPU hours by 10 percent while approaching the quality of the fixed E=128 baseline. It still trailed that baseline by 0.023 in absolute loss, a relative gap of 2.3 percent. In the main 1.92T-token experiment, EMO reached a final pretraining loss of 1.017. Loss spikes induced by expansion recovered within roughly 10,000 steps. The final EMO model did not fully match fixed E=128 across all pretraining and downstream results.
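
The absolute and relative gaps quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on three assumptions the text does not state outright: that the 0.023 gap applies to the same run as the 1.017 final loss, that "behind" means EMO's loss is the higher one, and that the relative gap is measured against the baseline loss. The implied baseline value is derived here, not reported in the source.

```python
# Values quoted in the text.
emo_final_loss = 1.017   # EMO final pretraining loss (main 1.92T-token run)
absolute_gap = 0.023     # how far EMO trails fixed E=128, in absolute loss

# Assumption: EMO's loss is higher, so the baseline loss is lower by the gap.
implied_baseline_loss = emo_final_loss - absolute_gap

# Assumption: the relative gap uses the baseline loss as the denominator.
relative_gap_pct = 100.0 * absolute_gap / implied_baseline_loss

print(f"implied fixed E=128 loss: {implied_baseline_loss:.3f}")
print(f"relative gap: {relative_gap_pct:.1f}%")
```

Under these assumptions the implied baseline loss is about 0.994 and the relative gap rounds to 2.3 percent, consistent with the quoted figure. Using EMO's own loss as the denominator instead gives a value that also rounds to 2.3 percent, so the check does not distinguish between the two conventions.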