EMO
EMO is a progressive MoE pretraining framework that grows the number of experts during training. It uses five stages that double the expert pool from 8 to 128 experts (8, 16, 32, 64, 128), which delays expensive large-expert configurations until later in training. EMO treats expert capacity as expandable memory rather than requiring the full expert pool from the start. It does not require new routing architectures, nonstandard expert layers, or extra objectives beyond the existing load-balancing objective.
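The five-stage doubling schedule can be illustrated with a small Python sketch. This is not EMO's implementation: the function names `expert_schedule` and `experts_at_step` are hypothetical, and the assumption that every stage lasts the same number of training steps is made only for illustration, since the stage lengths are not stated here.

```python
def expert_schedule(start=8, end=128):
    """Return the expert-pool size for each stage, doubling from start to end."""
    sizes = [start]
    while sizes[-1] < end:
        sizes.append(sizes[-1] * 2)
    return sizes


def experts_at_step(step, total_steps, schedule):
    """Map a training step to an expert-pool size.

    ASSUMPTION: stages are of equal length. The real stage boundaries
    are not given in the text above.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = step * len(schedule) // total_steps
    return schedule[stage]


schedule = expert_schedule()  # five stages: 8, 16, 32, 64, 128
early = experts_at_step(0, 1000, schedule)    # first stage, smallest pool
late = experts_at_step(999, 1000, schedule)   # last stage, full pool
```

Under the equal-length assumption, the largest and most expensive configuration (128 experts) is active only for the final fifth of training, which is the sense in which the large-expert cost is delayed.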