EMO

EMO is a progressive Mixture-of-Experts (MoE) pretraining framework that grows the number of experts during training. Rather than requiring the full expert pool from the start, it treats expert capacity as expandable memory: training proceeds through five stages, and the expert pool doubles at each transition, from 8 to 128 experts (8, 16, 32, 64, 128). This delays the expensive large-expert configurations until later in training. EMO does not require new routing architectures, nonstandard expert layers, or extra objectives beyond the existing load-balancing objective.
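
The doubling schedule can be made concrete with a short Python sketch. It only reproduces the arithmetic stated in the summary (five stages, 8 to 128 experts). The function names `expert_schedule` and `experts_at_step`, and the choice of equal-length stages, are assumptions for illustration; the summary does not say when EMO switches stages or how new experts are initialized.

```python
def expert_schedule(start: int = 8, end: int = 128) -> list[int]:
    """Return expert-pool sizes, doubling from `start` until `end` is reached."""
    if start <= 0 or end < start:
        raise ValueError("need 0 < start <= end")
    sizes = [start]
    while sizes[-1] < end:
        sizes.append(sizes[-1] * 2)
    if sizes[-1] != end:
        raise ValueError("end must equal start times a power of two")
    return sizes


def experts_at_step(step: int, total_steps: int, sizes: list[int]) -> int:
    """Map a training step to an expert count.

    ASSUMPTION: stages have equal length. The summary does not specify
    the actual stage boundaries, so this split is purely illustrative.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = min(step * len(sizes) // total_steps, len(sizes) - 1)
    return sizes[stage]


schedule = expert_schedule()
print(schedule)                                # [8, 16, 32, 64, 128]
print(experts_at_step(0, 1000, schedule))      # 8
print(experts_at_step(999, 1000, schedule))    # 128
```

Under the equal-length assumption, the 128-expert configuration is active only for the final fifth of training, which illustrates what it means to delay the most expensive configuration until late in the run.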