MoE Efficiency Paradox
1 sources - 4 claims
The MoE (mixture-of-experts) efficiency paradox is the mismatch between nearly constant theoretical FLOPs and worsening real training efficiency as the expert count grows. Increasing the expert pool from 8 to 128 raised wall-clock step time by 1.08x for the 1.1B activated-parameter model and by 1.72x for the 4B activated-parameter model. Step time rises with expert count because communication, storage, routing overhead, and small expert computations scale with the full expert pool rather than with the activated subset. MoE efficiency should therefore be evaluated using wall-clock costs tied to total expert count, not only activated FLOPs.
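
As a rough illustration of what those step-time ratios mean for delivered training throughput, the sketch below converts each ratio into throughput relative to the 8-expert baseline. It assumes activated FLOPs per step are exactly unchanged (flops_ratio = 1.0), which is a simplification of "nearly constant"; the function and variable names are hypothetical and not taken from the source.

```python
# Step-time multipliers quoted in the text for growing the pool from 8 to 128 experts.
STEP_TIME_RATIO = {"1.1B activated": 1.08, "4B activated": 1.72}


def relative_throughput(step_time_ratio: float, flops_ratio: float = 1.0) -> float:
    """Useful FLOPs per second relative to the 8-expert baseline.

    flops_ratio is the per-step activated-FLOP multiplier. The default of 1.0
    is an assumption standing in for "nearly constant" in the text.
    """
    if step_time_ratio <= 0:
        raise ValueError("step_time_ratio must be positive")
    return flops_ratio / step_time_ratio


def summarize() -> dict:
    """Relative throughput per model, rounded to three decimals."""
    return {
        name: round(relative_throughput(ratio), 3)
        for name, ratio in STEP_TIME_RATIO.items()
    }


if __name__ == "__main__":
    for name, rel in summarize().items():
        loss_pct = (1 - rel) * 100
        print(f"{name}: {rel:.3f}x baseline throughput ({loss_pct:.1f}% lower)")
```

Under that assumption, the 1.1B model keeps roughly 0.93x of its baseline throughput while the 4B model drops to roughly 0.58x, even though a FLOP count alone would report no change.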