Mixture-of-Experts Transformers

MoE architectures can increase total parameter count without a proportional increase in theoretical per-token computation. Sparse MoE transformers replace dense feed-forward layers with multiple expert networks and route each token to only a small subset of them. With top-k routing, per-token FLOPs depend mainly on the k activated experts and their parameters, not on the size of the full expert pool. The output of the MoE layer is a weighted sum of the top-k selected experts' outputs, with normalized router weights as the coefficients.
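
The routing and combination steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the router is assumed to be a single linear map, each expert is assumed to be a two-matrix ReLU feed-forward network without biases, and the router weights are assumed to be normalized with a softmax over the selected experts only (some models instead apply softmax over all experts and then renormalize the top-k). All names (moe_forward, expert_param_counts, router_W, and so on) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def make_expert(d_model, d_ff, rng):
    """One feed-forward expert: two weight matrices (biases omitted by assumption)."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    return W1, W2

def expert_forward(expert, x):
    """Apply W1, then ReLU, then W2."""
    W1, W2 = expert
    h = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, h)

def moe_forward(x, router_W, experts, k):
    """Route one token x to its top-k experts and combine their outputs."""
    logits = matvec(router_W, x)  # one router score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])  # assumption: normalize over selected experts
    y = [0.0] * len(x)
    for w, i in zip(weights, top):
        out = expert_forward(experts[i], x)  # only the k selected experts are evaluated
        y = [a + w * b for a, b in zip(y, out)]
    return y, top, weights

def expert_param_counts(d_model, d_ff, num_experts, k):
    """Return (total, activated-per-token) expert parameters, ignoring the router."""
    per_expert = 2 * d_model * d_ff
    return num_experts * per_expert, k * per_expert

rng = random.Random(0)
d_model, d_ff, num_experts, k = 8, 16, 4, 2
experts = [make_expert(d_model, d_ff, rng) for _ in range(num_experts)]
router_W = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(num_experts)]
x = [rng.gauss(0, 1) for _ in range(d_model)]

y, top, weights = moe_forward(x, router_W, experts, k)
total, active = expert_param_counts(d_model, d_ff, num_experts, k)
```

With these toy sizes the expert pool holds 1024 parameters in total, while each token touches only 512 of them. Raising num_experts to 64 with k fixed at 2 grows the total to 16384 and leaves the activated count at 512, which is the sense in which per-token cost follows the activated experts rather than the full pool.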