Sparsity-Aware Scaling Law

1 source - 5 claims

A dense-style scaling law cannot distinguish MoE models that share the same activated size but have different expert pools. EMO therefore uses a scaling law that explicitly includes expert count, and uses it to allocate tokens across expansion stages. At a fixed activated size, the fitted law estimates the compute-optimal cumulative token count for each expert count. The fitted law achieved high fit quality on training data and low held-out error. The resulting schedule placed expansion in a favorable quality-cost region, at around 45 percent of the token budget.
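
One way to read "allocate tokens across expansion stages" is sketched below. It is only an illustration under stated assumptions, not EMO's actual procedure: the function name `stage_token_allocation`, the rule that a stage ends once cumulative tokens reach that expert count's estimated optimum, and all numbers are hypothetical. The fitted law itself is not reproduced here; its per-expert-count estimates are simply passed in as a dictionary.

```python
def stage_token_allocation(cumulative_optimal, total_budget):
    """Split a token budget across expansion stages.

    cumulative_optimal: dict mapping expert count -> estimated
        compute-optimal cumulative token count (hypothetical inputs,
        standing in for the output of a fitted scaling law).
    total_budget: total number of training tokens available.

    Assumed rule (not taken from the source): train with a given
    expert count until cumulative tokens reach its estimate, then
    expand. The final stage runs until the budget is exhausted.
    """
    if total_budget <= 0:
        raise ValueError("total_budget must be positive")
    if not cumulative_optimal:
        raise ValueError("cumulative_optimal must not be empty")

    # Process stages from the smallest expert pool to the largest.
    stages = sorted(cumulative_optimal.items())
    allocation = []
    tokens_so_far = 0
    for index, (experts, optimal_cumulative) in enumerate(stages):
        is_last = index == len(stages) - 1
        # The last stage absorbs whatever remains of the budget;
        # earlier stages stop at their estimate, capped by the budget.
        boundary = total_budget if is_last else min(optimal_cumulative, total_budget)
        # Guard against non-monotone estimates: never move backwards.
        boundary = max(boundary, tokens_so_far)
        allocation.append((experts, boundary - tokens_so_far))
        tokens_so_far = boundary
    return allocation


# Hypothetical estimates, in arbitrary token units.
schedule = stage_token_allocation({8: 20, 16: 50, 32: 120}, total_budget=100)
for experts, tokens in schedule:
    print(f"{experts} experts: {tokens} tokens")
```

In this sketch the per-stage allocations always sum to the total budget, and the position of each expansion point as a fraction of the budget falls out of the cumulative estimates rather than being set by hand.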