Expansion Initialization
1 sources - 5 claims
At each expansion boundary, existing experts are retained, while the new experts and their router rows are initialized from a Gaussian. Optimizer state is reset at expansion, because adding expert rows changes the parameter shapes and the existing Adam moment buffers would no longer match in dimension. Carrying optimizer state across the expansion gave only a marginal benefit, and that benefit disappeared after about 500 warmup steps. Initialization choices affected the transient spike at expansion more than the final loss. Copying from old checkpoints produced the largest initial spike but stabilized fastest.
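The expansion step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the implementation behind these results: the function names `expand_experts` and `fresh_adam_state`, the nested-list weight layout, and the standard deviation of 0.02 are all assumptions made for the example.

```python
import random


def expand_experts(experts, router, n_new, std=0.02, seed=0):
    """Grow a mixture-of-experts layer by n_new experts (hypothetical helper).

    experts: list of weight matrices, one per expert, each a list of rows.
    router:  list of rows, one row of length d_model per expert.
    Old entries are kept unchanged; new ones are drawn from N(0, std^2).
    The value std=0.02 is an assumption, not a value from the source.
    """
    rng = random.Random(seed)
    d_out, d_in = len(experts[0]), len(experts[0][0])
    d_model = len(router[0])

    def gaussian(rows, cols):
        return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

    # Retain the old experts and router rows, append Gaussian-initialized ones.
    new_experts = experts + [gaussian(d_out, d_in) for _ in range(n_new)]
    new_router = router + gaussian(n_new, d_model)
    return new_experts, new_router


def fresh_adam_state(experts, router):
    """Build zeroed Adam moments sized for the expanded parameters.

    Starting from zeros (and step 0) sidesteps the shape mismatch that the
    old moment buffers would have against the enlarged router and expert list.
    """
    def zeros_like(mat):
        return [[0.0] * len(mat[0]) for _ in mat]

    return {
        "step": 0,
        "m": {"experts": [zeros_like(e) for e in experts], "router": zeros_like(router)},
        "v": {"experts": [zeros_like(e) for e in experts], "router": zeros_like(router)},
    }


# Usage: grow a 2-expert layer to 4 experts, then reset the optimizer state.
experts = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
router = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
experts, router = expand_experts(experts, router, n_new=2)
adam_state = fresh_adam_state(experts, router)
```

Resetting rather than padding the moment buffers keeps the sketch simple; given that carried-over state reportedly stopped helping after about 500 warmup steps, the simpler reset seems a reasonable default.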