Mixture Objective
Training optimizes the cross-entropy of the mixture distribution directly. Early in training, router warmup prevents collapse by encouraging uniform expected exit usage. The method also adds an expected normalized depth penalty, weighted by beta, to encourage efficient exits. With beta set to zero, N-vium reduces to next-token optimization over the mixture with no compute penalty.
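The pieces of this objective can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the method's actual implementation: the function names are hypothetical, the normalized depth of exit k (1-based) out of K exits is assumed to be k / K, and the warmup term is assumed to be a KL divergence between batch-averaged exit usage and the uniform distribution. The source does not specify either form.

```python
import math


def mixture_nll(exit_probs, router_weights, target):
    """Cross-entropy of the mixture: -log sum_k w_k * p_k(target)."""
    p_mix = sum(w * p[target] for w, p in zip(router_weights, exit_probs))
    return -math.log(p_mix)


def expected_normalized_depth(router_weights):
    """Expected exit depth in (0, 1]. Assumption: exit k (1-based) costs k / K."""
    num_exits = len(router_weights)
    return sum(w * (k + 1) / num_exits for k, w in enumerate(router_weights))


def warmup_penalty(batch_router_weights):
    """Assumed warmup term: KL(mean exit usage || uniform).

    It is zero exactly when expected exit usage over the batch is uniform.
    """
    num_exits = len(batch_router_weights[0])
    n = len(batch_router_weights)
    usage = [sum(row[k] for row in batch_router_weights) / n
             for k in range(num_exits)]
    return sum(u * math.log(u * num_exits) for u in usage if u > 0)


def token_loss(exit_probs, router_weights, target, beta):
    """Mixture cross-entropy plus beta times the expected normalized depth."""
    nll = mixture_nll(exit_probs, router_weights, target)
    return nll + beta * expected_normalized_depth(router_weights)


# Two exits over a two-token vocabulary; the router splits mass evenly.
exit_probs = [[0.5, 0.5], [0.1, 0.9]]
router_weights = [0.5, 0.5]

loss_no_penalty = token_loss(exit_probs, router_weights, target=1, beta=0.0)
loss_with_penalty = token_loss(exit_probs, router_weights, target=1, beta=0.1)
```

With beta equal to zero, token_loss returns exactly the mixture cross-entropy, matching the reduction described above. In a real training loop the mixture log-likelihood would normally be computed in log space for numerical stability, which this sketch omits for brevity.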