PhaseAwareRMSNorm

In long-horizon TinyStories runs, PhaseAwareRMSNorm became a small late-training win. In the cumulative refinement experiments it eventually helped, even though it had hurt short-horizon performance in the full grid. PhaseAwareRMSNorm replaces the global RMSNorm with an independent normalization over each phase at the pre-attention, pre-FFN, and final norm sites. It does not change the parameter count, because its per-phase scale vectors concatenate to the model dimension.
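
The mechanism described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the experiments. It assumes that a "phase" is a contiguous slice of the feature dimension and that the phases together cover the full model dimension; the source does not define either point. The names `rms_norm`, `phase_aware_rms_norm`, `phase_scales`, and `eps` are hypothetical.

```python
import numpy as np

def rms_norm(x, scale, eps=1e-6):
    """Standard RMSNorm over the last axis with a learned scale vector."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * scale

def phase_aware_rms_norm(x, phase_scales, eps=1e-6):
    """Normalize each phase independently, then concatenate the results.

    ASSUMPTION: each phase is a contiguous slice of the last axis, and the
    slice widths are given by the lengths of the per-phase scale vectors.
    """
    sizes = [s.shape[0] for s in phase_scales]
    assert sum(sizes) == x.shape[-1], "phase widths must sum to the model dimension"
    split_points = np.cumsum(sizes)[:-1]
    chunks = np.split(x, split_points, axis=-1)
    normed = [rms_norm(c, s, eps) for c, s in zip(chunks, phase_scales)]
    return np.concatenate(normed, axis=-1)

# Illustrative sizes only; not taken from the experiments.
d_model, n_phases = 8, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(3, d_model))

global_scale = np.ones(d_model)
phase_scales = [np.ones(d_model // n_phases) for _ in range(n_phases)]

y = phase_aware_rms_norm(x, phase_scales)

# Parameter count is unchanged: the per-phase scales concatenate to d_model.
n_params_global = global_scale.size
n_params_phase = sum(s.size for s in phase_scales)
```

Under these assumptions, each phase of the output has unit RMS on its own, whereas global RMSNorm only constrains the RMS over the whole vector. The same function would be applied at the pre-attention, pre-FFN, and final norm sites.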