Autoregressive Transformers

1 source - 5 claims

Autoregressive generalization was tested with GPT-2-style transformers on the same parity datasets. The GPT models reproduced the two-clock structure of rule learning followed by memorization, and their rule learning was concentrated at the last bit of each parity group. GPT learned and memorized faster than matched DiT models, which led to a shorter innovation window. In the optimization ablations, weight decay delayed memorization more strongly for GPT than for DiT.
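
One possible reading of why rule learning would show up at the last bit of each parity group is that, for a left-to-right predictor, this is the only position the rule fully determines from the preceding bits. The sketch below illustrates that idea under stated assumptions: the source does not describe the dataset format, so the even-parity rule, `GROUP_SIZE`, `NUM_GROUPS`, and all function names are hypothetical and chosen only for illustration.

```python
import random

GROUP_SIZE = 4  # hypothetical; the source does not state the group size
NUM_GROUPS = 3  # hypothetical; the source does not state the group count


def sample_parity_sequence(rng, num_groups=NUM_GROUPS, group_size=GROUP_SIZE):
    """Sample a bit sequence in which every group has even parity (assumed rule)."""
    bits = []
    for _ in range(num_groups):
        free = [rng.randint(0, 1) for _ in range(group_size - 1)]
        bits.extend(free)
        bits.append(sum(free) % 2)  # last bit makes the group sum even
    return bits


def rule_prediction(prefix, group_size=GROUP_SIZE):
    """Return the next bit if the parity rule forces it, else None."""
    pos_in_group = len(prefix) % group_size
    if pos_in_group != group_size - 1:
        return None  # next bit is free: the rule says nothing about it yet
    start = len(prefix) - pos_in_group  # index where the current group begins
    return sum(prefix[start:]) % 2


def determined_positions(num_groups=NUM_GROUPS, group_size=GROUP_SIZE):
    """Positions whose bit is fixed by the rule given everything before them."""
    total = num_groups * group_size
    return [i for i in range(total) if i % group_size == group_size - 1]


if __name__ == "__main__":
    rng = random.Random(0)
    seq = sample_parity_sequence(rng)
    print("sequence:", seq)
    print("rule-determined positions:", determined_positions())
```

Under these assumptions, every other position is a fair coin flip given its prefix, so per-position accuracy on the last bit of each group is where a learned rule would be visible in an autoregressive model.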