ARL2
ARL2 is the first work to convert a pretrained autoregressive video diffusion model into a hybrid linear attention architecture. It decomposes self-attention into two parallel branches: softmax attention for intra-frame spatial structure, and a fixed-size recurrent state for inter-frame temporal memory.

At 1,005 frames, ARL2 completes generation using 43 GB of memory, while Causal Forcing runs out of memory at over 91 GB. At 50% layer replacement, ARL2 achieves the highest Quality Average (87.17) among all distilled models evaluated, improving temporal flickering by +0.99 and motion smoothness by +1.35 over its teacher. With 1.3B parameters, ARL2 matches the quality of MAGI-1, which has 4.5B parameters.

The improvements in temporal flickering and motion smoothness are attributed to the recurrent state providing more stable long-range context than a linearly growing KV cache. ARL2 underperforms the teacher model on spatial relationships and human actions, which is attributed to the inherent constraint of a fixed-size recurrent state.
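The two-branch decomposition and the memory argument can be illustrated with a minimal NumPy sketch. This is not ARL2's implementation: the function names, the elu(x)+1 feature map, the running normalizer, and the choice to sum the two branch outputs are all assumptions made for illustration, and the tensor sizes are toy values. It only shows how a per-frame softmax branch can run alongside a recurrent state whose size stays constant while an equivalent KV cache would grow with every frame.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phi(x):
    # elu(x) + 1: a positive feature map often used in linear attention
    # (assumption; the source does not say which map ARL2 uses)
    return np.where(x > 0, x + 1.0, np.exp(x))

def hybrid_attention_frame(q, k, v, state, norm):
    """Hypothetical single-frame step. q, k, v: (tokens, d); state: (d, d); norm: (d,)."""
    d = q.shape[-1]
    # Branch 1: softmax attention restricted to tokens of the current frame
    intra = softmax(q @ k.T / np.sqrt(d)) @ v
    # Branch 2: read temporal context from the fixed-size recurrent state
    fq = phi(q)
    denom = fq @ norm + 1e-6
    inter = (fq @ state) / denom[:, None]
    # Fold the current frame into the state; its shape never changes
    fk = phi(k)
    new_state = state + fk.T @ v
    new_norm = norm + fk.sum(axis=0)
    # Summing the branches is an assumption made for this sketch
    return intra + inter, new_state, new_norm

def generate(num_frames, tokens=16, d=8, seed=0):
    rng = np.random.default_rng(seed)
    state, norm = np.zeros((d, d)), np.zeros(d)
    kv_cache_elems = 0  # what a full KV cache would have to store instead
    for _ in range(num_frames):
        q, k, v = rng.standard_normal((3, tokens, d))
        out, state, norm = hybrid_attention_frame(q, k, v, state, norm)
        kv_cache_elems += 2 * tokens * d  # keys and values for one more frame
    return out, state.size + norm.size, kv_cache_elems
```

With these toy sizes the recurrent state holds d*d + d = 72 values no matter how many frames are generated, while the hypothetical KV cache grows by 2 * tokens * d = 256 values per frame. The same fixed capacity is the constraint the text cites for the weaker spatial-relationship and human-action scores.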