Autoregressive Video Diffusion

1 source - 4 claims

Autoregressive (AR) video diffusion systems generate video chunk by chunk in a causal, frame-wise manner and rely on KV caching for streaming inference. Their attention structure is heterogeneous: intra-frame attention is bidirectional while inter-frame attention is causal, unlike the homogeneous causal attention used in LLMs. Softmax self-attention inside Diffusion Transformers incurs O(N²) compute and O(N) memory in the sequence length N. In practice, the KV cache for a 5-second 480p video can exceed 34 GB, and attention accounts for approximately 75% of total latency after only 14 generated chunks.
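
The heterogeneous attention pattern described above can be illustrated with a small mask-building sketch. This is a minimal illustration, not the implementation of any particular system: the function names `block_causal_mask` and `kv_cache_bytes`, the frame-level granularity of the mask, and all model dimensions passed to them are assumptions made for the example. Real systems may group several frames into one chunk and use different cache layouts.

```python
import numpy as np


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Return a boolean (N, N) mask with N = num_frames * tokens_per_frame.

    mask[q, k] is True when query token q may attend to key token k.
    Hypothetical helper: assumes causality is enforced per frame.
    """
    # Frame index of every token, e.g. [0, 0, 1, 1, 2, 2].
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # Tokens in the same frame see each other (bidirectional);
    # tokens in earlier frames are visible, later frames are not (causal).
    return frame_id[:, None] >= frame_id[None, :]


def kv_cache_bytes(num_tokens: int, num_layers: int, num_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Rough KV cache size under an assumed dense layout.

    Keys and values (the leading factor 2) are stored for every layer,
    head, and cached token. All dimensions here are illustrative inputs.
    """
    return 2 * num_layers * num_heads * head_dim * num_tokens * bytes_per_value


if __name__ == "__main__":
    # 3 frames of 2 tokens each: a 6 x 6 block lower-triangular pattern.
    print(block_causal_mask(3, 2).astype(int))
```

By contrast, a standard LLM mask would be strictly lower triangular at the token level, so the two tokens of one frame could not both attend to each other. The cache estimate grows linearly in `num_tokens`, which matches the O(N) memory scaling stated above; no attempt is made here to reproduce the 34 GB figure, since the model dimensions behind it are not given.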