KV Cache

ARL2 reduces per-frame memory for hybrid layers by about 40%, from 293 MB to 175 MB. Unlike auxiliary linear-complexity modules, which keep the primary softmax path and its O(N) memory, ARL2 fully replaces inter-frame attention with constant-memory recurrence. The alternatives (sparse attention, KV cache quantization, and KV cache eviction) each have fundamental shortcomings, and none addresses both linear memory growth and proper streaming context management at once. KV cache eviction, for instance, bounds memory only by irreversibly discarding past context.
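
The contrast drawn above can be illustrated with a small toy model. The sketch below is not ARL2 and does not reproduce its update rule: the class names, the integer frame ids, the budget of 64, and the running-mean update are all hypothetical stand-ins chosen for illustration. It only shows how the three memory policies behave as frames arrive (a cache that grows with N, a cache bounded by eviction, and a fixed-size recurrent state), and it checks the arithmetic behind the quoted 293 MB to 175 MB saving.

```python
from collections import deque


def percent_reduction(before_mb, after_mb):
    """Relative memory saving in percent."""
    return 100.0 * (before_mb - after_mb) / before_mb


class GrowingKVCache:
    """Softmax-style cache: one entry per frame, so memory grows as O(N)."""

    def __init__(self):
        self.entries = []

    def add(self, frame_id):
        self.entries.append(frame_id)


class EvictingKVCache:
    """Sliding-window eviction: memory is bounded by `budget`,
    but frames that fall out of the window are lost for good."""

    def __init__(self, budget):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.entries = deque(maxlen=budget)

    def add(self, frame_id):
        self.entries.append(frame_id)


class RecurrentState:
    """Constant-memory recurrence: every frame is folded into one
    fixed-size state. The running mean used here is a placeholder
    update rule (an assumption), not the one ARL2 uses."""

    def __init__(self):
        self.state = 0.0
        self.frames_seen = 0

    def add(self, frame_id):
        self.frames_seen += 1
        self.state += (frame_id - self.state) / self.frames_seen


if __name__ == "__main__":
    full = GrowingKVCache()
    windowed = EvictingKVCache(budget=64)  # hypothetical budget
    recurrent = RecurrentState()

    for frame_id in range(1000):
        full.add(frame_id)
        windowed.add(frame_id)
        recurrent.add(frame_id)

    print("saving (%):", round(percent_reduction(293, 175), 1))
    print("growing cache entries:", len(full.entries))
    print("evicting cache entries:", len(windowed.entries))
    print("oldest frame still cached:", windowed.entries[0])
    print("frames folded into one state:", recurrent.frames_seen)
```

In this toy setting the evicting cache stays within its budget but no longer holds the earliest frames, while the recurrent state keeps a fixed footprint and still reflects every frame it has seen, albeit in compressed form.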