Memory-Limited Edge Inference

1 source - 5 claims

Profiling Vicuna-7B on an edge device shows that flash transfer dominates per-token latency. Standard autoregressive LLM decoding is memory-bound because each generated token requires reloading the weights and intermediate state. On edge devices, the dominant bottleneck shifts from HBM-to-SRAM transfer to flash-to-DRAM transfer. Jetson AGX Orin has too little unified DRAM to keep a 7B model fully resident, so the weights must be streamed from flash in chunks. Edge speculative decoding should therefore optimize flash-to-DRAM data movement rather than only maximizing the accepted-token count.
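
The trade-off described above can be made concrete with a back-of-the-envelope cost model. The following is a minimal sketch, not the profiled setup: the function names, the DRAM residency budgets, both bandwidth figures, and the accepted-token counts are hypothetical values chosen for illustration, and the model assumes every weight is read once per forward pass with no overlap between flash and DRAM transfers. The only fixed quantity is the weight size of a 7B-parameter model at 2 bytes per parameter.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabyte, in bytes


@dataclass
class PassCost:
    """Transfer time for one forward (verification) pass, in seconds."""
    flash_s: float  # time streaming non-resident weights from flash to DRAM
    dram_s: float   # time reading all weights from DRAM into the compute units

    @property
    def total_s(self) -> float:
        # Assumption: no overlap between the two transfers.
        return self.flash_s + self.dram_s


def pass_cost(weight_bytes: float, resident_bytes: float,
              flash_bw: float, dram_bw: float) -> PassCost:
    """Cost of one pass when only `resident_bytes` of weights stay in DRAM."""
    streamed = max(0.0, weight_bytes - resident_bytes)
    return PassCost(flash_s=streamed / flash_bw, dram_s=weight_bytes / dram_bw)


def per_token_latency(cost: PassCost, tokens_per_pass: float) -> float:
    """Amortize one pass over the tokens it yields (accepted draft tokens)."""
    return cost.total_s / tokens_per_pass


weights = 7e9 * 2          # 7B parameters at 2 bytes each = 14 GB
FLASH_BW = 2 * GB          # hypothetical flash read bandwidth, bytes/s
DRAM_BW = 200 * GB         # hypothetical DRAM bandwidth, bytes/s

# Baseline: hypothetical 8 GB of weights resident, 6 GB streamed per pass.
baseline = pass_cost(weights, 8 * GB, FLASH_BW, DRAM_BW)
flash_share = baseline.flash_s / baseline.total_s

# Plan A: more accepted tokens per pass, same flash traffic as the baseline.
latency_a = per_token_latency(baseline, tokens_per_pass=4.0)

# Plan B: fewer accepted tokens, but only 3 GB streamed from flash per pass.
plan_b = pass_cost(weights, 11 * GB, FLASH_BW, DRAM_BW)
latency_b = per_token_latency(plan_b, tokens_per_pass=3.0)

print(f"flash share of pass time: {flash_share:.1%}")
print(f"plan A per-token latency: {latency_a:.3f} s")
print(f"plan B per-token latency: {latency_b:.3f} s")
```

Under these assumed numbers, flash streaming accounts for nearly all of the pass time, and the plan that accepts fewer tokens but moves fewer bytes from flash ends up with the lower per-token latency. The real break-even point depends on measured bandwidths and on how much of the model a given device can actually keep resident.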