matlok's Collections
LMM

Papers - Speculative Decoding - KV Cache

We recognize two memory bottlenecks: model weights and the KV cache. The weights are a fixed cost, while the KV cache grows with context length and increasingly becomes the dominant bottleneck as contexts get longer.
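A minimal sketch of why the KV cache overtakes the weights: its size scales linearly with context length (and batch size), while weight memory stays constant. The function and the example model shape below (layer count, KV heads, head dimension, fp16 storage) are illustrative assumptions, not figures from any paper in this collection.

```python
# Rough KV-cache sizing for a decoder-only transformer (illustrative only).

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V across all layers at a given context length."""
    # Factor of 2 accounts for storing both the key and the value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16 everywhere.
weight_bytes = 7e9 * 2  # ~14 GB of fp16 weights, fixed regardless of context length
for ctx in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=ctx)
    print(f"context {ctx:>7,}: KV cache ~ {kv / 1e9:5.1f} GB "
          f"({kv / weight_bytes:.1%} of weight memory)")
```

Under these assumed numbers the cache is a few GB at 4K tokens but several times the weight memory at 128K, which is the motivation for the KV-cache-focused papers collected here.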