4-bit quantization does not reduce VRAM usage

#2
by fu-man - opened

Hi,

Technically, this model should use about 4 GB of VRAM. However, it consumes nearly 16 GB when I run it with vLLM.

Just wondering if I am missing something?

Regards,

Fu

Neural Magic org

The weights consume ~5 GB of GPU memory. The rest is memory that vLLM pre-allocates for KV caches, which it manages internally.
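
For context, the size of that pre-allocated pool is controlled by vLLM's `gpu_memory_utilization` setting (0.9 by default, i.e. vLLM reserves ~90% of the GPU regardless of model size), and `max_model_len` bounds how much KV cache each sequence can require. A minimal sketch of tuning both follows; the model id is a placeholder, not the specific model from this thread:

```python
from vllm import LLM, SamplingParams

# By default vLLM reserves ~90% of GPU memory (gpu_memory_utilization=0.9)
# for weights plus the pre-allocated KV cache pool, which is why a ~5 GB
# model can appear to "use" ~16 GB of VRAM.
llm = LLM(
    model="neuralmagic/your-quantized-model",  # placeholder id for illustration
    gpu_memory_utilization=0.35,  # reserve ~35% of VRAM instead of the default 0.9
    max_model_len=4096,           # smaller max context -> smaller KV cache reservation
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

Note that lowering these values trades away batch capacity and maximum context length, so throughput may drop under heavy load.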

robertgshaw2 changed discussion status to closed
