4-bit quantization does not reduce VRAM usage

#2
by fu-man - opened

Hi,

Technically, this model should use about 4 GB of VRAM. However, it consumes nearly 16 GB when I run it with vLLM.

Just wondering if I am missing something?

Regards,

Fu

Neural Magic org

The weights consume ~5 GB of GPU memory. The rest is memory that vLLM pre-allocates for KV caches, which it manages internally.
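
For context, the size of that pre-allocated pool is controlled by vLLM's `gpu_memory_utilization` setting (0.9 by default, i.e. vLLM reserves ~90% of the GPU regardless of model size), and `max_model_len` bounds how much KV cache each sequence can require. A minimal sketch of tuning both follows; the model id is a placeholder, not the specific model from this thread:

```python
from vllm import LLM, SamplingParams

# By default vLLM reserves ~90% of GPU memory (gpu_memory_utilization=0.9)
# for weights plus the pre-allocated KV cache pool, which is why a ~5 GB
# model can appear to "use" ~16 GB of VRAM.
llm = LLM(
    model="neuralmagic/your-quantized-model",  # placeholder id for illustration
    gpu_memory_utilization=0.35,  # reserve ~35% of VRAM instead of the default 0.9
    max_model_len=4096,           # smaller max context -> smaller KV cache reservation
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

Note that lowering these values trades away batch capacity and maximum context length, so throughput may drop under heavy load.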

robertgshaw2 changed discussion status to closed
