Problem with VRAM usage when running the quantized model
#5
by mlaszlo - opened
Hello,
When I run inference with this model (the 1B one) loaded in 4-bit from Hugging Face, I don't see any speed improvement, and when I run nvidia-smi I can see my VRAM going from 1 GB to 4 GB while inferring on only one image. I'm running inference on a Tesla T4 without flash attention, since it doesn't support Turing GPUs. I don't understand why the memory is going up this way.
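For context, the loading code looks roughly like this (a minimal sketch; the model ID is a placeholder for the 1B checkpoint I'm using):

```python
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# Sketch of a 4-bit load with bitsandbytes; model ID is a placeholder.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute/activations still run in fp16
)

model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2-1B",  # placeholder: the 1B checkpoint
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL2-1B", trust_remote_code=True
)
```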
Many thanks in advance
Hello,
4-bit quantization mainly reduces the VRAM needed to store the model weights; it does not necessarily make inference faster, and the activations and KV cache produced during a forward pass are still allocated in half precision, which is why memory grows once you actually run an image through the model. If you're looking for better speed and memory efficiency, I recommend trying the lmdeploy inference framework. It could offer more efficient performance on your setup.
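As a rough illustration (not tested on your exact checkpoint; the model ID and image URL are placeholders), serving the model with lmdeploy looks something like this:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Minimal lmdeploy sketch; model ID and image are placeholders.
pipe = pipeline(
    "OpenGVLab/InternVL2-1B",          # placeholder checkpoint
    backend_config=TurbomindEngineConfig(
        session_len=8192,              # cap context length
        cache_max_entry_count=0.2,     # fraction of free VRAM reserved for the KV cache
    ),
)

image = load_image("https://example.com/example.jpg")  # placeholder image
response = pipe(("describe this image", image))
print(response.text)
```

The `cache_max_entry_count` setting is the main knob for how much VRAM the KV cache pre-allocates, so lowering it helps on a memory-constrained card like the T4.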
I hope this helps!
czczup changed discussion status to closed