Problem with VRAM usage when running the quantized model
#5
by mlaszlo - opened
Hello,
When I run inference with this model (the 1B one) loaded in 4-bit from Hugging Face, I don't see any speed improvement, and when I run nvidia-smi I can see my VRAM going from 1 GB to 4 GB while inferring on only one image. I'm running inference on a Tesla T4 without flash attention, since it doesn't support Turing GPUs. I don't understand why the memory is going up this way.
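For context, the loading code looks roughly like this (a minimal sketch; the model ID is a placeholder for the 1B checkpoint I'm using):

```python
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# Sketch of a 4-bit load with bitsandbytes; model ID is a placeholder.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute/activations still run in fp16
)

model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2-1B",  # placeholder: the 1B checkpoint
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL2-1B", trust_remote_code=True
)
```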
Many thanks in advance
Hello,
4-bit quantization mainly reduces the VRAM needed to store the model weights; it does not necessarily make inference faster, and the activations and KV cache produced during a forward pass are still allocated in half precision, which is why memory grows once you actually run an image through the model. If you're looking for better speed and memory efficiency, I recommend trying the lmdeploy inference framework. It could offer more efficient performance on your setup.
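As a rough illustration (not tested on your exact checkpoint; the model ID and image URL are placeholders), serving the model with lmdeploy looks something like this:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Minimal lmdeploy sketch; model ID and image are placeholders.
pipe = pipeline(
    "OpenGVLab/InternVL2-1B",          # placeholder checkpoint
    backend_config=TurbomindEngineConfig(
        session_len=8192,              # cap context length
        cache_max_entry_count=0.2,     # fraction of free VRAM reserved for the KV cache
    ),
)

image = load_image("https://example.com/example.jpg")  # placeholder image
response = pipe(("describe this image", image))
print(response.text)
```

The `cache_max_entry_count` setting is the main knob for how much VRAM the KV cache pre-allocates, so lowering it helps on a memory-constrained card like the T4.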
I hope this helps!
czczup changed discussion status to closed