Quants

#1 opened by CyberTimon

Hello!

First of all, this model is amazing. It works similarly to LLaVA Vicuna 7B. Do you know if I can quantize this model to something like 4 or 5 bits? Can I run it with llama.cpp, or what do you recommend for inference? I'm looking for a way that uses less VRAM.

Thank you very much!

Kind regards
Timon Käch

There’s no reason it can’t be quantized, I just don’t know how to do it 😭😭

What tool do people typically use to quantize models? Is there a good tutorial that explains how to do it?

Hey

Thanks for the fast response. Most people quantize models with either llama.cpp or AutoGPTQ.
It's already quite late for me, but I will definitely try to quantize it tomorrow and run it in my pipeline. The model is really amazing; it works better than LLaVA 7B or even 13B in my early tests.
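For reference, the usual AutoGPTQ flow looks roughly like this (just a sketch based on AutoGPTQ's standard usage; the calibration text and output directory are placeholders, and moondream's custom architecture may not be supported out of the box):

```python
# Rough AutoGPTQ 4-bit quantization sketch (untested for moondream1).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "vikhyatk/moondream1"

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4 bits
    group_size=128,
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# A real run needs a proper calibration set; this single sentence is a placeholder.
examples = [tokenizer("Describe the objects visible in the image.")]

# This will fail if AutoGPTQ does not recognize moondream's custom model type.
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, trust_remote_code=True)
model.quantize(examples)
model.save_quantized("moondream1-4bit-gptq")
```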

Maybe it works out of the box with the bitsandbytes quantization support in Transformers; let me check tomorrow. I'll let you know whether it worked.
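Something along these lines would be the first thing to try (just a sketch, assuming the checkpoint loads through AutoModelForCausalLM with trust_remote_code; not tested yet):

```python
# Sketch: load moondream1 in 4-bit with bitsandbytes via Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "vikhyatk/moondream1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```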

Thanks, you created an amazing model!

This is an amazing model; it works fast on my RTX 3090. It would be helpful to have a quantized version, although the footprint is already small.

@CyberTimon did you try quantizing with llama.cpp? This model runs fast and is accurate enough for personal use on my Mac. I tried it on a Raspberry Pi, but the process gets killed because of the model's size. I think GGML would help here, just like the quantized LLaVA build that runs on an 8 GB Raspberry Pi, albeit at 1.5 tokens/sec. This should run really fast.

Nope, I didn't try it, since it's built differently from other LLaVA models.

https://huggingface.co/vikhyatk/moondream1/discussions/8

Sounds like vikhyatk is already working to correct the oversight of using fp32 instead of fp16 when merging the models. The model was larger than necessary to begin with, so this should shrink the base model, which will be nice. Quantization would then give even further benefits on top of the fp16 weights.
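If the fp16 re-upload isn't out yet, casting the weights locally is a quick stopgap (a sketch, assuming the checkpoint loads via AutoModelForCausalLM with trust_remote_code; the output path is just an example):

```python
# Sketch: re-save the fp32 checkpoint as fp16 to roughly halve its on-disk size.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream1",
    torch_dtype=torch.float16,  # load/cast weights to fp16
    trust_remote_code=True,
)
model.save_pretrained("moondream1-fp16")
```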
