Quantized model coming?

#3
by dnhkng - opened

Do you plan to release a quantized version, or do you have tips on the best way to quantize this model for local inference?

Hi, I just uploaded two quantized versions, an 8-bit and a 4-bit, using BitsAndBytes conversions. I'd be curious how they work for you!
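For anyone trying them, here is a minimal loading sketch following the main model card's usage; load_in_8bit is the standard transformers/BitsAndBytes flag, and the exact repo id of a pre-quantized upload may differ from the base path shown here:

import torch
from transformers import AutoModel, AutoTokenizer

# Base repo from the model card; swap in the id of a pre-quantized upload if you prefer
path = "OpenGVLab/InternVL-Chat-V1-5"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    load_in_8bit=True,  # BitsAndBytes 8-bit quantization applied at load time
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)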

A GGUF would be nice for people who do not have NVIDIA GPUs.
Update: oh, it is not yet supported in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/6803

OpenGVLab org

Thank you for your attention. We plan to release the quantized model in the next few days. We have successfully implemented 8-bit quantization, but encountered some issues with 4-bit.

Are there plans to look into the issues with the 4-bit version?

https://huggingface.co/failspy/InternVL-Chat-V1-5-4bit
This fits on my 3090, but for the attached image with the following prompt (question), the response is only an empty string "":

question = "Please describe the picture in detail"
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)

I was following https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5#model-usage and only switched path = "OpenGVLab/InternVL-Chat-V1-5" to path = "failspy/InternVL-Chat-V1-5-4bit" and used load_in_4bit instead of load_in_8bit, but got "" as the response from the model. :( At least I hope this means I will eventually be able to run a working one; the online demo has impressed me big time!
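Spelled out, the loading code differed from the model card only in the path and the quantization flag, roughly:

import torch
from transformers import AutoModel, AutoTokenizer

path = "failspy/InternVL-Chat-V1-5-4bit"  # instead of "OpenGVLab/InternVL-Chat-V1-5"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    load_in_4bit=True,  # instead of load_in_8bit=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)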


@nofreewill

I have the same issue with failspy/InternVL-Chat-V1-5-4bit on my 3090 Ti... Possibly something with the quantization_config in the config.json? I'm curious whether anyone has actually run this locally in 24GB of VRAM...
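One untested thing to try (just a guess): quantize the original OpenGVLab checkpoint on the fly with an explicit BitsAndBytesConfig, instead of relying on whatever quantization_config is baked into the 4-bit repo, so the 4-bit settings (in particular the compute dtype) are under your control:

import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# Quantize the base checkpoint at load time with explicit settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype can affect generation quality
)

path = "OpenGVLab/InternVL-Chat-V1-5"
model = AutoModel.from_pretrained(
    path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)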

I tried it locally on a 22GB-VRAM NVIDIA A10G with an AWQ quant, but there was not enough memory. How much is needed in total for the 4-bit quant?
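For a very rough back-of-envelope, assuming the combined model is around 26B parameters (the ~6B vision encoder plus the ~20B language model):

params = 26e9                       # assumed total parameter count
weights_gib = params * 0.5 / 2**30  # 4-bit = 0.5 bytes per weight
print(f"{weights_gib:.1f} GiB")     # ~12 GiB for the quantized weights alone
# Some layers (norms, embeddings, lm_head) stay in higher precision, and the
# vision features / KV cache add a few more GiB, which is roughly consistent
# with the reports above that it fits in 24GB but not in 22GB.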

czczup changed discussion status to closed
