May I ask if there are plans to provide 8-bit or 4-bit quantized versions?
Thank you for creating the StarCoder model. However, it seems that only GPUs like the A100 will be able to run inference with this model. May I ask if there are plans to provide 8-bit or 4-bit quantized versions?
Hi @intelligencegear,
You can already use the 8-bit model out of the box by installing bitsandbytes and accelerate. Just run the following:
from transformers import AutoModelForCausalLM

# load_in_8bit applies LLM.int8() quantization at load time; device_map="auto" places layers via accelerate
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", load_in_8bit=True, device_map="auto")
...
You can take a deeper look at what it uses under the hood here.
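If you want to sanity-check what that flag actually does, here is a minimal sketch (assuming bitsandbytes is installed; the counting logic is just mine, not part of transformers) that shows how many linear layers get swapped for bitsandbytes' 8-bit Linear8bitLt modules:

import bitsandbytes as bnb
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load with LLM.int8() applied at load time, same call as above.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder", load_in_8bit=True, device_map="auto"
)

# Count the linear layers that bitsandbytes replaced with its 8-bit implementation.
int8_linears = sum(isinstance(m, bnb.nn.Linear8bitLt) for m in model.modules())
plain_linears = sum(type(m) is nn.Linear for m in model.modules())
print(f"8-bit linear layers: {int8_linears}, full-precision linear layers: {plain_linears}")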
load_in_8bit != GPTQ quantized. The performance is vastly different, with GPTQ being superior.
These are still a work in progress, but you can get early versions here: https://huggingface.co/mayank31398
I believe the running time is not as good as we'd like, but the quality of the generated code seems good. A full evaluation is pending.
The work is awesome, and I have tested it myself. However, a directly usable checkpoint may not be available due to licensing reasons; instead, the weights have to be converted locally in machine memory. In other words, a machine with something like 64 GB of RAM is required, so the hardware requirement is still relatively high.
load_in_8bit != GPTQ quantized. The performance is vastly different, with GPTQ being superior.
Yes, that's exactly what I wanted to say.
Yes, LLM.int8() works quite differently from GPTQ. LLM.int8(), which is what load_in_8bit=True enables, does quantization at load time. GPTQ, on the other hand, uses Optimal Brain Quantization for GPT-like models, and this requires calibration samples for quantization.
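To make that concrete, here is roughly what the GPTQ side looks like. This is only a sketch using the auto-gptq library; the calibration texts, bit width, and group size are my own assumptions, not the settings used for the checkpoints linked above:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Unlike LLM.int8(), GPTQ needs calibration samples to solve for the quantized weights.
calibration_texts = [
    "def fibonacci(n):\n    if n < 2:\n        return n",
    "import numpy as np\n\ndef softmax(x):",
    "class LinkedList:\n    def __init__(self):",
]
examples = [tokenizer(text, return_tensors="pt") for text in calibration_texts]

# 4-bit config; group_size=128 is a common choice, not a recommendation from this thread.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(examples)  # per-layer Optimal Brain Quantization-style solve over the samples
model.save_quantized("starcoder-gptq-4bit")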
@mayank31398 Do you think there will be a front end or interactive mode that works with your repo? SantaCoder is great, but without a chat-like interface that can maintain context, StarCoder is pretty much unusable except for very specific situations. Thanks!
Hey @syntaxing, there is already a model called StarChat. Demo here: https://huggingface.co/spaces/HuggingFaceH4/starchat-playground
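If you want to try it programmatically rather than through the Space, something like the following should work. The model id and the dialogue template with the special tokens are based on my reading of the StarChat model card, so treat them as assumptions and double-check there:

from transformers import pipeline

# StarChat-alpha is the chat-tuned StarCoder variant behind that playground.
chat = pipeline("text-generation", model="HuggingFaceH4/starchat-alpha", device_map="auto")

# Dialogue template with the special tokens as described on the model card (assumed here).
prompt = "<|system|>\n<|end|>\n<|user|>\nWrite a Python function that reverses a string.<|end|>\n<|assistant|>"
output = chat(prompt, max_new_tokens=128, do_sample=True, temperature=0.2)
print(output[0]["generated_text"])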
@mayank31398 @syntaxing It is indeed a different process, but wasn't it shown in this blog post that LLM.int8() does not result in statistically significant performance degradation? In that case, what's the benefit of going with a process that requires data samples and GPU hours (talking about the 8-bit case, not the 4-bit case)?
Mainly asking to validate whether I'm leaving performance on the table by using bitsandbytes.
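For the memory side of that trade-off, a rough way to compare the two loading modes locally is something like this sketch (the prompt and generation settings are arbitrary; the quality question still needs a proper evaluation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("def quicksort(arr):", return_tensors="pt")

for kwargs in ({"torch_dtype": torch.float16}, {"load_in_8bit": True}):
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", **kwargs)
    # get_memory_footprint() reports parameter and buffer memory in bytes.
    print(kwargs, f"-> {model.get_memory_footprint() / 1e9:.1f} GB")
    out = model.generate(**inputs.to(model.device), max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()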
@JacopoBandoni sorry for the late reply. GPTQ and LLM.int8() are completely different quantization algorithms; please refer to their respective papers for details.