This is cooked

#2
by supercharge19 - opened

This model is cooked, overcooked actually. Could you try dynamic quantization, or is this already the best possible quantization for this class (1.58-bit)?

Also, what is the best quantization for the 1B instruct model? Could you please also try 4-bit, but dynamic, as suggested by Unsloth?

Technology Innovation Institute org

Hi @supercharge19
1.58-bit BitNet models are a completely separate type of model. As you can see from the performance section of the model card, we do not claim this is a SoTA model; it's an open model for research purposes. This model was built on recent research around 1.58-bit models: https://huggingface.co/blog/1_58_llm_extreme_quantization / https://huggingface.co/papers/2402.17764 (more to come in the upcoming technical report). 1.58-bit models can be really exciting for the future because they combine an extreme compression rate with the fact that multiplications are not required to run them (except for the LM head). If we demonstrate in the future that we can get very competitive 1.58-bit models, it would create exciting opportunities, e.g. specialized "almost-matmul-free" hardware (since only the LM head would still require matrix multiplications).
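To make the "almost matmul-free" point concrete, here is a tiny illustrative sketch (plain NumPy, not the actual BitNet kernels) of why a linear layer whose weights are restricted to {-1, 0, +1} needs no multiplications:

```python
# Illustrative sketch only (not the real BitNet kernel): with ternary weights
# in {-1, 0, +1}, a "matmul" reduces to adding or subtracting activations,
# so no floating-point multiplications are needed for the linear layers.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)            # activations for one token
W = rng.integers(-1, 2, size=(4, 8))  # ternary weight matrix

# Regular matmul for reference
y_matmul = W @ x

# Multiplication-free equivalent: add activations where the weight is +1,
# subtract them where it is -1, ignore zeros.
y_addsub = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert np.allclose(y_matmul, y_addsub)
print(y_addsub)
```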
Feel free to read more via the resources shared above and: https://github.com/microsoft/BitNet
You can of course quantize the original model (tiiuae/Falcon3-10B-Instruct) using mature methods such as 4-bit bitsandbytes or any other quantization scheme supported here: https://huggingface.co/docs/transformers/quantization/overview. Since the architecture is Llama-based, that will give you much better performance than this model.
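For example, a minimal sketch of the 4-bit bitsandbytes route (the generation settings, compute dtype and device mapping below are just placeholders for illustration, not a recommended configuration):

```python
# Sketch: load the original Falcon3-10B-Instruct with 4-bit bitsandbytes quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_id = "tiiuae/Falcon3-10B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Explain 1.58-bit quantization in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```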

Thank you for responding.
I thought BitNet was trained from scratch, not at f16 or some other precision but with 0s and 1s (or, if I recall correctly, -1, 0, 1), so its generation quality did not suffer: the network learned its connections differently than if it had been trained in 16-bit and the precision then thrown away with 1.58-bit quants.
I'm not saying this model is like BitNet; I was just hoping the dynamic quantization method suggested by Unsloth in their blog was used: https://unsloth.ai/blog/dynamic-4bit (i.e., that most layers retained generation quality despite being quantized, or were left unquantized where quality would suffer).
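(For what it's worth, Unsloth's dynamic quants decide automatically which layers stay in higher precision; a rough manual approximation of that idea with plain bitsandbytes would be to exclude the most sensitive modules from quantization, as in the sketch below. The skip list is only a hypothetical example, not Unsloth's actual selection.)

```python
# Sketch: keep selected modules unquantized while 4-bit quantizing the rest.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Modules listed here are left in their original precision
    # (hypothetical example list, not a tuned selection).
    llm_int8_skip_modules=["lm_head", "mlp.down_proj"],
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon3-10B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```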

I am just wondering if it is even possible to quantize a model to 1-bit or even sub-bit quants while keeping the same or close-to-original quality, or whether we simply can't go lower than 4-bit.
