Quant?

#1
by zappa2005 - opened

Hi Maziyar, thanks for the upload! Zephyr and Mistral-Instruct seem like a very good combination for RP. Can you help me find the actual bpw for this model? It is not in the name, and I have no idea where else to find the value in the model card.

Thank you!

Hi @zappa2005

Thanks for the feedback, I am glad it is interesting for RP tasks.

Can you help me to find the actual bpw for this model

Absolutely! As far as I know, this is a merge of the original mistralai/Mistral-7B-Instruct-v0.2 and HuggingFaceH4/zephyr-7b-beta models in 16-bit (unless mergekit used 32-bit by default), so this model is not quantized. I know how to do GPTQ if that's helpful, and I can upload it afterwards.

GPTQ is 4-bit then? I am not clever enough to quantize it myself either, but GPTQ would be a nice test!

Btw, what is the difference between loading an FP16 model via "load-in-8bit" or "load-in-4bit" compared to a 4 or 8 bpw quant?

EDIT: Answering myself: load-in-8bit is dramatically slower at inference than loading in 16-bit. :-)

@zappa2005 Actually, I now know how to do both GGUF and GPTQ. So you can have anything from 2-bit all the way up to 8-bit in GGUF (CPU and GPU), or GPTQ (4-bit, GPUs only).
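If it helps, here is a rough sketch of what 4-bit GPTQ quantization looks like through the transformers integration (it needs optimum and auto-gptq installed, and the fp16 repo id below is an assumption based on the GGUF repo name):

```python
# Sketch: 4-bit GPTQ quantization via the transformers integration.
# Assumes the fp16 merge lives at the repo id below (not confirmed here).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "MaziyarPanahi/zephyr-7b-beta-Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a small calibration set; "c4" is one of the built-in options.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,  # weights are quantized while loading
    device_map="auto",
)
model.save_pretrained("zephyr-mistral-merge-GPTQ")  # reusable 4-bit checkpoint
```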

The native load-in-4-bit / load-in-8-bit in Hugging Face uses bitsandbytes. I read a good comment on the difference between that and GPTQ:

GPTQ uses Integer quantization + an optimization procedure that relies on an input mini-batch to perform the quantization. Once the quantization is completed, the weights can be stored and reused. Bitsandbytes can perform integer quantization but also supports many other formats. However, bitsandbytes does not perform an optimization procedure that involves an input mini-batch to perform quantization. That is why it can be used directly for any model. However, it is less precise than GPTQ since information of the input data helps with quantization. - https://github.com/TimDettmers/bitsandbytes/issues/539
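For comparison, here is a minimal sketch of loading the merge on the fly in 4-bit with bitsandbytes (the fp16 repo id and the generation settings are illustrative):

```python
# Sketch: on-the-fly 4-bit loading with bitsandbytes (no calibration step).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "MaziyarPanahi/zephyr-7b-beta-Mistral-7B-Instruct-v0.2"  # assumed repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 4-bit weights
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```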

Thank you! I could pre-test it with Transformers and load-in-4bit, which even copes with 32k context on my 16 GB 4080 - which is nice!

If the manually processed quant is better, I'd also like to test the 4-bit and 5-bit GGUF. But only if you have time for it - I should seriously start looking into quantizing stuff myself...

I am glad it's useful to you. All the GGUF files (4-bit and 5-bit) are here; you can easily download and test them: https://huggingface.co/MaziyarPanahi/zephyr-7b-beta-Mistral-7B-Instruct-v0.2-GGUF/tree/main

I was told Q5_K_M quantized models are the best in terms of quality-to-VRAM ratio.
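If you'd rather script it than click through, here is a sketch of downloading the Q5_K_M file and running it with llama-cpp-python (the exact filename inside the repo is an assumption):

```python
# Sketch: fetch the Q5_K_M GGUF from the repo and run it with llama-cpp-python.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="MaziyarPanahi/zephyr-7b-beta-Mistral-7B-Instruct-v0.2-GGUF",
    filename="zephyr-7b-beta-Mistral-7B-Instruct-v0.2.Q5_K_M.gguf",  # assumed filename
)

llm = Llama(
    model_path=gguf_path,
    n_ctx=32768,      # Mistral v0.2 supports 32k context
    n_gpu_layers=-1,  # offload as many layers as fit to the GPU
)
print(llm("[INST] Hello! [/INST]", max_tokens=64)["choices"][0]["text"])
```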

I noticed you added details to your model info page, including the prompt style (it was missing before). Your sample looks like ChatML, but Mistral uses this [INST] xxx [/INST] stuff.

So I should switch to ChatML with the HTML-like tags?

That's just an example. Since this is a merge of two different models, two (or more) different prompt templates can be used. But I'd agree that the safest choice is to go with the Mistral template; if not, the template from the second original model would be an option as well.
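For reference, this is roughly what the two styles look like written out by hand (placeholder messages; a sketch, not the definitive template for this merge):

```python
# Mistral-Instruct style: [INST] ... [/INST]; there is no dedicated system role,
# so any system text is usually folded into the first user turn.
mistral_prompt = "<s>[INST] You are a helpful assistant. Write a short greeting. [/INST]"

# ChatML style: <|im_start|>/<|im_end|> tags around each role.
chatml_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a short greeting.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```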

I tried the Q8_0 quant and did not manage to run inference with it. It loaded with llama.cpp, but upon querying it in ooba it just spat out gibberish. It crashed with llama.cpp_HF and crashed the same way with koboldCPP, btw.

[WinError -529697949] Windows Error 0xe06d7363

Can you confirm, or is it something on my side?

@zappa2005

I cannot load any GGUF file for this model in my LM Studio! Initially, I thought maybe Mistral v0.2 requires a larger context, so I would need more VRAM, but I can load other merges based on Mistral v0.2.

There must be something off with zephyr-7b-beta. It says I don't have enough memory even for the Q2! I will convert to GGUF again, test it there in 16-bit first, and then test the quantized models.
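Roughly, the re-conversion follows the usual llama.cpp flow; script and binary names vary between llama.cpp versions (convert.py vs. convert_hf_to_gguf.py, quantize vs. llama-quantize), so treat this as a sketch with placeholder paths:

```python
# Sketch: convert an HF checkpoint to GGUF, then quantize, driven from Python.
import subprocess

hf_model_dir = "zephyr-7b-beta-Mistral-7B-Instruct-v0.2"  # local fp16 checkout (placeholder)

# 1. Convert the Hugging Face checkpoint to a 16-bit GGUF file.
subprocess.run(
    ["python", "convert.py", hf_model_dir,
     "--outtype", "f16", "--outfile", "model.f16.gguf"],
    check=True,
)

# 2. Quantize the fp16 GGUF down to Q8_0 (or Q5_K_M, Q4_K_M, ...).
subprocess.run(
    ["./quantize", "model.f16.gguf", "model.Q8_0.gguf", "Q8_0"],
    check=True,
)
```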

I re-did everything, and now I can load any of them in LM Studio (before, they were failing to even load!).

Thank you, I'm going to retest the Q8_0.gguf.
