Question about which .bin file to use and quantization

#7 opened by florestankorp

Hi all!

After a few weeks of leaving my system dormant, I decided to jump back into the local LLM frenzy, and I'm pleased to say that after some tinkering with Conda environments and Python packages yesterday, I managed to run Wizard-Vicuna-13B-Uncensored on my Apple M2 Pro (16 GB RAM).

Now, when I inspected the files on Hugging Face, I saw a bunch of new files that I hadn't encountered the last time I played with LLMs. Since my machine is not ideal for running LLMs locally, my understanding is that I need 4-bit quantized GGML models, so here we are.

Because I didn't have time to download all the files, I started with these two:

  • Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
  • Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_1.bin

By the way, I assumed that both files are needed to make up the complete model, like the parts of a multi-part .zip archive. Is that accurate, or can I get away with using only one? And how would that affect output quality and speed?

Leaving only the q4_0.bin and q4_1.bin files in my /models folder and then starting the chat gave me a decent 5 tokens/s, and I've been chatting away with this wonderful model. But now my question is: what are all the other files for, and could they help me improve the quality and/or speed of my chat experience?

For example, when I have only the *_K_M.bin files in the folder, I can't even get the model to load. So what are these files for? From the commit messages on TheBloke's Hugging Face page I've gathered it's something pertaining to k-quants, but I can't find any information on what they're used for...

Then, would I benefit from running ONLY the q8_0.bin, for example, or is my hardware not equipped to handle that load? Am I better off sticking with q4_0.bin, or should I go for the "newer" q5_0.bin?

Here are all files available for download in the "Files and versions" tab:

    Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q3_K_L.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q3_K_M.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q3_K_S.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_1.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_0.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_1.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_K_M.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_K_S.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q6_K.bin
    Wizard-Vicuna-13B-Uncensored.ggmlv3.q8_0.bin

PS: I run the model through oobabooga/text-generation-webui with the following command:

python server.py --threads=8 --gpu-memory=10 --mlock --chat --model=TheBloke_Wizard-Vicuna-13B-Uncensored-GGML

You only need one file. q4_0 and q4_1 are two different quantisations of the same model; you can use either, but you don't need both.

As for the different sizes: the larger the file the better the accuracy, but the more resources required and the slower the speed. q4_0 and q5_0 are good compromises.
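
To put rough numbers on that, here's a quick back-of-the-envelope sketch in Python. The bits-per-weight figures are approximations I'm assuming for the ggmlv3 formats (real files differ a little because some tensors use other types), so treat the output as an estimate:

    # Rough size estimator for GGML quantisation formats.
    # The bits-per-weight values are approximate figures I'm assuming for
    # ggmlv3; actual file sizes vary slightly.
    BITS_PER_WEIGHT = {
        "q2_K": 2.6, "q3_K_M": 3.9, "q4_0": 4.5, "q4_1": 5.0,
        "q4_K_M": 4.8, "q5_0": 5.5, "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5,
    }

    def est_gib(n_params: float, quant: str) -> float:
        """Estimated model size in GiB for a model with n_params weights."""
        return n_params * BITS_PER_WEIGHT[quant] / 8 / 1024**3

    for q in ("q4_0", "q5_0", "q8_0"):
        print(f"13B {q}: ~{est_gib(13e9, q):.1f} GiB")
    # Prints roughly: q4_0 ~6.8 GiB, q5_0 ~8.3 GiB, q8_0 ~12.9 GiB.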

The formats with the letter K in their name are a new type of quantisation, the so-called k-quants. They're generally better than the old types, but they don't yet have as wide support. I don't think they work in text-generation-webui yet, for example, which would explain why you couldn't load them.
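
If you want to experiment with the k-quant files without waiting for text-generation-webui, here's a minimal sketch using llama-cpp-python, assuming your installed version is built against a llama.cpp recent enough to include k-quant support (the prompt format here is just an illustration):

    # Load a k-quant file directly, bypassing text-generation-webui.
    # Assumes llama-cpp-python is built against a llama.cpp with k-quant support.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin",
        n_threads=8,
    )
    out = llm("USER: Why is the sky blue?\nASSISTANT:", max_tokens=64)
    print(out["choices"][0]["text"])

If this loads and generates, the file itself is fine and it's purely a question of frontend support.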

As you only have 16 GB RAM, I'd suggest limiting yourself to 13B models in q4_0 for now: a 13B q4_0 file is roughly 7 GB, which leaves headroom for the context and the OS, whereas q8_0 at roughly 13-14 GB would leave almost none.

FYI, llama.cpp recently added Metal acceleration which should give you much better performance. Again I don't think that's yet supported in text-generation-webui, but I would expect it to come in the next week or so. It won't enable you to run larger models (you're still limited by that 16GB), but it will give you much faster performance.
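
If you want to try Metal in llama.cpp directly in the meantime, the steps as I understand them at the time of writing are roughly as follows (check the llama.cpp README in case the flags have changed):

    # Build llama.cpp with Metal support:
    LLAMA_METAL=1 make

    # Run with the model offloaded to the GPU (-ngl = --n-gpu-layers):
    ./main -m models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin -ngl 1 -p "Hello"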

I have tested some of the new quantizations.
It looks like the 4-bit models, especially q4_K_S, work better on Nvidia 2000-series cards (I have a 2080). The other bit depths (I tried everything except 2-bit) ended with incomplete sentences, while q4_K_S finished all of them in the right context. I find that weird, because 4-bit is half of 8-bit, yet it's the one that finishes generations without cutting them off.
P.S. I mean when I try to generate 512 tokens. As for performance, they all run about the same, with no noticeable boost.

A question for TheBloke: which is better in performance, ggmlv3.q5_K_S.bin or ggmlv3.q5_K_M.bin? Any difference in accuracy?

q5_K_M will be slightly better in accuracy. The bigger the file, the better the accuracy, but also the more RAM/VRAM needed and the slower inference will be. So you pick the right compromise for your needs/hardware.
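
To put rough numbers on the difference (using approximate bits-per-weight figures of ~5.5 for q5_K_S and ~5.7 for q5_K_M, so treat these as estimates): for a 13B model that works out to about 13e9 × 5.5 / 8 ≈ 8.9 GB versus 13e9 × 5.7 / 8 ≈ 9.3 GB, so the M variant only costs a few hundred extra megabytes for its slightly better accuracy.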
