Which is better? GGML of the GPTQ version, or of the merged deltas?

#3 opened by Reggie

Hi,
Great work! Was just wondering if this GGML is expected to be better than, say, a GGML converted from your GPTQ version?

So FYI my GGMLs are not converted from the GPTQs. They're converted directly from the unquantised original repo. I used to try making GGMLs from 4bit GPTQs, but the files scored very badly on artificial benchmarks and I could never understand why. And then the llama.cpp quantisation methods improved significantly, so there was no point even trying that.
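For anyone curious, that direct conversion is just llama.cpp's convert script run against the original repo. Roughly like this - the script name and flags vary between llama.cpp versions, and the path is a placeholder:

# Produce an fp16 GGML file from the original unquantised model
# (newer llama.cpp uses convert.py; older versions had convert-pth-to-ggml.py)
python3 convert.py /path/to/original-model --outtype f16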

In terms of inference quality, I believe the quantised GGMLs have now overtaken GPTQ in benchmarks. There's an artificial LLM benchmark called perplexity. GPTQ scores well and used to be better than q4_0 GGML, but recently the llama.cpp team have done a ton of work on 4bit quantisation and their new methods q4_2 and q4_3 now beat 4bit GPTQ in this benchmark.

Then the new 5bit methods q5_0 and q5_1 are even better than that.

So if you want the absolute maximum inference quality - but don't have the resources to load the model in 16bit or 8bit - you would go for 4bit or 5bit GGML.
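To make that concrete, producing one of these quantisations from an fp16 GGML is a single step with llama.cpp's quantize tool. A rough sketch, with placeholder filenames:

# Quantise the fp16 GGML to q5_1 (swap q5_1 for q4_2, q4_3, q5_0 etc.)
./quantize ggml-model-f16.bin ggml-model-q5_1.bin q5_1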

However in practical usage most people aren't going to be able to tell the difference between a very good quantisation and a slightly better quantisation. So the bigger question is whether you want CPU or GPU inference.

If you have an NVidia GPU and can do GPU inference then generally it should be much faster than CPU inference, and most people will care about that a lot more than they will a 3% improvement on an artificial benchmark.

So as a general rule I would say: if you have an NVidia GPU and can do GPU inference, you should.

All that said: there are some GPTQ performance problems with WizardLM specifically, which I've not been able to resolve yet. They're not a big issue if you use Linux or WSL2 and can use the Triton branch of GPTQ-for-LLaMa. But if you need to use the CUDA branch - e.g. because you're on Windows - you may find you get only 1-4 tokens/s in GPU inference. And llama.cpp GGML can beat that on many CPUs.

So it's a bit complicated for this model! But for any other 7B model, GPU inference will win on speed.

Thank you for the detailed explanation. Didn't realize Q4_3 & 4_2 were doing the rounds in llama.cpp. These things move so fast. 2 weeks is all you need to go out of touch!
Just wondering if adding a BLAS flag would improve speed? I read in some issue that BLAS speedup is kind of baked into the new 4_3 & 4_2 methods. Or am I understanding it wrong?

If you can add BLAS then I would. There's certainly no harm.

I know that BLAS is used for perplexity calculations and I believe it's also used during inference for prompt evaluation. So it does help a bit with performance, especially on long prompts.

On Ubuntu, I do the following on a new system:

# Install OpenBLAS, then build llama.cpp with OpenBLAS support enabled
sudo apt update -y && sudo apt install -y libopenblas-dev libopenblas-base
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_OPENBLAS=1 make

Or if you have an NVidia GPU with CUDA installed, you can use cuBLAS instead with LLAMA_CUBLAS=1 make. That is significantly faster than OpenBLAS - for example, when calculating perplexity on a 65B model it was going to take 55 hours with OpenBLAS, but only 3.5 hours with cuBLAS!
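For example, a cuBLAS build and a perplexity run look roughly like this (assuming the CUDA toolkit is already installed; the model filename is a placeholder and wiki.test.raw is the usual wikitext-2 test file):

# Rebuild llama.cpp with cuBLAS instead of OpenBLAS
make clean && LLAMA_CUBLAS=1 make

# Perplexity calculation - this is where BLAS acceleration matters most
./perplexity -m ./models/wizardlm-7b.ggml.q5_1.bin -f wiki.test.raw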

Thank you for all the great models and for being so helpful to the "less enlightened" like me!

First of all, thank you for your work. Would it pay off to use 8bit/16bit GGML versions considering that it is a 7B?

Definitely not. Inference will be much slower, and the difference in theoretical accuracy between q5_1 and fp16 is so small that I can't see how it would be worth the speed penalty.

Here's the benchmark table from the llama.cpp README:

[Benchmark table image from the llama.cpp README, showing perplexity and ms/token for each quantisation format.]

For 7B, the difference in accuracy between q5_1 and fp16 is 0.006%! But the difference in speed is very significant. With 8 threads (-t 8), fp16 runs at 128ms/token (about 8 tokens/s) compared to 59ms/token for q5_1 (about 17 tokens/s).
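If you want to see the speed difference on your own hardware, a quick check is to generate a fixed number of tokens from each file and compare the per-token timings llama.cpp prints at the end (the model filenames here are placeholders):

# Generate 128 tokens with 8 threads from each model, then compare the
# "eval time ... ms per token" figures in the timing summary
./main -m ./models/wizardlm-7b.ggml.f16.bin -t 8 -n 128 -p "Hello"
./main -m ./models/wizardlm-7b.ggml.q5_1.bin -t 8 -n 128 -p "Hello"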

Do you have any inputs on the optimal number of threads? For 13B models I noticed inference speed seems to peak at 12 threads, and falls off after that. Haven't really tested with 30B or 65B models.
Also, any idea if it's possible to do initial prompt ingestion with the GPU (cuBLAS) and then switch back to CPU for inference? CPU with 12 threads is actually faster at inference than the GPU for me, but the GPU knocks it out of the park in terms of ingestion of long prompts.

The maximum you should try is your number of physical CPU cores. Hyperthreading doesn't seem to help much or at all.

I have an 18 core, 36 thread system and so I have been using -t 18.
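If you're not sure how many physical cores a Linux box has, lscpu shows it directly:

# Physical cores = Socket(s) x Core(s) per socket; ignore Thread(s) per core
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'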

However, yes, there do appear to be bottlenecks at higher thread counts. I've seen graphs that show a drop-off in performance above a certain number of threads, and an indication that the extra threads are basically just idling.

I just did a few quick tests on WizardLM q5_1.

Threads | Run 1 (ms/token) | Run 2 (ms/token) | Run 3 (ms/token)
18      | 101.8            | 102.1            | 102.1
16      | 85.4             | 85.0             | 86.8
12      | 92.8             | 93.5             | 93.5
10      | 107.5            | 106.0            | 107.3

(Lower figures are better)

So the maximum performance was with 16 threads. But I'd say 12 threads makes more sense if efficiency is any consideration, as the gain from 12 to 16 threads was small relative to the extra cores used.

And if we work out the performance per thread, we'd see it's far from scaling linearly. So if maximum efficiency were key, maybe even fewer threads would be best. Maybe that's why they only went up to 8 threads in the llama.cpp benchmark.

I don't know how this varies with model size and quantisation method. That might be worth some additional tests.
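If anyone wants to repeat this on their own machine, a simple sweep over -t values does the job (model filename is a placeholder; llama.cpp prints its timings to stderr, hence the redirect):

# Run a short generation at several thread counts and compare ms/token
for t in 8 10 12 16 18; do
  echo "=== $t threads ==="
  ./main -m ./models/wizardlm-7b.ggml.q5_1.bin -t $t -n 64 -p "Hello" 2>&1 | grep "eval time"
done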

Just wanted to say thank you for your very helpful responses. 😊

Glad to help!
