What's the difference?

#3 - opened by ehartford

What's the difference between guanaco-33b-merged and guanaco-33b?

Also, thanks for setting a precedent; henceforth I'll be calling it 33b instead of 30b too.

guanaco-33b = LoRA only
guanaco-33b-merged = full model

?
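
(In case it helps: "merged" just means the LoRA weights have been folded into the base model and saved as a regular checkpoint. Below is a rough sketch of how that's typically done with peft - the repo IDs are assumptions for illustration, not necessarily what was used.)

# Sketch only: merging a LoRA adapter into the base weights to produce a
# standalone "merged" model. Repo IDs below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-30b"        # full-precision base weights (assumed)
adapter_id = "timdettmers/guanaco-33b"  # LoRA adapter only (assumed)

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)

# Fold the LoRA weights into the base model, then save a normal HF checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("guanaco-33b-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("guanaco-33b-merged")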

ok thank you :D

These merged weights are the size of the full-precision weights?

I thought QLoRA was done against the 4-bit weights, or did I misunderstand?

Please fill my empty brain~

edit: ah, I see... the fine-tuning is actually done against the unquantized HF model.

The model seems very promising, but I can't seem to get good inference speed compared to other GPTQ 4-bit quantized models. I'm running this model on a 4090 and I'm getting only a few tokens per second following these instructions:
https://github.com/artidoro/qlora#quantization
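
(For reference, those instructions boil down to roughly the following - just a sketch, and the repo IDs are illustrative assumptions:)

# Rough sketch of 4-bit (NF4) loading with the LoRA adapter applied on top.
# Repo IDs are illustrative, not prescriptive.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",            # base weights (assumed)
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "timdettmers/guanaco-33b")  # adapter (assumed)
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-30b")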

When I run the same prompt on the HF Spaces, I get much, much faster inference (probably 20-40 tokens per second):
https://huggingface.co/spaces/uwnlp/guanaco-playground-tgi

Any suggestions for speeding things up?

Thanks!

@LoneStriker

Maybe give TheBloke/guanaco-33B-GPTQ or TheBloke/guanaco-33B-GGML a try instead?

I would not expect my 3090, or your 4090, to ever match HF Spaces inference speed as they most likely run better hardware.

@mancub thanks for the reference, I somehow missed TheBloke's release of this model (I use his other quantized models predominantly). The inference speed of TheBloke's quantized model is as fast as or faster than the HF Spaces demo, so problem solved! The consumer cards are usually as fast as or faster than the enterprise versions; they just don't have the same VRAM capacity. That way, NVIDIA can charge enterprise customers a lot more money and keep them from using much cheaper consumer cards. Tim's QLoRA library, though, will let us lowly consumers fine-tune and run much larger models now, so we will have the best of both worlds.

I don't know, maybe it's my hardware setup then... but I can maybe get 5-6 t/s if I'm lucky on my 3090. I've got a boatload of RAM as well as a dual Xeon v3 system, and I load the entire model into VRAM.

I find that GGML performs much better here than GPTQ, though (6-7 t/s vs. 4-5 t/s, respectively).

When I chatted with @TheBloke on his Discord, he indicated that performance depends primarily on single-thread CPU performance along with GPU speed. I'm running on a 13900K CPU.

Yeah, unfortunately single-core CPU performance is a bottleneck for PyTorch / GPTQ inference. At least when running one prompt at a time, as most of us do.

For example I was getting 28 tokens/s with a 7B model on a 4090 GPU on an AMD EPYC 24-core server CPU. Then I tried the same model on a 4090 on an i9-13900K, and got 98 tokens/s!

The CPUs that I know that perform similarly well are:

  • Intel i9-13900K
  • Intel i7-13700K
  • Ryzen 7950X and 7900X

If you google "single core CPU benchmark" you'll find that all these CPUs are right at the top. And unfortunately that's currently very important.

Maybe in the future there will be a way to do multithreading with PyTorch models, even for single prompts.

So yeah, it's definitely worth testing GGML + GPU acceleration to see how it performs in comparison.

But is the issue here CPU/PCI bandwidth, clock frequency, or something else?

It seems like it's all about brute force, if anything. There's no sophistication of any kind.

What would be a benefit of running multiple prompts at the same time, and how could that even be done with a single GPU?

I have two 3090s, though I'm using only one atm. I guess NVLinking them would make no difference with inference, and all I might get is more VRAM (48 GB). And with QLoRA that does not matter as much now, does it?

Python is single-core by default because of the GIL. At its heart, it's an inherent Python limitation. So, the best CPU in this case is one that can run a single, non-multithreaded application as fast as possible.

You can run different prompts on different cores, but they would have to be scheduled serially on the GPU, so there's limited value there. NVLinking two 3090s just gives you a fast interconnect between the GPUs, but they are still two distinct GPUs. You don't get 48 GB of VRAM for free without doing additional work to split your model across GPUs, as far as I'm aware. A real 48 GB VRAM GPU can run bigger models much faster than 2x 24 GB VRAM GPUs. I can't run the 65B 4-bit GPTQ Guanaco model, for example, on my 2x 4090s at any reasonable speed (I only get 1-2 tokens/second). If you want to run inference or train the 65B model using QLoRA, you'll need a real 48 GB VRAM GPU.
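
(If you did want to experiment with splitting a big model across both cards anyway, the usual route is an automatic device map via accelerate - a sketch with an assumed model ID, and as noted above, don't expect it to be fast:)

# Sketch: letting accelerate shard a model across all visible GPUs.
# The model ID is an assumption; requires `accelerate` to be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "timdettmers/guanaco-33b-merged"  # assumed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",   # spreads layers over all visible GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inputs go to the first GPU; accelerate's hooks move activations between cards
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))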

Regarding multiple prompts: that's something you might do if you were processing data in bulk. For example, if you wanted an LLM to summarise or write replies to 1,000 emails, or summarise articles for you, or whatever. That sort of thing. It's not relevant to the average user who wants to do ChatGPT-style chatting.

Here's some example code that makes use of that:

# Note: this is a method on a wrapper class that already provides self.model,
# self.tokenizer, self.generation_config, self.device, self.encode/decode, etc.
# It assumes `from transformers import pipeline, logging` and `import torch`.
def pipeline(self, prompts, batch_size=1):
    if not self.pipe:
        # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
        logging.set_verbosity(logging.CRITICAL)
        # `pipeline` here resolves to transformers.pipeline, not this method
        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            generation_config=self.generation_config,
            device=self.device
        )
    self.update_seed()
    answers = []
    with self.do_timing(True) as timing:
        with torch.no_grad():
            # TODO: batch_size >1 causes gibberish output, investigate
            output = self.pipe(prompts, return_tensors=True, batch_size=batch_size)

        for index, gen in enumerate(output):
            tokens = gen[0]['generated_token_ids']
            input_ids, len_input_ids = self.encode(prompts[index])
            len_reply = len(tokens) + 1 - len_input_ids
            response = self.decode(tokens)
            reply_tokens = tokens[-len_reply:]
            reply = self.tokenizer.decode(reply_tokens)

            result = {
                'response': response,   # The response in full, including prompt
                'reply': reply,         # Just the reply, no prompt
                'len_reply': len_reply, # The length of the reply tokens
                'seed': self.seed,      # The seed used to generate this response
                'time': timing['time']  # The time in seconds to generate the response
            }
            answers.append(result)

    return answers

With batch_size set to X (where X is >1), and with prompts being a List of multiple prompts, this will process them X prompts at a time. This enables using 100% of the GPU in situations where a single prompt would use only a fraction of that.

There are some complexities though. If you run that code as-is with a bunch of varied prompts, you will likely find that the outputs are partially gibberish. At least that's what I found.

In order for it to work properly, the prompts have to be padded to all be the same length. I never got as far as writing code to do that, but there are examples in the Hugging Face docs and elsewhere.
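
(For anyone who wants to try, this is roughly the kind of padding setup the HF docs describe for batched generation with a decoder-only model - an untested sketch, not taken from the code above; the model ID is an assumption:)

# Untested sketch: padding a batch of prompts so they can be generated together.
# Decoder-only models want left padding, and LLaMA has no pad token by default.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-30b"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["Write a haiku about GPUs.", "Explain LoRA in one sentence."]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda:0")

with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=64,
                         pad_token_id=tokenizer.pad_token_id)

for seq in out:
    print(tokenizer.decode(seq, skip_special_tokens=True))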

If that's done correctly the result should be much faster performance, despite the single-core performance limit.

But again, it doesn't really help for the average use case of just wanting to infer one prompt at a time.

Bah humbug... well, an A6000 is a bit out of reach for me. :)

Sounds like QLoRA will be the only saving grace for single consumer GPU setups, coupled with the latest-generation CPUs and a fair amount of RAM.

As an aside, what should my CPU/GPU usage be during inference?

Right now, with both GPTQ and GGML models, I see <7% usage on either CPU or GPU. Is that normal?

@LoneStriker

By the way, have you seen this: https://github.com/ggerganov/llama.cpp/pull/1607 - might be worth testing on your dual 4090.

CPU is generally pegged at 100% on at least one core for GPTQ inference. I don't usually use GGML, as it's slower than GPTQ models by a factor of 2x when using the GPU. On my box with an Intel 13900K CPU, the 4090 runs at 100%. On my box with an AMD 3700X, the 3090 only gets to 60-75% GPU, so I'm bottlenecked on the AMD box by the slower CPU.

Thanks for the reference. I think llama.cpp will be a serious contender in the future, but it currently can't beat GPTQ if your model fits in VRAM. It's faster when the model doesn't fit, but inference on a 65B model with llama.cpp is still too slow to be useful, by an order of magnitude.

I guess I'm seriously bottlenecked by the Xeon E5 v3 then, but llama.cpp is getting better and better. GPTQ might not have a future for me in that case, as it won't be worthwhile for me to upgrade to a latest-gen CPU. It's hard to tell when so many things are in motion at the moment and new improvements are coming out every day (QLoRA, for example).

Or maybe what's bottlenecking me even more is Windows and WSL.

Yeah, WSL is definitely bottlenecking you, I'm afraid, based on what I've heard from several people using it.

I'm having no problems with WSL, besides that Windows only gives it half the system RAM.

@ehartford

Have you increased the RAM available to WSL via the .wslconfig file? I wrote about that in another post here on HF - by creating C:\Users\[username]\.wslconfig you can give it more RAM, more cores, etc. if needed, but remember that the WSL service needs a full restart for the changes to take effect.

Here's mine for example:

[wsl2]
memory=128GB
swap=0
localhostForwarding=true
processors=16

EDIT: had to edit the post a couple of times because backslashes were missing. My WSL2 is on Windows 10; maybe it works better in Windows 11?

On the question about llama.cpp: I tried https://github.com/SciSharp/LLamaSharp. Under Visual Studio with C# it works fine with GPU, and it also uses all CPU cores. It was possible to use it in Unity, although some workarounds were required, since Unity uses .NET 4.2 and LLamaSharp targets .NET 6.0.
This is much more convenient for development than tying an application to an HTTP server in Python. Unfortunately, LLamaSharp does not yet support all the models that interest me, but the approach itself is interesting.
