I can't replicate your speed. Nowhere close.

#1
by 1TBGPU4EVR - opened

3090 on WSL2 - i9-12900K, 128 GB RAM, CUDA 12.1 - am I doing something wrong, stupid?

On the HF non-quantized model I get about 15 t/s.

(exllama) root@beaut:/mnt/g/exllama2/exllamav2# python test_inference.py -m /mnt/g/llama.cpp/models/Codellama-34B-instruct-exl2-6bpw/ -p "once upon a time"
-- Model: /mnt/g/llama.cpp/models/Codellama-34B-instruct-exl2-6bpw/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...

once upon a time, the whole family went to the circus. My dad was thrilled when they were giving away balloons. I still remember him coming back with one that said “BALLOONS FOR EVERYONE”. It’s such a simple idea, but it has always been my goal to be able to give out things like that to people.
When my dad and I would go on hikes together, he used to say to me, “I want you to take over for me after I die.” That may sound silly to some of you, but it is important to me to live up to

-- Response generated in 177.84 seconds, 128 tokens, 0.72 tokens/second (includes prompt eval.)

Doesn't look like you're doing anything stupid, no. But I'm a little confused as to how you can run the non-quantized model on a 3090 at all, let alone at 15 tokens/second...?

But as for the speed with the quantized model, I can only suspect your NVIDIA driver. Recent driver versions start swapping VRAM to system RAM when it gets close to running out, which prevents out-of-memory errors but absolutely tanks performance. That seems especially likely here, since the 6.0bpw model weights alone are larger than the 3090's VRAM, so without that fallback the model shouldn't even load.
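As a rough sanity check (a back-of-the-envelope Python sketch, not anything from the exllamav2 codebase; the parameter count and bits-per-weight are approximations), you can see why a 6.0bpw 34B quant is already at the edge of a 24 GiB card before the KV cache and CUDA context are counted:

# Back-of-the-envelope VRAM check (illustrative numbers, not exact).
params = 34e9            # approximate parameter count of a 34B model
bpw = 6.0                # nominal bits per weight of the 6.0bpw EXL2 quant
vram_gib = 24.0          # RTX 3090
weight_gib = params * bpw / 8 / 1024**3
print(f"quantized weights: ~{weight_gib:.1f} GiB of {vram_gib:.0f} GiB VRAM")
# Prints roughly 23.7 GiB. Add the KV cache, activations and the CUDA
# context on top and a single 3090 spills over, and newer drivers silently
# page into system RAM instead of raising an OOM error.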

If you mean you have multiple 3090s, you need to define a GPU split with the -gs argument, e.g. -gs 17,24 or something along those lines.
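Something like this (same model path as in your log; the split values are roughly per-GPU VRAM in GB and will need tuning for your setup):

python test_inference.py -m /mnt/g/llama.cpp/models/Codellama-34B-instruct-exl2-6bpw/ -p "once upon a time" -gs 17,24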

I had my own skeletons to deal with in that env getting CUDA squared away, since I'd tried the non-v2 exllama first, which wasn't smart. Latest cuDNN and CUDA 12.2, single 3090, 5 t/s FWIW, no OOM.
I have weird problems with the GPU split when I hard-set it; it feels like it's either auto-devices or it falls over early.
Anyway, one person's experience.

BTW I was using my 4090 on the HF model. I'm spoiled from a public HF Space that ran the 34B on an A100 for a while. Easily on par with or better than GPT-4 at coding.

1TBGPU4EVR changed discussion status to closed
