Very slow speed with oobabooga

#2 opened by CyberTimon

Hello! Just downloaded this model. I started the webui with "python3 server.py --model wizardlm-7b-4bit-128g --gpu-memory 12 --wbits 4 --groupsize 128 --model_type llama --listen-host 0.0.0.0 --listen --extensions api --xformers --listen-port 21129"

But I get only 2-3 tokens/s. With Vicuna 1.1 13B 4-bit I get 30 tokens per second (I use the same command).
I have an RTX 3060 12GB.

GPU util is 100% with both models, and I have the oobabooga GPTQ fork installed.

Same here.

I'm making a new model, created with ooba's old fork of GPTQ-for-LLaMa. I'll ping you when it's ready for testing.

@CyberTimon and @ElvisM can you please try new file: https://huggingface.co/TheBloke/wizardLM-7B-GPTQ/blob/main/wizardLM-7B-GPTQ-4bit-128g.ooba.no-act-order.safetensors

Make sure it's the only .safetensors file in your existing wizardLM-7B-GPTQ directory and then run text-generation-webui as before

Let me know how the speed is.
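For reference, swapping the new file in could look roughly like this (a minimal sketch; the models/ path is an assumption, and note the /resolve/ URL for a direct download rather than the /blob/ page):

```
cd text-generation-webui/models/wizardLM-7B-GPTQ
mkdir -p backup && mv *.safetensors backup/ 2>/dev/null   # set aside any existing .safetensors so the new one is the only one
wget https://huggingface.co/TheBloke/wizardLM-7B-GPTQ/resolve/main/wizardLM-7B-GPTQ-4bit-128g.ooba.no-act-order.safetensors
```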

Thank you, I'll test it!

> @CyberTimon and @ElvisM can you please try new file: https://huggingface.co/TheBloke/wizardLM-7B-GPTQ/blob/main/wizardLM-7B-GPTQ-4bit-128g.ooba.no-act-order.safetensors
>
> Make sure it's the only .safetensors file in your existing wizardLM-7B-GPTQ directory and then run text-generation-webui as before
>
> Let me know how the speed is.

Still no luck here for me. That's really odd because all the other 7b 4bit models work fine for me.

Output generated in 8.12 seconds (1.35 tokens/s, 11 tokens, context 39, seed 1913998095)
Output generated in 11.47 seconds (0.96 tokens/s, 11 tokens, context 66, seed 627847145)

Can I ask - those "other 7b 4bit" models, are any of them safetensors files? Or are they all .pt files?

OK I'm trying one last thing - making a .pt file instead of safetensors. It'll be uploaded in a minute and I'll ping you when ready.

Thank you!
llava-13b-4bit-128g is also safetensors. But yes, every other one is .pt

I converted a model like this: python llama.py ./chimera-7b c4 --wbits 4 --true-sequential --groupsize 128 --save chimera7b-4bit-128g.pt

> @CyberTimon and @ElvisM can you please try new file: https://huggingface.co/TheBloke/wizardLM-7B-GPTQ/blob/main/wizardLM-7B-GPTQ-4bit-128g.ooba.no-act-order.safetensors
>
> Make sure it's the only .safetensors file in your existing wizardLM-7B-GPTQ directory and then run text-generation-webui as before
>
> Let me know how the speed is.

Yeah still slow for me too

@CyberTimon @ElvisM OK please try this file: https://huggingface.co/TheBloke/wizardLM-7B-GPTQ/blob/main/wizardLM-7B-GPTQ-4bit-128g.ooba.no-act-order.pt

As before, make sure it's the only model file in the wizardLM-7B-GPTQ directory

Ahh, it's still extremely slow. Btw, I need to have the configs and tokenizer model etc. in the directory, otherwise oobabooga gives an error.

Yeah you need all the json files and tokenizer.model, I just meant don't have any other .pt or .safetensors files.

In which case, sorry, I'm out of ideas. I don't know what else to try, or why it would be so slow for you guys but fine for me on Linux. I don't know of any reason why that would vary by model.

As a sanity check: have you recently run one of the other 7B models, and confirmed it's still running at a normal speed?

Yes I have - other 7B and 13B models. All work very fast. No idea why this one causes so much trouble.

I'll quantize it myself. If it still has issues, then I think it has something to do with the model deltas provided by the WizardLM authors.

Yeah this makes no sense to me. Here is me doing inference on the .pt file I just uploaded, from the command line using ooba's fork of GPTQ-for-LLaMa:

root@c14ca38d7f61:~/ooba-gptq-llama# time CUDA_VISIBLE_DEVICES=0 python llama_inference.py --wbits 4 --groupsize 128 --load /workspace/wizard-gptq/wizardLM-7B-GPTQ-4bit-128g.ooba.no-act-order.pt --text "Llamas are " --min_length 100 --max_length 250 --temperature 0.7 /workspace/wizard-gptq
Loading model ...
Done.
<s> Llamas are 36-inch tall creatures that are native to South America. They are known for their long, luxuriant coats, which can be woven into cloth.
Llamas are social animals that live in groups called herds. They are grazers and spend most of their days eating grass. Llamas have a keen sense of smell, which they use to detect predators.
Llamas are domesticated animals that have been used for their wool for thousands of years. They are often raised for their meat as well.
Llamas are known for their calm, gentle nature and can be trained as pack animals. They are often used in therapeutic programs for people with disabilities.</s>

real	0m28.305s
user	0m27.235s
sys	0m14.359s

This should be the exact same code you're using. And as you can see, it's fine. I don't know exactly what that is in tokens/s, but it's 111 words, so that's at least 5 tokens/s, and probably more given that the time above includes loading the model, not just inference.
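For a rough sense of that estimate (the tokens-per-word ratio below is an assumption, not measured):

```
# Back-of-the-envelope tokens/s for the run above.
# Assumption: roughly 1.3 tokens per English word; 111 words over 28.3 s wall time.
python3 -c "print(round(111 * 1.3 / 28.3, 1))"   # ~5.1 tokens/s, including model load time
```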

Can you try the same test from the command line / PowerShell?

I'm also making one more .pt file. I had a problem with the CUDA kernel on my ooba GPTQ-for-LLaMa. I don't think it affected the quantisation (as you see it infers fine above), but I'm re-doing the file just to be sure.

I somehow can't quantize it myself. It uses more than 12GB of VRAM. I successfully quantized other 7B LLaMA models, so more and more I think this model has some problems.

> Can I ask - those "other 7b 4bit" models, are any of them safetensors files? Or are they all .pt files?
>
> OK I'm trying one last thing - making a .pt file instead of safetensors. It'll be uploaded in a minute and I'll ping you when ready.

Other than the WizardLM model, I have two models using the safetensors format and two in .pt. They all still work fine (I just tested them). Maybe it's something to do with GPTQ, since text-generation-webui still isn't using the version that allows Vicuna in safetensors to work properly (it outputs gibberish). I don't know why oobabooga still hasn't updated the installer so that it uses the latest GPTQ.

> I somehow can't quantize it myself. It uses more than 12GB of VRAM. I successfully quantized other 7B LLaMA models, so more and more I think this model has some problems.

Yeah, it needs a tiny bit over 12GB - 12347 MiB is the highest I've seen so far (about 12.06 GiB).

This is while re-running GPTQ to a .pt file with ooba-gptq-llama:

timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/04/26 13:00:02.777, NVIDIA GeForce RTX 4090, 00000000:C1:00.0, 525.78.01, P2, 4, 4, 70, 100 %, 56 %, 24564 MiB, 12489 MiB, 11727 MiB
2023/04/26 13:00:07.779, NVIDIA GeForce RTX 4090, 00000000:C1:00.0, 525.78.01, P2, 4, 4, 44, 14 %, 0 %, 24564 MiB, 12489 MiB, 11727 MiB
2023/04/26 13:00:12.780, NVIDIA GeForce RTX 4090, 00000000:C1:00.0, 525.78.01, P2, 4, 4, 69, 100 %, 55 %, 24564 MiB, 13169 MiB, 11047 MiB
2023/04/26 13:00:17.781, NVIDIA GeForce RTX 4090, 00000000:C1:00.0, 525.78.01, P2, 4, 4, 46, 13 %, 1 %, 24564 MiB, 14131 MiB, 10085 MiB
2023/04/26 13:00:22.783, NVIDIA GeForce RTX 4090, 00000000:C1:00.0, 525.78.01, P2, 4, 4, 70, 100 %, 59 %, 24564 MiB, 13295 MiB, 10921 MiB
2023/04/26 13:00:27.784, NVIDIA GeForce RTX 4090, 00000000:C1:00.0, 525.78.01, P2, 4, 4, 70, 100 %, 57 %, 24564 MiB, 12845 MiB, 11371 MiB

I don't think there's anything to suggest that there's anything "wrong" with the model. It's just different in some way.

It's working perfectly fine (and doing very well for a 7B) in HF, GGML and GPTQ formats for me. There's just something unusual/different causing it not to work for you guys as a GPTQ on Windows.

Loading model ...
Done.
<s> Llamas are 3-4 feet tall at the shoulder and can weigh up to 200 pounds. They have long, curved necks and short, sturdy legs. They have a coat of fur that can be brown, black, or white, with the males being larger than the females. Llamas are social animals and live in groups called herds. They communicate with each other through a variety of sounds, including bleats, bellows, and growls. Llamas are primarily grazers and can eat up to 60 pounds of vegetation per day. They have a digestive system similar to humans, with a cecum (a pouch at the beginning of the large intestine) that helps break down plant material. Llamas are native to the Andean Mountains of South America and were domesticated by the Incas over 5,000 years ago. Today, they are primarily raised for their wool, which is soft and warm, and is used in clothing, insulation, and other products.</s>

It didn't print the execution time, but it was something like 3-4 minutes for this text with your .pt file.

> as a GPTQ on Windows.

I'm on Linux 20.04 btw


Oh! Well then I'm even more confused

Are you using the latest Triton GPTQ-for-LLaMa code? i.e. git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa inside text-generation-webui/repositories?

No, I'm using the oobabooga fork, as the Triton one was slower than oobabooga's for me. Also, I'm on 22.04, not 20.04 as I wrote before.

With LLaMA/Vicuna 7B 4-bit I get an incredibly fast 41 tokens/s on an RTX 3060 12GB.
Output generated in 8.27 seconds (41.23 tokens/s, 341 tokens, context 10, seed 928579911)

> No, I'm using the oobabooga fork, as the Triton one was slower than oobabooga's for me. Also, I'm on 22.04, not 20.04 as I wrote before.

Please try the Triton branch and let me know:

cd text-generation-webui/repositories
mv GPTQ-for-LLaMa ../ooba-gptq-llama
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa

If it's still slow then I suppose this must be a GPU-specific issue, and not OS/installation-specific as I thought.
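If it turns out worse, switching back to ooba's fork is just a matter of reversing the steps above (a sketch):

```
cd text-generation-webui/repositories
rm -rf GPTQ-for-LLaMa                    # remove the Triton checkout
mv ../ooba-gptq-llama GPTQ-for-LLaMa     # restore ooba's fork
```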

Still slow, and every other model is now also just 10 tokens/s instead of 40 tokens/s, so I'll stay with ooba's fork.

Yeah OK I see what you mean now. Vicuna 7B for example is way faster and has significantly lower GPU usage %.

CUDA ooba GPTQ-for-LLaMa - Vicuna 7B no-act-order.pt:
[screenshot]
Output generated in 33.70 seconds (15.16 tokens/s, 511 tokens, context 44, seed 1738265307)

CUDA ooba GPTQ-for-LLaMa - WizardLM 7B no-act-order.pt:
[screenshot]
Output generated in 113.65 seconds (4.50 tokens/s, 511 tokens, context 44, seed 2135854730)

Triton QwopQwop GPTQ-for-LLaMa - WizardLM 7B no-act-order.pt:
[screenshot]
Output generated in 46.00 seconds (11.11 tokens/s, 511 tokens, context 44, seed 532347497)

VRAM usage is pretty similar. 5243 MiB for Vicuna, 5663 MiB for WizardLM.

This is really weird. I'm not sure what it could possibly be. I might raise an issue with the GPTQ devs and see if they have thoughts.

(As an aside, I'm amazed you're getting 40 token/s on a 3060 when I'm peaking at 15 token/s on a 4090! I guess the cloud system I'm using must be bottlenecked on something else.)

Good to know that you also have performance issues. Yes, I think it would be good if the GPTQ devs were informed.

> (As an aside, I'm amazed you're getting 40 token/s on a 3060 when I'm peaking at 15 token/s on a 4090! I guess the cloud system I'm using must be bottlenecked on something else.)

Yes, it's very nice but also a bit strange. dolly-3b (yes, just the 3B model) in full fp16 precision gives only 14 tokens/s, while LLaMA 7B 4-bit gives 41 tokens/s and 13B 4-bit gives 25 tokens/s.
Have you tried oobabooga's GPTQ fork? Only with that fork do I get 41 tokens/s.
I don't have a display connected to the GPU, btw.

> Have you tried oobabooga's GPTQ fork? Only with that fork do I get 41 tokens/s.
> I don't have a display connected to the GPU, btw.

Yeah those figures I described above as 'CUDA ooba GPTQ-for-LLaMa' are using ooba's fork, on a 4090 24GB. And this is a cloud system so it wouldn't have a display connected either.

It is possible it's CPU bottlenecked or something like that. It's a cloud pod so the host is shared amongst multiple users. The GPU is dedicated to me but I think it's possible that CPU and RAM performance can be impacted by other users.

I will do some testing on another system and see what I find there.

Okay, nice! I have an i5-13600KF btw.

Just to add my two cents: I have the same slowness problem on my 3060. I run Windows.

I was able to get it to run faster on my system. In the config.json file, I set "use_cache" to true. Went from 1.8 tokens/s to 11.5 tokens/s.

Running a 2080 Ti on Windows.
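For anyone else hitting this, a minimal sketch of that fix (the models/ path is an assumption; adjust it to wherever the model lives):

```
# Set "use_cache": true in the model's config.json.
cd text-generation-webui/models/wizardLM-7B-GPTQ
python3 -c "import json; c=json.load(open('config.json')); c['use_cache']=True; json.dump(c, open('config.json','w'), indent=2)"
```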

> I was able to get it to run faster on my system. In the config.json file, I set "use_cache" to true. Went from 1.8 tokens/s to 11.5 tokens/s.
>
> Running a 2080 Ti on Windows.

Setting "use_cache" to true fixed it for me, nothing else worked.

I'm running a 1080 Ti on windows.

> I was able to get it to run faster on my system. In the config.json file, I set "use_cache" to true. Went from 1.8 tokens/s to 11.5 tokens/s.
>
> Running a 2080 Ti on Windows.

Worked for me too! Thanks!

This was the solution! Thanks so much to @IonizedTexasMan for figuring this out

I've updated config.json in the repo and am making a note in the README for anyone who previously downloaded it.

> With LLaMA/Vicuna 7B 4-bit I get an incredibly fast 41 tokens/s on an RTX 3060 12GB.
> Output generated in 8.27 seconds (41.23 tokens/s, 341 tokens, context 10, seed 928579911)

This is incredibly fast; I never achieved anything above 15 tokens/s on a 3080 Ti. I tested with Vicuna 7B also. Would you mind detailing your setup? Did you get WizardLM 7B running at this speed?

Thank you! Back on my system: Output generated in 2.24 seconds (41.48 tokens/s, 93 tokens, context 3, seed 1232882170)

@BazsiBazsi, I run it on Ubuntu 22.04. I use --xformers and run it with the original oobabooga GPTQ branch. The latest one is slow for me.

Hi @CyberTimon
I have tried a lot of permutations and combinations, with multiple 4-bit 7B models, config.json params, and switching back and forth between the latest and ooba's fork of GPTQ, but I can't get more than 12-15 tokens/s. I even scrapped everything and did it all from scratch. No luck, max 12-15 tokens/s. If you are getting 3-4x better performance, may I request you to write a step-by-step guide with exact versions and commands to replicate this performance?

I'm on Ubuntu 22.04, with an Intel i7-11370H, 16GB RAM and an RTX 3060 6GB laptop.
Thanks!

Yeah, I must say I am still extremely confused as to how CyberTimon gets 40 tokens/s on a 3060 when I get max 16 tokens/s on a 4090, and others I've spoken to are in the same ballpark.

What have you done to unlock this power!? :)

My GPU has superpowers!
No, for real - no idea. Here is how I set up everything:

  1. Built my PC (used as a headless server) with 2x RTX 3060 12GB (one running Stable Diffusion, the other one oobabooga)
  2. Installed a clean Ubuntu 22.04
  3. Installed the latest Linux NVIDIA drivers (perhaps this is the fix?)
  4. Downloaded JupyterLab, as this is how I control the server
  5. Installed the latest oobabooga and the quant CUDA kernel (to get GPTQ working)
  6. Downloaded any LLaMA-based 7B or 13B model (without act-order but with groupsize 128)
  7. Opened text-generation-webui from my laptop, started with --xformers and --gpu-memory 12
  8. Profit (40 tokens/s with a 7B and 25 tokens/s with a 13B model)

Can you name your setup? Are you running WSL or native?

In addition I have an i5-13600KF and 64GB DDR4 RAM. The model is on a 1TB Kingston SSD.

When I'm back home I can also make a video and share my cuda / driver versions.
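A very rough command sketch of steps 5-7 above, assuming the manual install path and the oobabooga GPTQ-for-LLaMa fork (repo layout, paths and flags are assumptions based on commands mentioned elsewhere in this thread, not CyberTimon's exact setup):

```
# Step 5: install text-generation-webui and the CUDA GPTQ kernel (assumed manual install)
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
mkdir -p repositories && cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa && python setup_cuda.py install && cd ../..

# Step 6: put a 4-bit, groupsize-128, no-act-order model under models/<model-name>

# Step 7: launch with the flags mentioned above
python server.py --model <model-name> --wbits 4 --groupsize 128 --model_type llama \
  --xformers --gpu-memory 12 --listen
```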

> My GPU has superpowers!
> No, for real - no idea. Here is how I set up everything:
>
>   1. Built my PC (used as a headless server) with 2x RTX 3060 12GB (one running Stable Diffusion, the other one oobabooga)
>   2. Installed a clean Ubuntu 22.04
>   3. Installed the latest Linux NVIDIA drivers (perhaps this is the fix?)
>   4. Downloaded JupyterLab, as this is how I control the server
>   5. Installed the latest oobabooga and the quant CUDA kernel (to get GPTQ working)
>   6. Downloaded any LLaMA-based 7B or 13B model (without act-order but with groupsize 128)
>   7. Opened text-generation-webui from my laptop, started with --xformers and --gpu-memory 12
>   8. Profit (40 tokens/s with a 7B and 25 tokens/s with a 13B model)
>
> Can you name your setup? Are you running WSL or native?
>
> In addition I have an i5-13600KF and 64GB DDR4 RAM. The model is on a 1TB Kingston SSD.

Hey, it's a bit late but I figured it out. You've got to launch ooba with --autogptq. I'm getting above 20 tokens/s with a 3080 Ti. Ran on Linux.
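A minimal sketch of that launch, assuming the model directory name used earlier in the thread:

```
# Launches the webui with the AutoGPTQ loader, as suggested above.
python3 server.py --model wizardLM-7B-GPTQ --autogptq --xformers --listen
```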
