Could this model be loaded on a 3090 GPU?

#6
by Exterminant - opened

Can't load this model into my 3090; I wonder if someone has managed to do this? 13B models work well.
I've tried playing with the pre_layer parameter, but none of my attempts worked.
When I try to load the model, I get this error:
Traceback (most recent call last):
  File "C:\Users\konstantin\Desktop\oobabooga_windows\text-generation-webui\server.py", line 68, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\konstantin\Desktop\oobabooga_windows\text-generation-webui\modules\models.py", line 85, in load_model
    output = load_func(model_name)
  File "C:\Users\konstantin\Desktop\oobabooga_windows\text-generation-webui\modules\models.py", line 259, in GPTQ_loader
    model = load_quantized(model_name)
  File "C:\Users\konstantin\Desktop\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 175, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
  File "C:\Users\konstantin\Desktop\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 236, in load_quant
    model.load_state_dict(safe_load(checkpoint))
  File "C:\Users\konstantin\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
  size mismatch for model.layers.0.self_attn.k_proj.qzeros: copying a param with shape torch.Size([1, 832]) from checkpoint, the shape in current model is torch.Size([52, 832]).
  size mismatch for model.layers.0.self_attn.k_proj.scales: copying a param with shape torch.Size([1, 6656]) from checkpoint, the shape in current model is torch.Size([52, 6656]).
  size mismatch for model.layers.0.self_attn.o_proj.qzeros: copying a param with shape torch.Size([1, 832]) from checkpoint, the shape in current model is torch.Size([52, 832]).
  size mismatch for model.layers.0.self_attn.o_proj.scales: copying a param with shape torch.Size([1, 6656]) from checkpoint, the shape in current model is torch.Size([52, 6656]).
  size mismatch for model.layers.0.self_attn.q_proj.qzeros: copying a param with shape torch.Size([1, 832]) from checkpoint, the shape in current model is torch.Size([52, 832]).
  size mismatch for model.layers.0.self_attn.q_proj.scales: copying a param with shape torch.Size([1, 6656]) from checkpoint, the shape in current model is torch.Size([52, 6656]).
  size mismatch for

edit: After re-reading your error, double check that you actually downloaded the model fully by comparing the file size on your disk against what is listed on Hugging Face. I think I got a similar error initially because the model didn't fully download in one shot.

I'm not getting this error, but I'm also having trouble loading this, in my case on a 4090. I have GPTQ wbits set to 4, None for groupsize, llama for model type; pre_layer doesn't seem to do much. What seems to happen is that it tries to load the model fully into RAM and doesn't load it into the 4090's VRAM at all. Am I doing something wrong? It essentially runs out of memory and then says "press any key to continue" without any other messages. This is my first time trying to load a GPTQ model; I figured wbits 4 would leave plenty of memory space to load the 33B model.

Yeah, it can be loaded. That's an error with the GPTQ parameters.

Firstly, double check that the GPTQ parameters are set and saved for this model:

  • bits = 4
  • group_size = None
  • model_type = Llama

If they are, then you might be hitting a text-generation-webui bug. In that case, please edit models/config-user.yaml, find the entry for TheBloke_guanaco-33B-GPTQ, and see if groupsize is set to 128. If it is, set it to -1, then save the file and reload text-generation-webui.
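For reference, the relevant part of that entry would then look something like this (the other keys in your file may differ; only these fields matter here):

TheBloke_guanaco-33B-GPTQ$:
  groupsize: -1
  model_type: llama
  wbits: 4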

Try that and report back

edit: After re-reading your error, double check that you actually downloaded the model fully by comparing the file size on your disk against what is listed on Hugging Face. I think I got a similar error initially because the model didn't fully download in one shot.

I'm not getting this error, but I'm also having trouble loading this, in my case on a 4090. I have GPTQ wbits set to 4, None for groupsize, llama for model type; pre_layer doesn't seem to do much. What seems to happen is that it tries to load the model fully into RAM and doesn't load it into the 4090's VRAM at all. Am I doing something wrong? It essentially runs out of memory and then says "press any key to continue" without any other messages. This is my first time trying to load a GPTQ model; I figured wbits 4 would leave plenty of memory space to load the 33B model.

Yeah, this is a common problem. I'm not sure exactly why it happens, but basically you are right - it needs to load the model fully into RAM first, then it moves it to VRAM. It seems to be exclusive to Windows, and you may still get this issue even if you have plenty of RAM.

Fortunately the solution is simple: increase your Windows pagefile size, e.g. to around 90GB. This has solved the problem for others who have reported it.

That did the trick for my issue - groupsize was set to 'None', but explicitly setting it to -1 allowed the model to eventually offload onto the GPU VRAM. Thanks!

Thank you so much TheBloke, I did as you said and it now works for me!

Here's my config-user.yaml for this model, in case someone needs it:
TheBloke_guanaco-33B-GPTQ$:
  auto_devices: false
  bf16: false
  cpu: false
  cpu_memory: 0
  disk: false
  gpu_memory_0: 0
  groupsize: 128
  load_in_8bit: false
  mlock: false
  model_type: llama
  n_batch: 512
  n_gpu_layers: 0
  no_mmap: false
  pre_layer: 0
  threads: 0
  wbits: 4

Great!

I'm assuming this is obvious, but I'd like to state that none of these changes allow it to work on a 3080.

"oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 844, in _apply
    self._buffers[key] = fn(buf)
  File "oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 10.00 GiB total capacity; 9.16 GiB already allocated; 0 bytes free; 9.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I find that GPTQ is slower than GGML.

I can load a q5_0 into my 3090 and I get 6-7 t/s, while with GPTQ only 4-5 t/s.

I'm assuming this is obvious, but I'd like to state that none of these changes allow it to work on a 3080.
Yeah, I'm afraid that's correct - 30B models need 24GB VRAM to fully load a GPTQ model, so 12GB VRAM is not enough.

You can use offloading to load half the model in RAM instead. But that will be really slow, much slower than using a GGML and offloading as many layers as possible to the GPU.
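As a rough sketch of what that looks like on the command line (flag names as they were on mid-2023 builds of text-generation-webui, so check python server.py --help on your version):

python server.py --model TheBloke_guanaco-33B-GPTQ --wbits 4 --groupsize -1 --model_type llama --pre_layer 30

would keep roughly the first 30 GPTQ layers on the GPU and run the rest on the CPU, while something like

python server.py --model <your GGML model folder> --n-gpu-layers 40

offloads as many llama.cpp layers as fit onto the GPU. The layer counts here are just placeholders - tune them to whatever fits in your VRAM.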

I can load a q5_0 into my 3090 and I get 6-7 t/s, while with GPTQ only 4-5 t/s.

Wow that's really interesting. When I tested this a week or so ago, I found GPTQ was still around twice the performance of GGML, as long as it was possible to load the full model into VRAM.

But I don't think I tested with 30B. And also llama.cpp has had a bunch of further improvements since then.

One factor is CPU single-core speed. PyTorch inference (i.e. GPTQ) is single-core bottlenecked, so if you have a lot of cores but a low maximum clock speed, that bottlenecks GPU inference. Whereas llama.cpp is multi-threaded and might not be bottlenecked in the same way.

I will have to test again!

I'm assuming this is obvious, but I'd like to state that none of these changes allow it to work on a 3080.
Yeah, I'm afraid that's correct - 30B models need 24GB VRAM to fully load a GPTQ model, so 12GB VRAM is not enough.

You can use offloading to load half the model in RAM instead. But that will be really slow, much slower than using a GGML and offloading as many layers as possible to the GPU.

This is interesting. I mean, I use 128GB of RAM and a 5950X CPU, so it'd be nice one day to be able to use my hardware on a model like this.

@MachineLURN

I think with your 3080 and that CPU/RAM setup you should be getting decent performance, based on what TheBloke said about the single-threaded performance of GPTQ and looking at these charts: https://www.cpubenchmark.net/singleThread.html

That OOM error is something I had in the past too, and IIRC the issue was with the Python environment and package versions, not the hardware setup.

I had the same error, and by changing groupsize to -1 I finally managed to load the model.
What is groupsize -1?
I was running oobabooga in WSL2 Linux on Windows 10.

I'm able to run it on a 4090, so it should run on a 3090 since they have the same amount of VRAM. Performance is good on a 4090: typically around 9 tokens/second.

I have a 4090... I'm using the same settings in my config as Exterminant's above, but it still won't load for me. I also tried groupsize -1, but then it just fails to load the model at all (no error, nothing). I get this with the config settings above:

[image redacted by choice]

groupsize -1 is correct

When you say it won't load with that and there's no error - does it just say "Done" and then close?

If so, you need to increase your Windows pagefile size. Try setting it to 90GB. A lot of people are having this issue on Windows. Even if you have plenty of RAM, you still need a lot of pagefile just to get the model loaded into VRAM.

OK, for those of you who run WSL2 (I'm also running it): you have to adjust its config to give it more resources.

Check if you have: C:\Users\<username>\.wslconfig

This is what mine looks like (I have 256GB RAM and 24 cores as dual 12-core CPUs, so plenty left for Windows):

[wsl2]
memory=128GB
swap=0
localhostForwarding=true
processors=16

EDIT: by the way, I only use 8-9 cores for GGML models because, from my testing (in my environment), beyond that there is no significant improvement.
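(For reference, in text-generation-webui the thread count maps to the threads setting - the threads: field in config-user.yaml shown earlier, or the --threads launch flag on mid-2023 builds, e.g. --threads 8.)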

I find that GPTQ is slower than GGML.

I can load a q5_0 into my 3090 and I get 6-7 t/s, while with GPTQ only 4-5 t/s.

I'm running a 3090 and a Ryzen 7 5800X, all at default clocks/settings, and I'm getting nearly 9 t/s on GPTQ.
Main oobabooga/text-generation-webui at commit 28198bc15cc7065b8f4a594f6799ad1be39a209c.
Edit: I am on Ubuntu, so if you are in WSL2 that might explain the gap.

Ryzen 7 probably has better single core performance than my Xeon E5v3s so that would be another reason why you get more t/s.

I'm going to set up a native Linux drive soon and try it out; I just need to source another SSD. :)

Ryzen 7 probably has better single core performance than my Xeon E5v3s so that would be another reason why you get more t/s.

I'm going to set up a native Linux drive soon and try it out; I just need to source another SSD. :)

Well, I'm getting exactly half of what I get on 13B models (wizard-vicuna etc.), which for Guanaco 33B means 12-15 t/s on a single 3090 using GPTQ + the monkey patch, on an i5 13500 with some cheap B760 mobo. I tested Guanaco a couple of days ago on a RunPod with 2 x 3090 and was getting the same results as I get locally on a single GPU, so I was quite shocked. It could be due to the monkey patch + the triton version of GPTQ-for-LLaMa (not the one-shot installer of text-generation-webui!!). Otherwise, you could consider moving to Linux for your project.

I don't think that 2x3090 will perform faster than a single 3090, because inference only happens on a single GPU.

Who knows, maybe there is some setting that could be enabled in the driver to allow full NVLink on these consumer cards. But why would NVIDIA do that, when it's not good for business...

I've read that with 2 cards they both hold the same data in memory; unfortunately I don't have a way to verify it, but it seems logical. On the other hand, I've heard that it's possible to combine graphics cards with a roughly 20% memory loss so that they hold different data in memory. For example, is it possible to divide the 65B version into 3 parts and distribute them among 3 x RTX 4060 16GB, for a total capacity of 48GB VRAM?
Is it even possible to divide the data like that, or to develop software that transfers computation results between the cards without involving RAM and works solely on the GPUs?

One can imagine that 10 x RTX 4060 16GB would provide 160GB of VRAM - with a 20% loss, that would give 128GB of VRAM for large models and consume only about 1200-1500 watts of power. It would be slower but cheaper than an H100. Sorry if my question is pointless :)

I think you can split the model across multiple cards by specifying --auto-devices in text-generation-webui, or, when using Python, by setting device_map='auto' when loading the model, but inference will still only run on one GPU at a time.
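A minimal sketch of the device_map='auto' route using the transformers library, assuming a checkpoint that transformers can load directly (a GPTQ file like this one normally goes through the webui's GPTQ loader or AutoGPTQ instead; the model path and memory limits below are just placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# hypothetical local path to a compatible checkpoint
model_path = "models/some-33b-model"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# device_map="auto" lets accelerate spread the layers across the visible GPUs
# (spilling to CPU RAM if needed); max_memory caps what each device may receive
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "60GiB"},
)

Even when the layers are split like this, they execute sequentially, so only one GPU is busy at any moment - which is why two 3090s don't give a speedup over one.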

I see there is also some kind of bug in text-gen-webui at the moment regarding multiple devices, which requires an equal amount of RAM: https://github.com/oobabooga/text-generation-webui/issues/2543

Hey TheBloke!
Thanks for sharing and following up!
I'm not having any issues with the 13B models you shared; I just followed the instructions by aitrepreneur and it went smoothly.

But now I'm trying to get 30B+ models working, and every time (same for gpt4xalpaca 30B, for example) I get this CUDA out of memory error:

Tried to allocate 58.00 MiB (GPU 0; 16.00 GiB total capacity; 15.09 GiB already allocated; 0 bytes free; 15.28 GiB reserved in total by PyTorch)

I'm on an RTX 3080, but I have 64GB of RAM and 32GB of GPU memory... I just don't understand why it doesn't find the resources for that.

Since I'm on Windows, I also tried your "paging settings", but it didn't change anything. I'm not good at all this, but one thing I'm thinking is that since paging sounds like it uses disk space, maybe I should free some space to make it more efficient?...

Any help would be appreciated!!

A 30B model is too big for a 16GB card. You need a 24GB card to fully load a 30B GPTQ model.

It is technically possible to load it. If you're using text-generation-webui, you can do that by setting the "GPU Memory" slider to around 10GB. That will load 10GB of the model onto the GPU and leave about 6GB for context; the rest will go to RAM. It will be slow as hell though.
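(If you launch from the command line instead of the UI, the equivalent is roughly the --gpu-memory flag, e.g. python server.py --model <the model folder> --wbits 4 --groupsize -1 --model_type llama --gpu-memory 10 - flag names as of mid-2023 builds, so check python server.py --help on yours.)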

If you really want to use a 30B model, I recommend you use a GGML version instead. GGML with CUDA acceleration performs much better when you don't have enough VRAM to fully load the model.
