How much VRAM + RAM does a 30B model need? I have a 3060 12GB + 32GB RAM.

Opened by DaveScream

It needs 24GB of VRAM to load entirely on the GPU. You can try using text-generation-webui's pre_layer feature to load some layers on the GPU and some on the CPU. Try pre_layer 30 as a starting figure.

I can't remember if pre_layer is the number of layers on the CPU or the number on the GPU. I think it's the number of layers on the GPU, so if you get out-of-memory errors with 30, try decreasing it to 20.
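
For reference, a sketch of what that looks like on the command line, using the same flags discussed later in this thread (the model directory name is just a placeholder):

# Hypothetical example: 4-bit GPTQ load with 30 layers offloaded to the GPU
python server.py --model <your-GPTQ-model-dir> --wbits 4 --groupsize -1 --model_type LLaMA --pre_layer 30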

The HF fp16 version requires about 63GB of VRAM. The GPTQ 4-bit, group size 128 version needs about 25GB; the GPTQ 4-bit, group size 1024 version just fits on a 24GB card, but ooba has trouble dealing with a 1024 group size.
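
(Rough arithmetic behind that fp16 figure: LLaMA 30B is about 32.5 billion parameters, and at 2 bytes per parameter in fp16 that is roughly 65GB for the weights alone, before activations and context.)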

The 30B and above versions of LLaMA are pretty unapproachable for commodity devices at this moment.

This version uses no group size, so it will definitely fit in 24GB. I stopped doing 1024 because 30B will OOM with long responses. Group size "none" is reliable in 24GB, though.

I have a 3090 and I'm still getting a memory error when loading it up.

Then that must be something else. It loads fine on a 24GB 4090 for me, testing with the ooba GPTQ-for-LLaMA CUDA fork.

Loading the model uses around 18GB of VRAM, and this grows as the response comes back; at the maximum of 2000 tokens it uses 24203 MiB, leaving 13 MiB free :)

Output generated in 249.28 seconds (8.02 tokens/s, 1999 tokens, context 42, seed 953298877)

timestamp, name, driver_version, pcie.link.gen.max, pcie.link.gen.current, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/22 19:40:04.420, NVIDIA GeForce RTX 4090, 525.105.17, 4, 4, 20 %, 16 %, 24564 MiB, 13 MiB, 24203 MiB
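
For anyone who wants to capture the same kind of log, a query along these lines should produce that CSV format (a sketch; adjust the fields and interval as needed):

# Log GPU utilisation and memory to CSV once per second
nvidia-smi --query-gpu=timestamp,name,driver_version,pcie.link.gen.max,pcie.link.gen.current,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1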

I think their problem is system RAM. For me and my 4090, it first loads entirely into system RAM, which maxed out my 32GB, then it shifts to the GPU. So if they don't have enough system RAM, I don't think it even tries to send it to the GPU.

Ah yes, that could be it. You generally need at least as much RAM as you have VRAM.

If using ooba, you need a lot of RAM just to load the model (or pagefile if you don't have enough RAM); for 65B models I need something like 140+GB between RAM and pagefile.

Interesting, I have 32GB of RAM (31.7 usable).
Here is the error stack:
Traceback (most recent call last):
  File "E:\oobabooga_windows\text-generation-webui\server.py", line 67, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\oobabooga_windows\text-generation-webui\modules\models.py", line 159, in load_model
    model = load_quantized(model_name)
  File "E:\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 178, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "E:\oobabooga_windows\text-generation-webui\modules\GPTQ_loader.py", line 52, in _load_quant
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=shared.args.trust_remote_code)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 411, in from_config
    return model_class._from_config(config, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1146, in _from_config
    model = cls(config, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in __init__
    self.model = LlamaModel(config)
  File "E:\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 21, in __init__
    super().__init__(config)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in __init__
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in <listcomp>
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 256, in __init__
    self.mlp = LlamaMLP(
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 151, in __init__
    self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 238551040 bytes.

@AllInn that's because there isn't enough RAM. Try increasing your pagefile size; probably 80GB total between RAM + pagefile.

These 30B models can take over 64GB of your system RAM, which is why you need that extra pagefile/swap space. Does everyone just kill their X server and plug into the motherboard's HDMI to keep their card's VRAM free? Any tricks would be welcome :)
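
On the Linux side, a quick way to add that extra swap is something like the following (64G is just an example size; scale it to the model you're loading):

# Create and enable a 64GB swap file
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile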

Is there any existing framework that allows offloading even for GPTQ models? In principle, this should be doable.

Yes, check my first response in this thread: pre_layer in GPTQ-for-LLaMa supports offloading. This is supported in the text-generation-webui UI.

Set pre_layer to the number of layers to put on the GPU. There are 60 layers in total in this model. So e.g. on a 16GB card, you could try --pre_layer 35 to put 35 layers on the GPU and the rest on the CPU. It will be really slow, though. If you don't have enough VRAM to fully load the model, I recommend trying a GGML model instead and loading as many layers onto the GPU as possible, e.g. with -ngl 50 to put 50 layers on the GPU (which fits in 16GB of VRAM).
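
As a sketch, a llama.cpp run along those lines might look like this (the GGML filename is a placeholder, and you need a build compiled with GPU support):

# Offload 50 layers to the GPU and generate up to 256 tokens
./main -m ./models/wizardlm-30b.ggmlv3.q4_0.bin -ngl 50 -n 256 -p "Your prompt here"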

With GPTQ, the GPU needs enough VRAM to fit both the model and the context. With GGML and llama.cpp, GPU offloading stores the model but not the context, so you can fit more layers in a given amount of VRAM.

Generally GPTQ is faster than GGML if you have enough VRAM to fully load the model. But if you don't, GGML is now faster, and it can be much faster. E.g. testing this 30B model yesterday on a 16GB A4000 GPU, I got less than 1 token/s with --pre_layer 38 but 4.5 tokens/s with GGML and llama.cpp with -ngl 50.

Regarding multi-GPU with GPTQ:

In recent versions of text-generation-webui you can also use pre_layer for multi-GPU splitting, e.g. --pre_layer 30 30 to put 30 layers on each of two GPUs.
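
A concrete (hypothetical) invocation for that two-GPU case might be:

# Split the model's 60 layers evenly across two GPUs
python server.py --model <your-GPTQ-model-dir> --wbits 4 --groupsize -1 --model_type LLaMA --pre_layer 30 30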

Regarding VRAM capacity, remember that if your card is also the primary display device, it will not have the full 24GB available for the model, because some portion (~300-500MB) will be used by the OS for display.

Obviously, if you run this headless, or with 2 video cards, then there should be no issues.

This model takes up about 18GB of VRAM on my 3090. I have auto-devices disabled in Ooba. It fits comfortably on the GPU with some room to spare. System RAM has nothing to do with it (I have 32GB of that).
If you're getting OOM errors on a 24GB card, you're probably running some other GPU-intensive program at the same time; otherwise I have no explanation.

What settings are you using to load the model? I have the same rig as you and it keeps crashing.

With the latest text-gen-webui you really don't have to do anything; AutoGPTQ is used automatically, and unless you specify --triton it'll default to CUDA.

So probably:

python server.py --wbits 4 --groupsize -1 --model_type LLaMA --model

Pretty much, but I don't even specify groupsize; then I made sure that auto-devices is unflagged in the UI.
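
So the command reduces to something like this (a sketch; substitute your own model directory name):

# No groupsize specified; auto-devices unticked in the UI
python server.py --wbits 4 --model_type LLaMA --model <your-GPTQ-model-dir>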

I tried to follow the suggestions you made, but am not sure what I'm still doing wrong. I encounter this error every time:

INFO:Loading TheBloke_WizardLM-30B-Uncensored-GPTQ...
INFO:The AutoGPTQ params are: {'model_basename': 'WizardLM-30B-Uncensored-GPTQ-4bit.act-order', 'device': 'cuda:0', 'use_triton': False, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None}
WARNING:The safetensors archive passed at models\TheBloke_WizardLM-30B-Uncensored-GPTQ\WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.
Press any key to continue . .

Just adding another data point RE: not enough system RAM

I had a similar issue with my setup, where I have more than enough VRAM but wasn't able to load the model because text-gen-webui kept running out of system memory (RAM). For me, I just had to increase my virtual memory (swap if you're on Linux), and that fixed things. Also, watching the RAM and VRAM usage while the model loads, I observed that it first loads the model (or more likely a part of it) into RAM (and swap, because there wasn't enough RAM) and then loads it into VRAM.
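
If you want to watch that loading pattern yourself, two terminals running something like the following make the RAM-then-VRAM behaviour easy to see (Linux commands shown):

# System RAM and swap usage, refreshed every second
watch -n 1 free -h
# GPU VRAM usage, refreshed every second
watch -n 1 nvidia-smi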

Which is the better option for LLM models: 2x A6000 or 2x 3090 in SLI?
