Minimum VRAM? #9
by hierholzer - opened

Hello,
When trying to load (main), I get error code 137, which typically means out of memory.
I have 48 GB of VRAM (2 × 3090 Ti with NVLink), as well as 256 GB of system RAM.
Can I configure this in some way to make it work with 48 GB of VRAM, or can I use system memory in addition to my VRAM with the GPTQ versions?

Here is the current configuration that I am using:
CurrentSetup.png

Thanks for your help

Yes, you can split it over the two GPUs. But you need to choose the ExLlama loader, not Transformers. Then in the "gpu-split" box enter "17.2,24" to put 17.2 GB on GPU1 and 24 GB on GPU2 (GPU1 also needs room for context, hence it loads less of the model).

Then save the settings and reload the model with them.

Expect performance to be in the region of 10-15 tokens/s.
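If you ever drive ExLlama from Python rather than the webui, the same split can be set through its config object. A minimal sketch based on the example scripts in the ExLlama repo (module paths and file locations are illustrative and may differ in your checkout):

```python
# Minimal sketch, assuming the turboderp/exllama repo layout (model.py at the repo root)
from model import ExLlamaConfig

config = ExLlamaConfig("/path/to/model/config.json")     # placeholder path
config.model_path = "/path/to/model/model.safetensors"   # placeholder path
config.set_auto_map("17.2,24")  # ~17.2 GB on GPU1, 24 GB on GPU2, same split as above
```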

Thanks for the help!
FYI: if anyone out there is having the same issues that I was, and the page layout on the Model tab looks the same as my screenshot above, then you need to update your text-generation-webui git repo.
Afterwards, your Model tab will look like this:
NewScreen.png

Thanks for all of the awesome work you have contributed to the community, TheBloke.

Cheers!

I would use max_seq_len = 4096 and maybe set compress_pos_emb to 2, because that's what the base model is fine-tuned for.

Curious if anyone uses ExLlama in a notebook; the use cases seem to be mostly in text-generation-webui.
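For what it's worth, ExLlama works fine outside the webui. Here is a rough notebook-style sketch based on the example scripts in the ExLlama repo (the flat imports assume you run from a checkout of the repo; treat the paths, split, and settings as placeholders):

```python
import os, glob

# Flat imports as in the ExLlama repo's example scripts (run from the repo root)
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/path/to/gptq-model"  # placeholder: directory with config.json, tokenizer.model, *.safetensors
tokenizer_path = os.path.join(model_dir, "tokenizer.model")
config_path = os.path.join(model_dir, "config.json")
model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

config = ExLlamaConfig(config_path)
config.model_path = model_path
config.max_seq_len = 4096        # as suggested above
config.compress_pos_emb = 2      # as suggested above
config.set_auto_map("17.2,24")   # same dual-GPU split as in the webui

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)      # context cache, sized up to max_seq_len
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Once upon a time,", max_new_tokens=128))
```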

@TheBloke Is there any general guidance on how to figure these split numbers out? Or is it based on field experience?

That specific number for 65B and 70B on ExLlama is taken from the README of the ExLlama GitHub repo, so it's based on the ExLlama developer's own testing: https://github.com/turboderp/exllama#dual-gpu-results

Generally speaking, the rule is:

  • The first GPU needs to leave some room for context; 6-7 GB is a common figure
  • The second GPU (and any further GPUs) can use their full capacity, i.e. 24 GB

This general rule applies to AutoGPTQ as well. ExLlama needs less VRAM for context than AutoGPTQ, but it still needs room left for it.
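As a rough illustration only (a hypothetical helper, not part of ExLlama or the webui), the rule of thumb above translates to something like:

```python
def suggest_gpu_split(gpu_sizes_gb, context_reserve_gb=6.5):
    """Hypothetical helper: reserve room for context on the first GPU,
    let the remaining GPUs use their full capacity."""
    first = max(gpu_sizes_gb[0] - context_reserve_gb, 0)
    return ",".join(f"{g:g}" for g in [first, *gpu_sizes_gb[1:]])

print(suggest_gpu_split([24, 24]))   # -> "17.5,24", close to the 17.2,24 used above
```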

One big difference between ExLlama and AutoGPTQ (and all other Transformers-based methods) is that ExLlama pre-allocates VRAM for context, up to whatever the configured maximum sequence length is. So the model either loads fine and will then work at the full context length, or it fails to load immediately with an "out of VRAM" CUDA error.

With AutoGPTQ and other Transformers-based code, only the model weights are loaded at load time, and VRAM for context is then used as needed. So with AutoGPTQ you might get the model loaded and get some responses, and then find it runs out of VRAM when the context goes over a certain threshold.
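To put a very rough number on that pre-allocation: the K/V cache grows linearly with the configured max sequence length. A back-of-the-envelope sketch (fp16 cache, ignoring ExLlama's temporary buffers, so real usage is somewhat higher; the 65B figures below are quoted from memory):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough size of the K and V tensors cached for a full-length context (fp16)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# LLaMA-65B-class model: 80 layers, 64 heads, head_dim 128, 2048 context
print(kv_cache_bytes(80, 64, 128, 2048) / 1024**3, "GiB")   # 5.0 GiB, before buffers
```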

Hi, I'm considering buying an NVLink bridge to link a couple of A5000s. Would you share your experience with NVLink? Does it make a difference? Does ExLlama support loading using it (assuming you don't have the same PCIe bandwidth on both 3090s, that is)?
Thanks!
