Minimum VRAM? #9
by hierholzer - opened

Hello,
When trying to load (main), I get error code 137, which typically means out of memory.
I have 48 GB of VRAM (2 × 3090 Ti with NVLink), as well as 256 GB of system RAM.
Can I configure this in some way to make it work with 48 GB of VRAM, or can I use system memory in addition to my VRAM with the GPTQ versions?

Here is the current configuration that I am using:
CurrentSetup.png

Thanks for your help

Yes, you can split it over the two GPUs. But you need to choose the ExLlama loader, not Transformers. Then in the "gpu-split" box enter "17.2,24" to put 17.2 GB on GPU1 and 24 GB on GPU2 (GPU1 also needs room for context, hence it loads less of the model).

Then save the settings and reload the model with them.

Expect performance to be in the region of 10-15 tokens/s.
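If you ever drive ExLlama from Python rather than the webui, the same split can be set through its config object. A minimal sketch based on the example scripts in the ExLlama repo (module paths and file locations are illustrative and may differ in your checkout):

```python
# Minimal sketch, assuming the turboderp/exllama repo layout (model.py at the repo root)
from model import ExLlamaConfig

config = ExLlamaConfig("/path/to/model/config.json")     # placeholder path
config.model_path = "/path/to/model/model.safetensors"   # placeholder path
config.set_auto_map("17.2,24")  # ~17.2 GB on GPU1, 24 GB on GPU2, same split as above
```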

Thanks for the help!
FYI: if anyone out there is having the same issues that I was, and the page layout on the Model tab looks the same as my screenshot above, then you need to update your text-generation-webui git repo.
Afterwards, your Model tab will look like this:
NewScreen.png

Thanks for all of the awesome work you have contributed to the community, TheBloke.

Cheers!

I would use max_seq_len = 4096 and maybe set compress_pos_emb to 2, because that's what the base model is fine-tuned for.

Curious if anyone uses ExLlama in a notebook; the use cases seem to be mostly in text-generation-webui.
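For what it's worth, ExLlama works fine outside the webui. Here is a rough notebook-style sketch based on the example scripts in the ExLlama repo (the flat imports assume you run from a checkout of the repo; treat the paths, split, and settings as placeholders):

```python
import os, glob

# Flat imports as in the ExLlama repo's example scripts (run from the repo root)
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/path/to/gptq-model"  # placeholder: directory with config.json, tokenizer.model, *.safetensors
tokenizer_path = os.path.join(model_dir, "tokenizer.model")
config_path = os.path.join(model_dir, "config.json")
model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

config = ExLlamaConfig(config_path)
config.model_path = model_path
config.max_seq_len = 4096        # as suggested above
config.compress_pos_emb = 2      # as suggested above
config.set_auto_map("17.2,24")   # same dual-GPU split as in the webui

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)      # context cache, sized up to max_seq_len
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Once upon a time,", max_new_tokens=128))
```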

@TheBloke Is there any general guidance on how to figure these split numbers out? Or is it based on field experience?

That specific number for 65B and 70B on ExLlama is taken from the README of the ExLlama GitHub repo, so it's based on the ExLlama developer's own testing: https://github.com/turboderp/exllama#dual-gpu-results

Generally speaking, the rule is:

  • The first GPU needs to leave some room for context; 6-7 GB is a common figure
  • The second GPU (and any further GPUs) can use their full capacity, i.e. 24 GB

This general rule applies to AutoGPTQ as well. ExLlama needs less VRAM for context than AutoGPTQ, but it still needs room left for it.
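As a rough illustration only (a hypothetical helper, not part of ExLlama or the webui), the rule of thumb above translates to something like:

```python
def suggest_gpu_split(gpu_sizes_gb, context_reserve_gb=6.5):
    """Hypothetical helper: reserve room for context on the first GPU,
    let the remaining GPUs use their full capacity."""
    first = max(gpu_sizes_gb[0] - context_reserve_gb, 0)
    return ",".join(f"{g:g}" for g in [first, *gpu_sizes_gb[1:]])

print(suggest_gpu_split([24, 24]))   # -> "17.5,24", close to the 17.2,24 used above
```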

One big difference between ExLlama and AutoGPTQ (and all other Transformers-based methods) is that ExLlama pre-allocates VRAM for context, up to whatever the configured maximum sequence length is. So the model either loads fine and will then work at the full context length, or it fails to load immediately with an "out of VRAM" CUDA error.

With AutoGPTQ and other Transformers-based code, only the model weights are loaded at load time, and VRAM for context is then used as needed. So with AutoGPTQ you might get the model loaded and get some responses, and then find it runs out of VRAM when the context goes over a certain threshold.
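To put a very rough number on that pre-allocation: the K/V cache grows linearly with the configured max sequence length. A back-of-the-envelope sketch (fp16 cache, ignoring ExLlama's temporary buffers, so real usage is somewhat higher; the 65B figures below are quoted from memory):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough size of the K and V tensors cached for a full-length context (fp16)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# LLaMA-65B-class model: 80 layers, 64 heads, head_dim 128, 2048 context
print(kv_cache_bytes(80, 64, 128, 2048) / 1024**3, "GiB")   # 5.0 GiB, before buffers
```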

Hi, I'm considering buying an NVLink bridge to link a couple of A5000s. Would you share your experience with NVLink? Does it make a difference? Does ExLlama support loading using it (assuming you don't have the same PCIe bandwidth on both 3090s, that is)?
Thanks!
