What's the VRAM usage?

#2
by CamiloMM - opened

I'd like to know what quant I can run, if any, on 24GB, and if so at what context.

I heard people saying the cache of Command-R eats up a lot of memory on GGUF, but dunno if that applies to exl2?
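For a rough sense of it: the cache cost is a property of the model architecture, not the file format, so it should hit exl2 the same way it hits GGUF (minus whatever cache quantization saves). A back-of-envelope sketch, assuming the commonly cited Command-R v01 shape (40 layers, 64 KV heads of dim 128, i.e. no GQA) — treat these numbers as assumptions, not measurements:

```python
# Rough KV-cache size estimate for Command-R (assumed config: 40 layers,
# 64 KV heads x head_dim 128, no GQA -- verify against the model's config.json).
layers, kv_heads, head_dim = 40, 64, 128

def kv_cache_gib(ctx_tokens: int, bytes_per_elem: float) -> float:
    # 2x for keys and values, per layer, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_tokens / 1024**3

print(f"fp16, 10k ctx: {kv_cache_gib(10_240, 2.0):.1f} GiB")  # ~12.5 GiB
print(f"~q4,  10k ctx: {kv_cache_gib(10_240, 0.5):.1f} GiB")  # ~3.1 GiB (ignoring scale overhead)
```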

3.75 bpw proved to be too large for me on a 3090, but I'm running Windows and use my GPU for my monitor. I was able to get it to load at 500 tokens, so it DOES technically fit. A bit more headroom, e.g. from running Linux, might make it usable at that bpw.

That said, I'm giving a 3.5 bpw quant available elsewhere a try and hoping that will be the sweet spot for someone like me.
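For a rough sense of the weight budget alone (assuming ~35B parameters for Command-R; real exl2 files run a bit larger since some tensors are kept at higher precision):

```python
# Approximate weight footprint of an exl2 quant (assumed 35B params;
# actual files are slightly larger because some layers stay at higher bpw).
params = 35e9

for bpw in (4.0, 3.75, 3.5, 3.0, 2.6):
    gib = params * bpw / 8 / 1024**3
    print(f"{bpw:.2f} bpw -> ~{gib:.1f} GiB weights")
```

At 3.75 bpw that's roughly 15 GiB of weights before the cache, activation buffers, and the desktop take their share of 24 GB.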

Since I favor context, I'm using 2.6 bpw with 10k tokens. Couldn't go further than that on a 3090.

Isn't that too quantized? To the point where maybe an 8x7B with 32k tokens would be much more coherent? And are you using 4-bit cache?

Yes, 4-bit. I believe you are right. I've been trying this model since yesterday and the outputs are not very consistent, to the point that older models become better options.
I'm kinda new to this world. I still don't understand everything.
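For anyone wanting to reproduce the 4-bit cache setting outside a UI, here's a minimal sketch with the exllamav2 Python API (the model path is a placeholder, and exact class names may differ across exllamav2 versions):

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,  # 4-bit quantized KV cache
    ExLlamaV2Tokenizer,
)

config = ExLlamaV2Config()
config.model_dir = "/models/c4ai-command-r-v01-exl2-3.0bpw"  # placeholder path
config.prepare()
config.max_seq_len = 10240  # trade context for VRAM headroom

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)  # roughly 4x smaller than the fp16 cache
model.load_autosplit(cache)                  # fills available GPUs in order

tokenizer = ExLlamaV2Tokenizer(config)
```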

On 3x3060:

  • 4.0 bpw: 24576 ctx, 7-7-12 split, 4-bit cache, peak VRAM usage 11.3 / 11.3 / 11.5 GB
  • 3.75 bpw: 27000 ctx, 6.4-6.7-12 split, 4-bit cache, peak VRAM usage 11.5 / 11.5 / 11.5 GB
  • 3.0 bpw: 32768 ctx, 5.5-5.5-12 split, 4-bit cache, peak VRAM usage 11.3 / 11.2 / 11.4 GB

It works better with the ExLlamav2 loader than with ExLlamav2_HF: faster and more accurate.
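For reference, those splits correspond to a manual gpu_split at load time rather than autosplit; a sketch, assuming the same exllamav2 API as above (values are per-GPU GiB budgets):

```python
# Manual split across 3x3060, matching the "7-7-12" layout above.
# Reserving less on the first GPUs leaves room for the cache and buffers.
model = ExLlamaV2(config)
model.load(gpu_split=[7.0, 7.0, 12.0])
cache = ExLlamaV2Cache_Q4(model)  # allocate the cache after a non-lazy load
```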

I know this is an old thread, but in case anyone sees this: I was able to fit 10752 context (Q4 cache) with the 3.0 bpw quant.
Windows 11, 3090, display running on the GPU, and it was only using 22.4 GB VRAM. No reason to use the 2.6 bpw quant on a 3090 imo.
