Guidance on GPU VRAM Split?

#3 by nmitchko

Hi,

Thank you for the merge, this is a very cool model with nice performance.

I am currently on 2x A40. Is there an optimal VRAM split for best performance? I am getting pretty slow tokens/s, but I guess that's expected. Any tips would be appreciated.

Owner

I'm splitting 22,24 over 2x 3090s (48 GB VRAM total). I generally fill the second GPU completely and take as much as necessary from the first.
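
For loaders built on ExLlamaV2, the split is a list of gigabyte budgets per GPU (text-generation-webui exposes the same numbers as its `--gpu-split` option). A minimal sketch of the same idea with the exllamav2 Python API, with the model path as a placeholder:

```python
# Minimal sketch: loading an EXL2 quant with a manual GPU split via the
# exllamav2 Python API. The model path is a placeholder; the per-GPU
# figures are gigabytes and mirror the 22,24 split described above.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/wolfram_miquliz-120b-v2.0-3.0bpw-h6-exl2"  # placeholder
config.prepare()

model = ExLlamaV2(config)
model.load(gpu_split=[22, 24])  # ~22 GB on GPU 0, ~24 GB on GPU 1

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
```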

Are you talking about the quantized model or the full fp16 model? I assume the former?

Wolfram's got to be referring to a quantized version, perhaps one of the more aggressive quants?

To be clear, I am also running a quant, but at 6.0 bpw.

Owner

Yes, quantized. I run e.g. wolfram_miquliz-120b-v2.0-3.0bpw-h6-exl2 with a 22,24 GPU split for 4K context at 10 tokens/s.
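
For comparing against a figure like that, here is a rough way to time throughput, continuing the hypothetical exllamav2 sketch above (prompt, sampler settings, and token count are arbitrary; for the 4K context you would also set `config.max_seq_len = 4096` before loading):

```python
# Rough tokens/s measurement, reusing model, cache, and tokenizer
# from the loading sketch above. Settings here are arbitrary examples.
import time

from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()  # prime CUDA kernels so the timing reflects steady state

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

max_new_tokens = 256
start = time.time()
generator.generate_simple("Once upon a time,", settings, max_new_tokens)
elapsed = time.time() - start

# Approximate: assumes the full max_new_tokens were generated (no early EOS).
print(f"~{max_new_tokens / elapsed:.1f} tokens/s")
```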
