VRAM Estimates
Thanks so much for your reviews and merges!
Could you provide estimates of VRAM usage for the EXL2 quants at varying context sizes, e.g. 16K & 32K (or point me to the steps so I can calculate it myself, given the specific tokenizer and the size of the repo)?
I put that information on the EXL2 versions' model cards:
Max Context w/ 48 GB VRAM: (24 GB VRAM is not enough, even for 2.4bpw, use GGUF instead!)
- 2.4bpw: 32K (32768 tokens) w/ 8-bit cache, 21K (21504 tokens) w/o 8-bit cache
- 2.65bpw: 30K (30720 tokens) w/ 8-bit cache, 15K (15360 tokens) w/o 8-bit cache
- 3.0bpw: 12K (12288 tokens) w/ 8-bit cache, 6K (6144 tokens) w/o 8-bit cache
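If you want to work the numbers out yourself: the weights take roughly parameters × bpw / 8 bytes, and the KV cache grows linearly with context length (the 8-bit and 4-bit cache options roughly halve or quarter it). Below is a back-of-the-envelope sketch of that estimate in Python; it is an approximation, not what ExLlamaV2 actually allocates, and the layer count, KV-head count, and head dimension in the example are placeholder values, so read the real ones from the repo's config.json.

```python
# Rough, back-of-the-envelope VRAM estimate for an EXL2 quant.
# All example values below are placeholders -- take the real ones from the
# model repo's config.json (num_hidden_layers, num_key_value_heads,
# hidden_size / num_attention_heads) and the quant's advertised bpw.

def estimate_vram_gb(
    n_params_b: float,      # total parameters in billions
    bpw: float,             # bits per weight of the EXL2 quant (e.g. 2.4, 2.65, 3.0)
    n_layers: int,          # num_hidden_layers from config.json
    n_kv_heads: int,        # num_key_value_heads from config.json
    head_dim: int,          # hidden_size // num_attention_heads
    context_len: int,       # tokens of context you want to hold
    cache_bytes: float = 2, # 2 = FP16 cache, 1 = 8-bit cache, 0.5 = 4-bit cache
    overhead_gb: float = 2, # CUDA context, activations, fragmentation (rough guess)
) -> float:
    # Quantized weights: parameters * bits-per-weight / 8 bytes
    weights_gb = n_params_b * 1e9 * bpw / 8 / 1024**3
    # KV cache: K and V tensors per layer, each context_len * n_kv_heads * head_dim elements
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * cache_bytes / 1024**3
    return weights_gb + kv_cache_gb + overhead_gb

# Example: hypothetical 120B-class model at 2.4 bpw, 32K context, 8-bit cache
print(round(estimate_vram_gb(120, 2.4, n_layers=140, n_kv_heads=8,
                             head_dim=128, context_len=32768, cache_bytes=1), 1))
```

With these example values the estimate lands in the mid-40s of GB for 2.4bpw at 32K with an 8-bit cache, which is consistent with the 48 GB figure above; treat it as a ballpark and leave a couple of GB of headroom.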
At 5.0bpw with 4-bit cache and full context, I'm using 76.8 GB of VRAM and it's generating at 11-13 t/s. This is with an A100 80GB. Also, Wolfram, I absolutely love this model, thank you so much for making something this godly!
Thanks, guys, for all of this information. And now I want an A100, too! ;)
I'm happy with how it turned out, but I didn't do much besides merging, converting, and quantizing the already godly components others provided. But I'm glad you like it so much! :)
I recently bought a Mac M2 Ultra with 192 GB of RAM. Can the EXL2 versions run on a Mac?
Just curious: I have a dual 3090 setup and I can't run 3.0bpw on it at all, even with 8K context and 4-bit cache... Any tips on how I can get it to work?