VRAM Estimates
Thanks so much for your reviews and merges!
Could you provide estimates of VRAM usage for the EXL2 quants at varying context sizes, e.g. 16K & 32K (or point me to the steps so I can calculate it myself, given the specific tokenizer and the size of the repo)?
I put that information on the EXL2 versions' model cards:
Max Context w/ 48 GB VRAM: (24 GB VRAM is not enough, even for 2.4bpw, use GGUF instead!)
- 2.4bpw: 32K (32768 tokens) w/ 8-bit cache, 21K (21504 tokens) w/o 8-bit cache
- 2.65bpw: 30K (30720 tokens) w/ 8-bit cache, 15K (15360 tokens) w/o 8-bit cache
- 3.0bpw: 12K (12288 tokens) w/ 8-bit cache, 6K (6144 tokens) w/o 8-bit cache
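If you want to work the numbers out yourself: the weights take roughly parameters × bpw / 8 bytes, and the KV cache grows linearly with context length (the 8-bit and 4-bit cache options roughly halve or quarter it). Below is a back-of-the-envelope sketch of that estimate in Python; it is an approximation, not what ExLlamaV2 actually allocates, and the layer count, KV-head count, and head dimension in the example are placeholder values, so read the real ones from the repo's config.json.

```python
# Rough, back-of-the-envelope VRAM estimate for an EXL2 quant.
# All example values below are placeholders -- take the real ones from the
# model repo's config.json (num_hidden_layers, num_key_value_heads,
# hidden_size / num_attention_heads) and the quant's advertised bpw.

def estimate_vram_gb(
    n_params_b: float,      # total parameters in billions
    bpw: float,             # bits per weight of the EXL2 quant (e.g. 2.4, 2.65, 3.0)
    n_layers: int,          # num_hidden_layers from config.json
    n_kv_heads: int,        # num_key_value_heads from config.json
    head_dim: int,          # hidden_size // num_attention_heads
    context_len: int,       # tokens of context you want to hold
    cache_bytes: float = 2, # 2 = FP16 cache, 1 = 8-bit cache, 0.5 = 4-bit cache
    overhead_gb: float = 2, # CUDA context, activations, fragmentation (rough guess)
) -> float:
    # Quantized weights: parameters * bits-per-weight / 8 bytes
    weights_gb = n_params_b * 1e9 * bpw / 8 / 1024**3
    # KV cache: K and V tensors per layer, each context_len * n_kv_heads * head_dim elements
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * cache_bytes / 1024**3
    return weights_gb + kv_cache_gb + overhead_gb

# Example: hypothetical 120B-class model at 2.4 bpw, 32K context, 8-bit cache
print(round(estimate_vram_gb(120, 2.4, n_layers=140, n_kv_heads=8,
                             head_dim=128, context_len=32768, cache_bytes=1), 1))
```

With these example values the estimate lands in the mid-40s of GB for 2.4bpw at 32K with an 8-bit cache, which is consistent with the 48 GB figure above; treat it as a ballpark and leave a couple of GB of headroom.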
At 5.0bpw with 4-bit cache and full context, I'm using 76.8 GB of VRAM and it's generating at 11-13 t/s. This is with an A100 80GB. Also, Wolfram, I absolutely love this model, thank you so much for making something this godly!
Thanks, guys, for all of this information. And now I want an A100, too! ;)
I'm happy with how it turned out, but I didn't do much besides merging, converting, and quantizing the already godly components others provided. But I'm glad you like it so much! :)
I recently bought a Mac M2 Ultra with 192 GB of RAM. Can the EXL2 versions run on a Mac?
Just curious: I have a dual 3090 setup and I can't run 3.0bpw on it at all, even with 8K context and 4-bit cache... Any tips on how I can get it to work?