Mix for 96gb ram
How would you build a quant for 96gb rtx pro 6000? Would you offload more - or rebuild a quant?
Recently upgraded from 3090 to rtx pro 6000. I could build on my machine if that saves you some trouble.
You can use the existing quants for hybrid GPU+CPU inference with your 96GB VRAM. Just use more -ot ...=CUDA0 to load additional layers into VRAM, with the rest left on CPU/RAM.
I don't think it is possible to usefully shrink the full 671B weights to fit in under 96GB total RAM/VRAM, given that you would have to compress the original fp8 (8bpw) weights down to about 1bpw (1 bit per weight):
671B * (1/8) = 83.875GB
It would probably not have enough smarts left-over to be useful, and likely output gibberish, though it would run quickly on your GPU hah..
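For anyone who wants to redo that arithmetic for other sizes, here is a minimal sketch. It assumes size is simply parameters times bits per weight, and ignores KV cache, compute buffers, and per-tensor overhead, so real files will come out somewhat larger. The function names are made up for illustration.

```python
def quant_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes) at a given bpw."""
    return n_params * bits_per_weight / 8 / 1e9


def max_bpw_for_budget(n_params: float, budget_gb: float) -> float:
    """Largest average bpw whose weights alone fit in budget_gb."""
    return budget_gb * 1e9 * 8 / n_params


if __name__ == "__main__":
    n = 671e9  # total parameter count quoted in this thread
    print(f"8 bpw (fp8): {quant_size_gb(n, 8):.3f} GB")
    print(f"1 bpw:       {quant_size_gb(n, 1):.3f} GB")
    print(f"max bpw that fits in 96 GB: {max_bpw_for_budget(n, 96):.2f}")
```

Under these assumptions 96GB leaves room for only a little over 1 bit per weight, before any context is allocated.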
What kinda CPU/RAM do you have paired with that GPU?
Also, with that much fast VRAM you can look into https://github.com/turboderp-org/exllamav3, as the EXL3 quantizations there are quite high quality and good for pure GPU inference. You could easily run a ~70B dense model at ~6bpw without any noticeable quality loss, with enough context left over for concurrent/parallel inference or larger batches for more throughput.
My normal command:
llama-server -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf --ctx-size 32768 -mla 3 -fa -amb 2048 -fmoe --temp 0.3 --min-p 0.05 -ngl 99 --parallel 1 --threads 32 -b 2048 -ub 2048 --host 0.0.0.0 --port 7080 -ot exps=CPU
gives
INFO [ print_timings] prompt eval time = 103501.83 ms / 11567 tokens ( 8.95 ms per token, 111.76 tokens per second) | tid="139826191020032" timestamp=1750619344 id_slot=0 id_task=0 t_prompt_processing=103501.832 n_prompt_tokens_processed=11567 t_token=8.948027319097433 n_tokens_second=111.75647596266703
INFO [ print_timings] generation eval time = 164877.01 ms / 2076 runs ( 79.42 ms per token, 12.59 tokens per second) | tid="139826191020032" timestamp=1750619344 id_slot=0 id_task=0 t_token_generation=164877.014 n_decoded=2076 t_token=79.42052697495183 n_tokens_second=12.59120328319386
INFO [ print_timings] total time = 268378.85 ms | tid="139826191020032" timestamp=1750619344 id_slot=0 id_task=0 t_prompt_processing=103501.832 t_token_generation=164877.014 t_total=268378.846
and using 90gb:
llama-server -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf --ctx-size 32768 -mla 3 -fa -amb 2048 -fmoe --temp 0.3 --min-p 0.05 -ngl 99 --parallel 1 --threads 32 -b 2048 -ub 2048 --host 0.0.0.0 --port 7080 -ot "blk.([3-9]|1[0-2]).ffn_.*=CUDA0" -ot exps=CPU
INFO [ print_timings] prompt eval time = 89379.86 ms / 11567 tokens ( 7.73 ms per token, 129.41 tokens per second) | tid="140636983169024" timestamp=1750618958 id_slot=0 id_task=0 t_prompt_processing=89379.859 n_prompt_tokens_processed=11567 t_token=7.72714264718596 n_tokens_second=129.41394324643093
INFO [ print_timings] generation eval time = 152694.00 ms / 2086 runs ( 73.20 ms per token, 13.66 tokens per second) | tid="140636983169024" timestamp=1750618958 id_slot=0 id_task=0 t_token_generation=152693.996 n_decoded=2086 t_token=73.19942281879196 n_tokens_second=13.661309905073148
INFO [ print_timings] total time = 242073.86 ms | tid="140636983169024" timestamp=1750618958 id_slot=0 id_task=0 t_prompt_processing=89379.859 t_token_generation=152693.996 t_total=242073.855
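To sanity check which layers that pair of -ot patterns sends to the GPU, here is a rough Python sketch. It is not the actual llama-server logic. It assumes the patterns are applied as a regex search against tensor names, that the first matching override wins, that anything unmatched goes to the GPU because of -ngl 99, and that the routed-expert layers are 3 through 60 with names like blk.N.ffn_down_exps.weight. All of those are assumptions for illustration.

```python
import re

# Overrides in the order given on the command line above.
OVERRIDES = [
    (r"blk.([3-9]|1[0-2]).ffn_.*", "CUDA0"),
    (r"exps", "CPU"),
]


def place(tensor_name: str, default: str = "CUDA0") -> str:
    """Return the device of the first override whose regex matches."""
    for pattern, device in OVERRIDES:
        if re.search(pattern, tensor_name):
            return device
    return default  # assumed: -ngl 99 puts everything else on the GPU


def layers_on(device: str, moe_layers=range(3, 61)) -> list:
    """Layers whose (hypothetically named) expert tensors land on device."""
    return [
        i for i in moe_layers
        if place(f"blk.{i}.ffn_down_exps.weight") == device
    ]


if __name__ == "__main__":
    print("experts on CUDA0:", layers_on("CUDA0"))
    print("expert layers left on CPU:", len(layers_on("CPU")))
```

Under those assumptions the first pattern covers layers 3 to 12, and every later expert tensor falls through to exps=CPU.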
It does perform a little better. If there is not much to be gained beyond that, good to know! I have been using your deepseek-v3 quant as my go-to model for a while and wanted to make sure there weren't any huge gains to be had.
Thanks
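For reference, the difference between the two runs above can be worked out from the tokens-per-second figures in the logs. This is just a small sketch of that arithmetic. Each configuration was only run once, with slightly different generation lengths, so treat the percentages as rough.

```python
def speedup_pct(baseline: float, new: float) -> float:
    """Relative change of new over baseline, in percent."""
    return (new / baseline - 1.0) * 100.0


# Tokens per second copied from the print_timings lines above.
RUNS = {
    "exps=CPU only":       {"pp": 111.76, "tg": 12.59},
    "blk 3-12 on CUDA0":   {"pp": 129.41, "tg": 13.66},
}

if __name__ == "__main__":
    base = RUNS["exps=CPU only"]
    more = RUNS["blk 3-12 on CUDA0"]
    for phase in ("pp", "tg"):
        change = speedup_pct(base[phase], more[phase])
        print(f"{phase}: {base[phase]} -> {more[phase]} tok/s ({change:+.1f}%)")
```

That works out to roughly 16% faster prompt processing and roughly 8.5% faster token generation.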
If I read this correctly, you're getting about 13.6 tok/sec generation speed with the IQ4_K_R4, which is pretty good! Your bottleneck will be the activated weights remaining on CPU/RAM. For example, given 37B active parameters, if let's say 20B of those are not offloaded to GPU and you have 256GB/s of RAM bandwidth on a Threadripper Pro, that would be a theoretical max of ~13 tok/sec.
So you must be offloading more, or have a system with rather fast RAM bandwidth?
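The back-of-the-envelope ceiling above can be written out as a small sketch. It assumes token generation is purely RAM-bandwidth bound and that every CPU-resident active weight is read once per token. The ~13 tok/sec figure corresponds to about one byte (8 bits) per weight. The 4.5 bpw value below is only a hypothetical stand-in for a 4-bit-class quant, not a measured number.

```python
def max_tg_tok_per_sec(cpu_active_params: float,
                       bits_per_weight: float,
                       ram_bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec if each token must stream all
    CPU-resident active weights from RAM exactly once."""
    bytes_per_token = cpu_active_params * bits_per_weight / 8
    return ram_bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Numbers from the post: 20B active params on CPU, 256 GB/s RAM.
    print(f"8 bpw:   {max_tg_tok_per_sec(20e9, 8, 256):.1f} tok/s")
    # Hypothetical average bpw for a 4-bit-class quant (assumption).
    print(f"4.5 bpw: {max_tg_tok_per_sec(20e9, 4.5, 256):.1f} tok/s")
```

A lower average bpw on the CPU side raises the ceiling, which is one way a 4-bit-class quant could exceed 13 tok/sec on the same memory bandwidth.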
Two other discussions that may be of interest:
It is running with 768gb ddr5 5600. Seems I am hitting limits of RAM, which makes sense.
If I were to make a new quant using your example for IQ4_KS_R4, would it make sense to set all moe experts to IQ5_KS_R4, and perhaps put first 0-6 dense layers on GPU?
Wonder if I should just try Q8_0.
Will read through those links.
Thanks!
If I were to make a new quant using your example for IQ4_KS_R4, would it make sense to set all moe experts to IQ5_KS_R4, and perhaps put first 0-6 dense layers on GPU?
Make sense how? Is your goal to increase token generation speed? Or is your goal to maximize quality (low perplexity)?
The ffn_down_exps routed expert MoE layers are already IQ5_KS_R4 in this quant. The ffn_(gate|up)_exps are IQ4_KS_R4. This is a common trade-off strategy to preserve quality while increasing token generation speed.
If you wanted to optimize speed, I'd suggest using the non-_R4 quant types, as they recently got a big boost for PP in sufficiently large batches on CPU. The non-_R4 types are likely somewhat faster on CUDA as well, given they don't have to be un-repacked. Check out https://github.com/ikawrakow/ik_llama.cpp/pull/531#issuecomment-2978436076 for details.
So basically you could use my recipe for the IQ4_KS_R4 and make an IQ4_S_R4 variant and run with -ub 4096 -b 4096, and you might end up doing better given you have plenty of VRAM and RAM.
10+ t/s on TG is good. I mostly use it for complex coding stuff, so I'm trying to make it as smart, accurate, and high quality as possible. That way, after waiting for PP and generation, it doesn't make mistakes that less quantization would have prevented.
Thanks
Oh and if you want me to do any benchmarks for whatever reason let me know.
Okay, I think I understand: you want the best possible quality in terms of accuracy/smarts (low perplexity), even if it is a little slower. Check out https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/discussions/2 which has a marginally better perplexity score and weighs a little more than my largest quant despite not using imatrix. Might be interesting for your use case! They have some benchmarks on there too if you'd like to compare your system!
Did you manage to do this? If yes, can you share performance numbers?
Performance numbers in terms of Perplexity, KLD, PP (prompt processing/prefill) speed, TG (token generation) speed, on which exact quant and what kind of hardware?