Custom Quants for MistralAI Mistral Large v2 123b

IQ4_XXSR: basically IQ4_XS with attn_q in IQ3_S, attn_v in Q6_K, and token_embd in Q6_0.

Yes, you read that correctly: Q6_0, Ikawrakow's latest traditional quant, which is not available on Llama.cpp mainline.
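
For illustration, here is a minimal Python sketch of the selection logic behind such a mix: one base quant plus a few per-tensor overrides. The tensor names follow GGUF conventions, but the rule table and function are hypothetical, not llama.cpp's actual quantization API.

```python
# Hypothetical sketch of per-tensor quant selection for a mix like IQ4_XXSR.
# Not llama.cpp code: it only illustrates "base quant + regex overrides".
import re

BASE_QUANT = "IQ4_XS"

# First matching rule wins; unmatched tensors fall back to the base quant.
OVERRIDES = [
    (r"token_embd\.weight", "Q6_0"),   # embeddings in Ikawrakow's Q6_0
    (r"attn_q\.weight",     "IQ3_S"),  # attention Q projections shrunk
    (r"attn_v\.weight",     "Q6_K"),   # attention V projections boosted
]

def quant_for(tensor_name: str) -> str:
    for pattern, qtype in OVERRIDES:
        if re.search(pattern, tensor_name):
            return qtype
    return BASE_QUANT

for name in ["token_embd.weight", "blk.0.attn_q.weight",
             "blk.0.attn_v.weight", "blk.0.ffn_down.weight"]:
    print(f"{name:24} -> {quant_for(name)}")
```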

IQ3_XXSR_144L: a 3.01bpw quant, compatible with dual-GPU setups (with Croco.cpp: 28k context in kv q6/q5, bbs64, full offload).

It hits 3.44 PPL512 in English and 3.18 PPL512 in French (a custom, more textual dataset, but the difference shows in actual use), and about 3.46/3.20 with a quantized KV cache at q6/q5 or even q51/q5.

Probably the best GGUF quant you can get for a dual-GPU setup at mid-range context.
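
As a reminder of what these numbers measure: PPL512 is perplexity computed over 512-token windows, i.e. the exponential of the mean negative log-likelihood per token; lower is better, and quantization nudges it upward. A toy illustration with made-up probabilities (not model output):

```python
# Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
# The probabilities below are invented for illustration only.
import math

token_probs = [0.35, 0.12, 0.60, 0.08, 0.45]  # p(token | context), hypothetical
nll = [-math.log(p) for p in token_probs]
ppl = math.exp(sum(nll) / len(nll))
print(f"perplexity = {ppl:.2f}")
```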

IQ2_XL_143L: a 2.88bpw quant. Same features; PPL512 eng is 3.53, PPL512 fr is 3.21. 44k+ context in kv q51/iq4nl bbs64.

IQ2_L_144L: a 2.79bpw quant. Same features; PPL512 eng is 3.63, PPL512 fr is 3.25. 56k+ context in kv q51/iq4nl bbs64.

IQ2_MR_144L: a 2.66bpw quant. Same features; PPL512 eng is 3.80, PPL512 fr is 3.29. 70k+ context in kv q51/iq4nl bbs64.

IQ2_SR_144L: a 2.58bpw quant. Same features; PPL512 eng is 3.87, PPL512 fr is 3.32. 80k+ context in kv q51/iq4nl bbs64.

IQ2_XSR_144: a 2.45bpw quant. Same features; PPL512 eng is 4.07, PPL512 fr is 3.36. 95k+ context in kv q51/iq4nl bbs64.
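
To put those context figures in perspective, here is a rough estimate of the KV-cache footprint. The architecture numbers (88 layers, 8 KV heads of dimension 128) and the effective bits per cache element for q51/iq4nl are assumptions for illustration, not measurements from these files:

```python
# Back-of-the-envelope KV-cache size for quantized K/V, under assumed
# architecture parameters (88 layers, 8 KV heads, head_dim 128).
def kv_cache_gib(n_ctx: int, n_layers: int = 88, kv_heads: int = 8,
                 head_dim: int = 128, k_bits: float = 6.0, v_bits: float = 4.5) -> float:
    # K and V each store kv_heads * head_dim elements per token per layer.
    per_token_bits = n_layers * kv_heads * head_dim * (k_bits + v_bits)
    return n_ctx * per_token_bits / 8 / 1024**3

for ctx in (28_672, 45_056, 97_280):  # ~28k, ~44k, ~95k tokens
    print(f"{ctx:6d} ctx -> ~{kv_cache_gib(ctx):.1f} GiB KV cache")
```

The lower the weight bpw, the more VRAM is left over for this cache, which is where the extra context headroom in the list above comes from.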

-> These last quants are also almost perfectly symmetrical for 2 GPUs, with a tensor split (ts) of 44/45, and for 4 GPUs (for example 4x RTX 3060, 4060 Ti, or A4000), with a ts of 22/22/22/23.

To achieve that, I slightly shrank the quantization of some tensors in the last 25% of the layers to compensate for the size of the Q6_K output weight.
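
The split values follow from spreading the model's offloadable units as evenly as possible; assuming 89 units (88 layers plus the output layer, an assumption for illustration), the arithmetic looks like this:

```python
# Distribute `units` across `gpus` as evenly as possible, giving the
# remainder to the last GPUs (e.g. 22,22,22,23).
def split(units: int, gpus: int) -> list[int]:
    base, extra = divmod(units, gpus)
    return [base + (1 if i >= gpus - extra else 0) for i in range(gpus)]

print(split(89, 2))  # [44, 45]
print(split(89, 4))  # [22, 22, 22, 23]
```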

-> Also, these quants of course keep ARC Easy and ARC Challenge results in line with higher quants (Arc-C 50+, Arc-E 70+).

WARNING:

Quants with Q6_0 embeddings are compatible only with IK_Llama.cpp and Croco.cpp (my fork of the great KoboldCpp). I'll release .exe builds soon, but it already works (at least on Windows) for those who can compile.

https://github.com/Nexesenex/croco.cpp

Overall, maybe it's time for the Llama.cpp team to take a look at Ikawrakow's latest work and offer terms of cooperation, so that we can once again enjoy SOTA quants in Llama.cpp.

https://github.com/ikawrakow/ik_llama.cpp

Because the situation is becoming grotesque: we are mass-quantizing models with non-SOTA quants while something better is within reach.

Thousands of terabytes of storage space, along with our compute and our time, are being wasted because of this situation.