Custom Quants for MistralAI Mistral Large v2 123b
IQ4_XXSR: basically IQ4_XS with attn_q in IQ3_S, attn_v in Q6_K, and token_embd in Q6_0. Yes, you read that correctly: Q6_0, Ikawrakow's latest traditional quant, not available in Llama.cpp mainline.
IQ3_XXSR_144L: A 3.01bpw quant, compatible with dual-GPU setups (with Croco.cpp: 28k context with q6/q5 KV cache at bbs64, full offload).
It hits 3.44 PPL512 in English and 3.18 PPL512 in French (a custom, more textual dataset, but the difference shows in use), or 3.46/3.20 with a quantized q6/q5 or even q5_1/q5 cache. Probably the best GGUF quant you can get for a dual-GPU setup at mid-range context.
IQ2_XL_143L: A 2.88bpw quant. Same features; PPL512 eng 3.53, PPL512 fr 3.21. 44k+ context with q5_1/iq4_nl KV cache at bbs64.
IQ2_L_144L: A 2.79bpw quant. Same features; PPL512 eng 3.63, PPL512 fr 3.25. 56k+ context with q5_1/iq4_nl KV cache at bbs64.
IQ2_MR_144L: A 2.66bpw quant. Same features; PPL512 eng 3.80, PPL512 fr 3.29. 70k+ context with q5_1/iq4_nl KV cache at bbs64.
IQ2_SR_144L: A 2.58bpw quant. Same features; PPL512 eng 3.87, PPL512 fr 3.32. 80k+ context with q5_1/iq4_nl KV cache at bbs64.
-> These last quants are also almost perfectly symmetrical for 2 GPUs with a tensor split of 44,45, and for 4 GPUs (for example 4x RTX 3060, 4060 Ti, or A4000) with a tensor split of 22,22,22,23. To achieve that, I slightly shrank the quantization of some of the last 25% of the layers to match the size of the Q6_K output weight.
-> Also, these quants of course keep ARC Easy and ARC Challenge results in line with higher quants (ARC-C 50+, ARC-E 70+).
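For a rough idea of what fits where, here is a minimal sizing sketch (my own back-of-the-envelope math, not measured figures): the weights take roughly parameters × bpw / 8 bytes, split across GPUs in the tensor-split ratio, with KV cache and compute buffers on top. The 123e9 parameter count is taken from the model name, and the bpw values are the ones listed above.

```python
# Back-of-the-envelope VRAM math for the quants above (an estimate, not a
# measurement). Weights take roughly params * bpw / 8 bytes; the per-GPU
# share follows the tensor-split ratio. KV cache and compute buffers come
# on top of these numbers.

PARAMS = 123e9  # approximate parameter count of Mistral Large v2 123b

def weight_gib(bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return PARAMS * bpw / 8 / 1024**3

def split_gib(bpw: float, tensor_split: list[float]) -> list[float]:
    """Per-GPU weight share for a given tensor-split ratio."""
    total = weight_gib(bpw)
    s = sum(tensor_split)
    return [total * t / s for t in tensor_split]

for name, bpw in [("IQ3_XXSR_144L", 3.01), ("IQ2_XL_143L", 2.88),
                  ("IQ2_L_144L", 2.79), ("IQ2_MR_144L", 2.66),
                  ("IQ2_SR_144L", 2.58)]:
    a, b = split_gib(bpw, [44, 45])
    print(f"{name}: ~{weight_gib(bpw):.1f} GiB of weights, "
          f"~{a:.1f} + {b:.1f} GiB on a 44,45 dual-GPU split")
```

For the 3.01bpw quant, for example, this works out to roughly 43 GiB of weights, or about 21 + 22 GiB across the two cards on the 44,45 split, before KV cache and buffers.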
WARNING: Quants with Q6_0 embeddings are compatible only with IK_Llama.cpp and Croco.cpp (my fork of the great KoboldCpp). I'll release an .exe soon, but it already works (at least on Windows) for those who can compile: https://github.com/Nexesenex/croco.cpp
Overall, maybe it's time for the Llama.cpp team to take a look at Ikawrakow's latest work and offer to cooperate with him, so we can once again enjoy SOTA quants in Llama.cpp. https://github.com/ikawrakow/ik_llama.cpp
Because the situation is becoming grotesque: we are mass-quantizing models with non-SOTA quants while something better is within reach. Thousands of terabytes of storage, along with our compute and our time, are wasted as a result.