
Custom Quants for MistralAI Mistral Large v2 123b

IQ4_XXSR : basically IQ4_XS with attn_q in IQ3_S, attn_v in Q6_K, and token_embed in Q6_0. Yes, you read that correctly: Q6_0, the last traditional quant from Ikawrakow, not available in mainline Llama.cpp.
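For intuition, here is a minimal sketch of how such a mixed recipe shifts the overall bits-per-weight. The per-type bpw figures are approximate, and the tensor-size shares are hypothetical values chosen for illustration, not measured proportions of this model.

```python
# Rough estimate of the average bits-per-weight of a mixed quant recipe.
# Per-type bpw values are approximate; the parameter shares below are
# hypothetical, for illustration only (not exact for Mistral Large v2 123b).

BPW = {
    "IQ4_XS": 4.25,
    "IQ3_S":  3.44,
    "Q6_K":   6.56,
    "Q6_0":   6.50,  # ik_llama.cpp / Croco.cpp only
}

# (quant type, assumed share of total parameters)
recipe = [
    ("IQ4_XS", 0.80),  # bulk of the network
    ("IQ3_S",  0.09),  # attn_q
    ("Q6_K",   0.09),  # attn_v (and output weight)
    ("Q6_0",   0.02),  # token embeddings
]

avg_bpw = sum(BPW[t] * share for t, share in recipe)
print(f"estimated average bpw: {avg_bpw:.2f}")
```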

IQ3_XXSR_144L : A 3.01bpw quant, compatible with dual-GPU setups (with Croco.cpp, 28k context in KV q6/q5 bbs64, full offload).

It hits a PPL512 of 3.44 in English and 3.18 in French (a custom, more textual dataset, but the difference shows in actual usage), so 3.46/3.20 with quantized cache q6/q5 or even q51/q5. Probably the best GGUF quant you can get for a dual-GPU setup at mid-range context (see the KV-cache sizing sketch after the list of variants below).

IQ2_XL_143L : A 2.88bpw quant. Same features, PPL512 eng is 3.53, PPL512 fr is 3.21. 44k+ context in kv q51/iq4nl bbs64.

IQ2_L_144L : A 2.79bpw quant. Same features, PPL512 eng is 3.63, PPL512 fr is 3.25. 56k+ context in kv q51/iq4nl bbs64.

IQ2_MR_144L : A 2.66bpw quant. Same features, PPL512 eng is 3.80, PPL512 fr is 3.29. 70k+ context in kv q51/iq4nl bbs64.

IQ2_SR_144L : A 2.58bpw quant. Same features, PPL512 eng is 3.87, PPL512 fr is 3.32. 80k+ context in kv q51/iq4nl bbs64.
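As a rough sanity check on the context figures above, here is a sketch of KV-cache sizing. The model dimensions (88 layers, 8 KV heads, head dim 128) and the bits-per-element of the quantized cache types are assumptions for illustration, not exact values for these quants.

```python
# Rough KV-cache size estimate for a Mistral-Large-class model.
# Assumed model dims: 88 layers, 8 KV heads, head dim 128 (GQA).
# Bits per element for the quantized cache types are approximate.

N_LAYERS, N_KV_HEADS, HEAD_DIM = 88, 8, 128

def kv_cache_gib(n_ctx, k_bits, v_bits):
    """Approximate size in GiB of the K + V cache at context n_ctx."""
    elems_per_token = N_LAYERS * N_KV_HEADS * HEAD_DIM  # per K, same for V
    bits = n_ctx * elems_per_token * (k_bits + v_bits)
    return bits / 8 / 1024**3

# e.g. ~28k context with a ~6.5 / ~5.5 bpw K/V cache (q6/q5-like)
print(f"{kv_cache_gib(28_672, 6.5, 5.5):.1f} GiB")
# e.g. ~80k context with a lighter ~5.5 / ~4.5 bpw K/V cache
print(f"{kv_cache_gib(81_920, 5.5, 4.5):.1f} GiB")
```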

-> These last quants are also almost perfectly symmetrical for 2 GPUs with ts 44-45, and for 4 GPUs (for example 4x RTX 3060, 4060 Ti, or A4000) with ts 22,22,22,23. To achieve that, I slightly shrank the quantization of some of the last 25% of the layers to match the size of the Q6_K output_weight (a sketch of how the split maps onto layers follows below).
-> These quants also keep ARC Easy and ARC Challenge results in line with higher quants (Arc-C 50+, Arc-E 70+).
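A minimal sketch of the arithmetic behind those splits, assuming a llama.cpp-style layer split where the 88 repeating layers plus the output layer (89 units in total) are distributed proportionally to the tensor-split ratios; the actual assignment logic in Croco.cpp may differ in detail.

```python
# Distribute the offloaded layers of a 123b model (88 repeating layers
# + 1 output "layer" = 89 units) across GPUs, proportionally to the
# tensor-split ratios. Mirrors the idea of llama.cpp's layer split;
# the real assignment logic may differ.

def split_layers(ratios, n_units=89):
    total = sum(ratios)
    # cumulative proportional boundaries, rounded to whole layers
    bounds = [round(n_units * sum(ratios[:i + 1]) / total) for i in range(len(ratios))]
    counts, prev = [], 0
    for b in bounds:
        counts.append(b - prev)
        prev = b
    return counts

print(split_layers([44, 45]))          # -> [44, 45]   (2-GPU "ts 44-45")
print(split_layers([22, 22, 22, 23]))  # -> [22, 22, 22, 23]  (4-GPU split)
```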

WARNING: Quants with Q6_0 embeddings are compatible only with IK_Llama.cpp and Croco.cpp (my fork of the great KoboldCpp). I'll release .exe builds soon, but it already works (at least on Windows) for those who can compile: https://github.com/Nexesenex/croco.cpp

Overall, maybe it's time for the Llama.cpp team to take a look at Ikawrakow's latest work and offer to cooperate with him, so we can once again enjoy SOTA quants in Llama.cpp: https://github.com/ikawrakow/ik_llama.cpp

Because the situation is becoming grotesque: we are massively quantizing models with non-SOTA quants while better options are within reach. Thousands of terabytes of storage space, along with our compute and our time, are wasted because of this.
