Custom Quants for MistralAI Mistral Large v2 123b

IQ4_XXSR: basically IQ4_XS with attn_q in IQ3_S, attn_v in Q6_K, and token_embd in Q6_0.

Yes, you read that correctly: Q6_0, Ikawrakow's latest traditional quant, which is not available on Llama.cpp mainline.
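
For illustration, here is a minimal Python sketch of the selection logic behind such a mix: one base quant plus a few per-tensor overrides. The tensor names follow GGUF conventions, but the rule table and function are hypothetical, not llama.cpp's actual quantization API.

```python
# Hypothetical sketch of per-tensor quant selection for a mix like IQ4_XXSR.
# Not llama.cpp code: it only illustrates "base quant + regex overrides".
import re

BASE_QUANT = "IQ4_XS"

# First matching rule wins; unmatched tensors fall back to the base quant.
OVERRIDES = [
    (r"token_embd\.weight", "Q6_0"),   # embeddings in Ikawrakow's Q6_0
    (r"attn_q\.weight",     "IQ3_S"),  # attention Q projections shrunk
    (r"attn_v\.weight",     "Q6_K"),   # attention V projections boosted
]

def quant_for(tensor_name: str) -> str:
    for pattern, qtype in OVERRIDES:
        if re.search(pattern, tensor_name):
            return qtype
    return BASE_QUANT

for name in ["token_embd.weight", "blk.0.attn_q.weight",
             "blk.0.attn_v.weight", "blk.0.ffn_down.weight"]:
    print(f"{name:24} -> {quant_for(name)}")
```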

IQ3_XXSR_144L: a 3.01bpw quant, compatible with dual-GPU setups (with Croco.cpp: 28k context in kv q6/q5, bbs64, full offload).

It hits 3.44 PPL512 in English and 3.18 PPL512 in French (a custom, more textual dataset, but the difference shows in actual use), and about 3.46/3.20 with a quantized KV cache at q6/q5 or even q51/q5.

Probably the best GGUF quant you can get for a dual-GPU setup at mid-range context.
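
As a reminder of what these numbers measure: PPL512 is perplexity computed over 512-token windows, i.e. the exponential of the mean negative log-likelihood per token; lower is better, and quantization nudges it upward. A toy illustration with made-up probabilities (not model output):

```python
# Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
# The probabilities below are invented for illustration only.
import math

token_probs = [0.35, 0.12, 0.60, 0.08, 0.45]  # p(token | context), hypothetical
nll = [-math.log(p) for p in token_probs]
ppl = math.exp(sum(nll) / len(nll))
print(f"perplexity = {ppl:.2f}")
```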

IQ2_XL_143L: a 2.88bpw quant. Same features; PPL512 eng is 3.53, PPL512 fr is 3.21. 44k+ context in kv q51/iq4nl bbs64.

IQ2_L_144L: a 2.79bpw quant. Same features; PPL512 eng is 3.63, PPL512 fr is 3.25. 56k+ context in kv q51/iq4nl bbs64.

IQ2_MR_144L: a 2.66bpw quant. Same features; PPL512 eng is 3.80, PPL512 fr is 3.29. 70k+ context in kv q51/iq4nl bbs64.

IQ2_SR_144L: a 2.58bpw quant. Same features; PPL512 eng is 3.87, PPL512 fr is 3.32. 80k+ context in kv q51/iq4nl bbs64.

IQ2_XSR_144: a 2.45bpw quant. Same features; PPL512 eng is 4.07, PPL512 fr is 3.36. 95k+ context in kv q51/iq4nl bbs64.
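
To put those context figures in perspective, here is a rough estimate of the KV-cache footprint. The architecture numbers (88 layers, 8 KV heads of dimension 128) and the effective bits per cache element for q51/iq4nl are assumptions for illustration, not measurements from these files:

```python
# Back-of-the-envelope KV-cache size for quantized K/V, under assumed
# architecture parameters (88 layers, 8 KV heads, head_dim 128).
def kv_cache_gib(n_ctx: int, n_layers: int = 88, kv_heads: int = 8,
                 head_dim: int = 128, k_bits: float = 6.0, v_bits: float = 4.5) -> float:
    # K and V each store kv_heads * head_dim elements per token per layer.
    per_token_bits = n_layers * kv_heads * head_dim * (k_bits + v_bits)
    return n_ctx * per_token_bits / 8 / 1024**3

for ctx in (28_672, 45_056, 97_280):  # ~28k, ~44k, ~95k tokens
    print(f"{ctx:6d} ctx -> ~{kv_cache_gib(ctx):.1f} GiB KV cache")
```

The lower the weight bpw, the more VRAM is left over for this cache, which is where the extra context headroom in the list above comes from.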

-> These last quants are also almost perfectly symmetrical for 2 GPUs, with a tensor split (ts) of 44/45, and for 4 GPUs (for example 4x RTX 3060, 4060 Ti, or A4000), with a ts of 22/22/22/23.

To achieve that, I slightly shrank the quantization of some tensors in the last 25% of the layers to compensate for the size of the Q6_K output weight.
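
The split values follow from spreading the model's offloadable units as evenly as possible; assuming 89 units (88 layers plus the output layer, an assumption for illustration), the arithmetic looks like this:

```python
# Distribute `units` across `gpus` as evenly as possible, giving the
# remainder to the last GPUs (e.g. 22,22,22,23).
def split(units: int, gpus: int) -> list[int]:
    base, extra = divmod(units, gpus)
    return [base + (1 if i >= gpus - extra else 0) for i in range(gpus)]

print(split(89, 2))  # [44, 45]
print(split(89, 4))  # [22, 22, 22, 23]
```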

-> Also, these quants of course keep ARC Easy and ARC Challenge results in line with higher quants (Arc-C 50+, Arc-E 70+).

WARNING:

Quants with Q6_0 embeddings are compatible only with IK_Llama.cpp and Croco.cpp (my fork of the great KoboldCpp). I'll release .exe builds soon, but it already works (at least on Windows) for those who can compile.

https://github.com/Nexesenex/croco.cpp

Overall, maybe it's time for the Llama.cpp team to take a look at Ikawrakow's latest work and offer terms of cooperation, so that we can once again enjoy SOTA quants in Llama.cpp.

https://github.com/ikawrakow/ik_llama.cpp

Because the situation is becoming grotesque: we are mass-quantizing models with non-SOTA quants while something better is within reach.

Thousands of terabytes of storage space, along with our compute and our time, are being wasted because of this situation.