ikawrakow/mixtral-8x7b-quantized-gguf


---
license: apache-2.0
---

This repository contains improved Mixtral-8x7B quantized models in GGUF format for use with `llama.cpp`. The models are fully compatible with the official `llama.cpp` release and can be used out of the box.

The table below compares these models with the current `llama.cpp` quantization approach, using Wikitext perplexity (PPL) at a context length of 512 tokens.
The "Quantization Error" columns in the table are defined as `(PPL(quantized model) - PPL(int8))/PPL(int8)`.
Running the full `fp16` Mixtral-8x7B model on the systems I have available takes too long, so I'm comparing against the 8-bit quantized model, where I get `PPL = 4.1049`.
From past experience, 8-bit quantization should be basically equivalent to `fp16`.
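
To make the "Quantization Error" definition concrete, here is a minimal Python sketch of the formula. The function name `quantization_error` and the constant `PPL_INT8` are made up for this illustration and are not part of `llama.cpp`. The reference value is the 8-bit perplexity quoted above.

```python
# Reference perplexity of the 8-bit quantized model, as quoted in this README.
PPL_INT8 = 4.1049


def quantization_error(ppl_quantized: float, ppl_reference: float = PPL_INT8) -> float:
    """Relative perplexity increase over the reference model.

    Implements (PPL(quantized model) - PPL(int8)) / PPL(int8).
    """
    return (ppl_quantized - ppl_reference) / ppl_reference


if __name__ == "__main__":
    # Q2_K perplexities taken from the table: llama.cpp quants vs. new quants.
    for label, ppl in [("Q2_K (llama.cpp)", 7.4660), ("Q2_K (new quants)", 5.0576)]:
        print(f"{label}: {quantization_error(ppl):.1%}")
```

Multiplying the result by 100 and rounding gives the percentages listed in the table.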

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|--:|--:|--:|--:|--:|--:|
| Q2_K | mixtral-8x7b-q2k.gguf | 7.4660 | 81.9% | 5.0576 | 23.2% |
| Q3_K_S | mixtral-8x7b-q3k-small.gguf | 4.4601 | 8.65% | 4.3848 | 6.82% |
| Q3_K_M | mixtral-8x7b-q3k-medium.gguf | 4.4194 | 7.66% | 4.2884 | 4.47% |
| Q4_K_S | mixtral-8x7b-q4k-small.gguf | 4.2523 | 3.59% | 4.1764 | 1.74% |
| Q4_K_M | mistral-8x7b-q4k-medium.gguf | 4.2523 | 3.59% | 4.1652 | 1.47% |
| Q5_K_S | mixtral-7b-q5k-small.gguf | 4.1395 | 0.84% | 4.1278 | 0.56% |
| Q4_0 | mixtral-8x7b-q40.gguf | 4.2232 | 2.88% | 4.2001 | 2.32% |
| Q4_1 | mistral-8x7b-q41.gguf | 4.2547 | 3.65% | 4.1713 | 1.62% |
| Q5_0 | mistral-8x7b-q50.gguf | 4.1426 | 0.92% | 4.1335 | 0.70% |
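
One possible way to read the table is to ask how much of the perplexity gap to the 8-bit reference each new quantization closes. The sketch below is only an illustration: the helper `gap_closed` and the `ROWS` list are hypothetical names, and the metric itself is an assumption of this example rather than something defined in the README. All numbers are copied from the table and the text above.

```python
# Reference perplexity of the 8-bit quantized model, as quoted in this README.
PPL_INT8 = 4.1049

# (quantization, PPL with llama.cpp quants, PPL with new quants), copied from the table.
ROWS = [
    ("Q2_K", 7.4660, 5.0576),
    ("Q3_K_S", 4.4601, 4.3848),
    ("Q3_K_M", 4.4194, 4.2884),
    ("Q4_K_S", 4.2523, 4.1764),
    ("Q4_K_M", 4.2523, 4.1652),
    ("Q5_K_S", 4.1395, 4.1278),
    ("Q4_0", 4.2232, 4.2001),
    ("Q4_1", 4.2547, 4.1713),
    ("Q5_0", 4.1426, 4.1335),
]


def gap_closed(ppl_old: float, ppl_new: float, ppl_ref: float = PPL_INT8) -> float:
    """Fraction of the perplexity gap to the reference removed by the new quantization.

    0.0 means no improvement; 1.0 would mean the new model matches the reference.
    """
    return (ppl_old - ppl_new) / (ppl_old - ppl_ref)


if __name__ == "__main__":
    for name, ppl_old, ppl_new in ROWS:
        print(f"{name:7s} {gap_closed(ppl_old, ppl_new):6.1%}")
```

Because the gap is measured relative to the same `PPL = 4.1049` reference, this ratio equals the relative reduction in the "Quantization Error" columns for each row.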