---
license: apache-2.0
---
This repository contains improved Mixtral-8x7B quantized models in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.
The table below compares these models to the current llama.cpp quantization approach, using Wikitext perplexity (PPL) at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(int8)) / PPL(int8).
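As a quick sanity check, this definition can be sketched in a couple of lines of Python (the helper name is mine, not part of llama.cpp):

```python
def quantization_error(ppl_quant: float, ppl_int8: float) -> float:
    """Relative perplexity increase of a quantized model over the 8-bit baseline."""
    return (ppl_quant - ppl_int8) / ppl_int8

# Q4_K_M row with the new quants: PPL = 4.1652, int8 baseline PPL = 4.1049
print(f"{quantization_error(4.1652, 4.1049):.2%}")  # prints 1.47%
```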
Running the full fp16 Mixtral-8x7B model on the systems I have available takes too long, so I compare against the 8-bit quantized model, for which I measure PPL = 4.1049. From past experience, 8-bit quantization should be essentially equivalent to fp16.
Quantization | Model file | PPL(llama.cpp) | Quantization Error | PPL(new quants) | Quantization Error |
---|---|---|---|---|---|
Q2_K | mixtral-8x7b-q2k.gguf | 7.4660 | 81.9% | 5.0576 | 23.2% |
Q3_K_S | mixtral-8x7b-q3k-small.gguf | 4.4601 | 8.65% | 4.3848 | 6.82% |
Q3_K_M | mixtral-8x7b-q3k-medium.gguf | 4.4194 | 7.66% | 4.2884 | 4.47% |
Q4_K_S | mixtral-8x7b-q4k-small.gguf | 4.2523 | 3.59% | 4.1764 | 1.74% |
Q4_K_M | mixtral-8x7b-q4k-medium.gguf | 4.2523 | 3.59% | 4.1652 | 1.47% |
Q5_K_S | mixtral-8x7b-q5k-small.gguf | 4.1395 | 0.84% | 4.1278 | 0.56% |
Q4_0 | mixtral-8x7b-q40.gguf | 4.2232 | 2.88% | 4.2001 | 2.32% |
Q4_1 | mixtral-8x7b-q41.gguf | 4.2547 | 3.65% | 4.1713 | 1.62% |
Q5_0 | mixtral-8x7b-q50.gguf | 4.1426 | 0.92% | 4.1335 | 0.70% |
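The error columns can be recomputed directly from the PPL values; a small Python sketch (PPL numbers copied from the table above):

```python
# 8-bit baseline perplexity quoted above
ppl_int8 = 4.1049

# quantization -> (PPL with current llama.cpp quants, PPL with the new quants)
rows = {
    "Q2_K":   (7.4660, 5.0576),
    "Q3_K_S": (4.4601, 4.3848),
    "Q3_K_M": (4.4194, 4.2884),
    "Q4_K_S": (4.2523, 4.1764),
    "Q4_K_M": (4.2523, 4.1652),
    "Q5_K_S": (4.1395, 4.1278),
    "Q4_0":   (4.2232, 4.2001),
    "Q4_1":   (4.2547, 4.1713),
    "Q5_0":   (4.1426, 4.1335),
}

for name, (ppl_old, ppl_new) in rows.items():
    err_old = 100 * (ppl_old - ppl_int8) / ppl_int8
    err_new = 100 * (ppl_new - ppl_int8) / ppl_int8
    print(f"{name}: {err_old:.2f}% -> {err_new:.2f}%")
```

Each row's pair of percentages matches the two "Quantization Error" columns, up to rounding.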