This repository contains improved Mixtral-8x7B quantized models in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and work out of the box.

The table below compares these models with the current llama.cpp quantization approach, using Wikitext perplexities (PPL) at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized) - PPL(int8)) / PPL(int8). Running the full fp16 Mixtral-8x7B model on the systems I have available takes too long, so I compare against the 8-bit quantized model, for which I measure PPL = 4.1049. From past experience, 8-bit quantization is essentially equivalent to fp16.

| Quantization | Model file | PPL (llama.cpp) | Error (llama.cpp) | PPL (new quants) | Error (new quants) |
|---|---|---|---|---|---|
| Q2_K | mixtral-8x7b-q2k.gguf | 7.4660 | 81.9% | 5.0576 | 23.2% |
| Q3_K_S | mixtral-8x7b-q3k-small.gguf | 4.4601 | 8.65% | 4.3848 | 6.82% |
| Q3_K_M | mixtral-8x7b-q3k-medium.gguf | 4.4194 | 7.66% | 4.2884 | 4.47% |
| Q4_K_S | mixtral-8x7b-q4k-small.gguf | 4.2523 | 3.59% | 4.1764 | 1.74% |
| Q4_K_M | mistral-8x7b-q4k-medium.gguf | 4.2523 | 3.59% | 4.1652 | 1.47% |
| Q5_K_S | mixtral-7b-q5k-small.gguf | 4.1395 | 0.84% | 4.1278 | 0.56% |
| Q4_0 | mixtral-8x7b-q40.gguf | 4.2232 | 2.88% | 4.2001 | 2.32% |
| Q4_1 | mistral-8x7b-q41.gguf | 4.2547 | 3.65% | 4.1713 | 1.62% |
| Q5_0 | mistral-8x7b-q50.gguf | 4.1426 | 0.92% | 4.1335 | 0.70% |
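As a sanity check, the quantization-error definition above can be reproduced in a few lines of Python; the numbers below are taken directly from the Q2_K row of the table:

```python
# Quantization error as defined above: the relative PPL increase
# versus the 8-bit baseline (PPL = 4.1049), which stands in for fp16.
PPL_INT8 = 4.1049

def quant_error(ppl_quantized: float) -> float:
    """Return (PPL(quantized) - PPL(int8)) / PPL(int8)."""
    return (ppl_quantized - PPL_INT8) / PPL_INT8

# Check the Q2_K row of the table:
print(f"{quant_error(7.4660):.1%}")  # llama.cpp quants -> 81.9%
print(f"{quant_error(5.0576):.1%}")  # new quants -> 23.2%
```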
Model size: 46.7B params, architecture: llama, format: GGUF.