---
license: apache-2.0
---
|
Update: User (@concendo) asked whether these were made before or after the 4/3 update to llama.cpp. Since I wasn't sure, everything was requantized with the 4/18 version of llama.cpp.
|
|
|
Note: the qx-k-m quants are not as good as the qx-0 quants; something about the 'k' quantization process doesn't play nice with Mixtral.
|
|
|
|
|
These are the quantized GGUF files for [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). |
|
|
|
They were converted from Mistral's safetensors and quantized on April 18, 2024. |
|
This matters because some of the GGUF files for Mixtral 8x7B were created as soon as llama.cpp gained support for the MoE architecture, back when that code path still had bugs.
|
Those bugs have since been patched. |
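
For anyone who wants to reproduce the files, the process was roughly the following. This is a minimal sketch assuming a local llama.cpp checkout from around that date; the script and binary names (convert.py, ./quantize) have changed in later versions, and the model directory and output file names here are only placeholders.

```python
# Sketch of the conversion + quantization steps with llama.cpp (circa April 2024).
# Adjust script/binary names and paths to match your checkout.
import subprocess

MODEL_DIR = "Mixtral-8x7B-Instruct-v0.1"        # local safetensors download (placeholder)
F16_GGUF = "mixtral-8x7b-instruct-f16.gguf"     # intermediate unquantized file (placeholder)

# 1. Convert Mistral's safetensors checkpoint to an unquantized f16 GGUF.
subprocess.run(
    ["python", "convert.py", MODEL_DIR, "--outtype", "f16", "--outfile", F16_GGUF],
    check=True,
)

# 2. Quantize the f16 GGUF into each of the formats discussed below.
for quant in ["Q3_K_M", "Q4_0", "Q4_K_M", "Q5_0", "Q5_K_M", "Q6_K"]:
    subprocess.run(
        ["./quantize", F16_GGUF, f"mixtral-8x7b-instruct-{quant.lower()}.gguf", quant],
        check=True,
    )
```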
|
|
|
These are here for reference, comparison, and any future work. |
|
|
|
The quality of the llamafiles generated from these freshly converted GGUFs was noticeably better than that of llamafiles generated from the other GGUFs on HF.
|
|
|
These quants were the most interesting because:
|
- q3-k-m: fits entirely on a 4090 (24 GB VRAM), so inference is very fast (see the loading sketch after this list)
|
- q4-0: for some reason, this gives better output quality than q4-k-m.
|
- q4-k-m: widely accepted as "good enough" and a general favorite for most models, but in this case it does not fit on a 4090
|
- q5-0: *recommended*. For some reason, this gives better output quality than q5-k-m.
|
- q5-k-m: my favorite for smaller models; here it serves as a reference for "what if more than just a bit won't fit on the GPU"
|
- q6-k: lower perplexity, but I don't like the output style |
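
As a rough illustration of the VRAM point above, here is a minimal loading sketch using the llama-cpp-python bindings (built with GPU support). The file name is a placeholder; `n_gpu_layers` is the knob to lower for the quants that don't fit entirely on a 4090.

```python
# Minimal loading sketch, assuming llama-cpp-python installed with CUDA support.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-q3_k_m.gguf",  # placeholder file name
    n_gpu_layers=-1,  # offload all layers; reduce for quants that exceed 24 GB VRAM
    n_ctx=4096,       # context window
)

# Mixtral-Instruct expects the [INST] ... [/INST] prompt format.
out = llm("[INST] Explain what a GGUF file is in one sentence. [/INST]", max_tokens=128)
print(out["choices"][0]["text"])
```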