---
license: apache-2.0
---
Update: User @concendo asked whether these were made before or after the 4/3 update to llama.cpp. Since I wasn't sure, everything was requantized with the 4/18 version of llama.cpp.

Note: the qx-k-m quants are not as good as the qx-0 quants; something about the 'k' quantization process doesn't play nicely with Mixtral.


These are the quantized GGUF files for [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).

They were converted from Mistral's safetensors and quantized on April 18, 2024. 
This matters because some of the GGUF files for Mixtral 8x7B were created as soon as llama.cpp added support for the MoE architecture, while that support still had bugs. 
Those bugs have since been patched. 
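
For context, the conversion/quantization workflow looked roughly like the sketch below. This is an illustration only: it assumes a llama.cpp checkout from mid-April 2024, where the conversion script was `convert.py` and the quantizer was the `quantize` binary (both have since been renamed), and the paths and filenames are placeholders.

```python
# Rough sketch of the convert-then-quantize flow (llama.cpp as of ~April 2024;
# script and binary names are assumptions and have changed in later revisions).
import subprocess

MODEL_DIR = "Mixtral-8x7B-Instruct-v0.1"       # local clone of Mistral's safetensors repo
F16_GGUF = "mixtral-8x7b-instruct-f16.gguf"    # intermediate unquantized GGUF

# 1. Convert the safetensors checkpoint to an f16 GGUF.
subprocess.run(
    ["python", "convert.py", MODEL_DIR, "--outtype", "f16", "--outfile", F16_GGUF],
    check=True,
)

# 2. Quantize the f16 GGUF into each target type.
for qtype in ["Q3_K_M", "Q4_0", "Q4_K_M", "Q5_0", "Q5_K_M", "Q6_K"]:
    out_file = f"mixtral-8x7b-instruct-{qtype.lower()}.gguf"
    subprocess.run(["./quantize", F16_GGUF, out_file, qtype], check=True)
```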

These are here for reference, comparison, and any future work. 

The quality of the llamafiles generated from these freshly converted GGUFs was noticeably better than that of llamafiles generated from the other Mixtral GGUFs on HF. 

These quants were the most interesting:
- q3-k-m: fits entirely on a 4090 (24GB VRAM), very fast inference
- q4-0: for some reason, this is better quality than q4-k-m
- q4-k-m: the widely accepted "good enough" standard and general favorite for most models, but in this case it does not fit on a 4090
- q5-0: **recommended**; for some reason, this is better quality than q5-k-m
- q5-k-m: my favorite for smaller models; here it serves as a reference for "what if more than just a little bit doesn't fit on the GPU" (see the partial-offload sketch after this list)
- q6-k: lower perplexity, but I don't like the output style
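
To make the GPU-fit notes above concrete, here is a minimal loading sketch using llama-cpp-python. The repo id, filenames, and `n_gpu_layers` values are placeholders and illustrative only, not measured settings from this repo.

```python
# Minimal sketch with llama-cpp-python; repo id, filenames, and layer counts
# below are placeholders, not values taken from this repo.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# q3-k-m fits entirely in 24GB of VRAM, so every layer can be offloaded.
q3_path = hf_hub_download(repo_id="<this-repo>", filename="<mixtral-q3_k_m>.gguf")
llm_q3 = Llama(model_path=q3_path, n_ctx=4096, n_gpu_layers=-1)  # -1 = offload all layers

# q5-0 / q5-k-m do not fully fit on a 4090: offload what fits, keep the rest on CPU.
q5_path = hf_hub_download(repo_id="<this-repo>", filename="<mixtral-q5_0>.gguf")
llm_q5 = Llama(model_path=q5_path, n_ctx=4096, n_gpu_layers=20)  # tune for your VRAM

out = llm_q5("[INST] Summarize what GGUF quantization does. [/INST]", max_tokens=128)
print(out["choices"][0]["text"])
```

With `n_gpu_layers=-1` everything runs on the GPU; lowering it trades speed for VRAM headroom.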