---
base_model: mistral-community/Mixtral-8x22B-v0.1
license: apache-2.0
---

# Importance-Matrix quantizations of Mixtral-8x22B-v0.1 💫

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included in this repo).
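
For context, an importance matrix like this is normally produced with llama.cpp's `imatrix` tool. A minimal sketch of that kind of invocation (the input model file name and thread count are assumptions, and exact flags can vary between llama.cpp versions):

```
# compute an importance matrix over the bundled calibration text
./imatrix -m mix-f16.gguf -f wikitext.train.raw -o imatrix.dat --chunks 1000 -t 48
```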

Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.

To put the split files back together, run:

```
cat ~/mix4ns.gguf.part* > ~/mix4ns.gguf && rm -f ~/mix4ns.gguf.part*
cat ~/mix8ns.gguf.part* > ~/mix8ns.gguf && rm -f ~/mix8ns.gguf.part*
```
Careful: this can take 5 minutes, or 10-15 on slow instances; check progress with `ls -la`.
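
Once the concatenation finishes, a quick sanity check (shown here for the 8-bit file; a valid GGUF file starts with the 4-byte magic "GGUF"):

```
ls -la ~/mix8ns.gguf      # size should be roughly the sum of the part files
head -c 4 ~/mix8ns.gguf   # should print: GGUF
```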

# Perplexity benchmarks

Command I used to run these on a 48-core, CPU-only machine; you can add `-ngl 16` to offload 16 layers (or more) to your GPU:

```./perplexity -m ~/mix4xs.gguf -f wiki.test.raw --chunks 12 -t 48 ```
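
For example, the same benchmark with 16 layers offloaded to the GPU (assuming a GPU-enabled llama.cpp build) would look like:

```
./perplexity -m ~/mix4xs.gguf -f wiki.test.raw --chunks 12 -t 48 -ngl 16
```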

The results are interesting. Quantizing from the hf bf16 folder to an f16 gguf adds a bit of loss (increases perplexity).
I've noticed on smaller models that going straight from the huggingface repo folder to 8-bit via `python convert.py --outtype q8_0` produces lower perplexity than going hf -> f16 -> q8_0.
What's even more interesting is that quantizing TWICE (hf -> q8_0, and then q8_0 -> imatrix quant) also produces better perplexity than going from a regular f16 gguf to the imatrix quant.
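
A rough sketch of the routes being compared, using standard llama.cpp tooling (file names are placeholders, exact convert.py options depend on your llama.cpp checkout, and the iq4-ns type itself comes from the custom C++ changes mentioned above, so treat this as illustrative only):

```
# route A: hf -> f16 gguf -> q8_0
python convert.py /path/to/Mixtral-8x22B-v0.1 --outtype f16 --outfile mix-f16.gguf
./quantize mix-f16.gguf mix8-from-f16.gguf Q8_0

# route B: hf -> q8_0 directly (the "slow convert.py from hf" file in the benchmarks below)
python convert.py /path/to/Mixtral-8x22B-v0.1 --outtype q8_0 --outfile mix8.gguf

# quantizing twice: requantize the q8_0 file with the importance matrix applied
# (--allow-requantize is needed because the input is already quantized)
./quantize --allow-requantize --imatrix imatrix.dat mix8.gguf mix4.gguf IQ4_XS
```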

All you need to pay attention to is the final value, PPL = 2.2585 in this case, which is that of a regular 8-bit quant.

# NOT ALL 8-BIT QUANTS ARE CREATED EQUAL: this one took 9 hours to convert to 8-bit on a 64-core CPU with 256 GB RAM (8-channel DDR5)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/nxezmVBen-tXJdYGYDe2h.png)

Even though this file is slower to produce, it gets a tiny bit lower perplexity.
It looks like nothing here over 12 chunks (2.2584 for mix8ns vs 2.2585 for the regular q8_0),
but past testing on smaller models with 100+ chunks has shown this difference to be a bit more pronounced.

```
perplexity regular q8_0 (from f16): 126.35 seconds per pass - ETA 6.32 minutes
[1]2.6256,[2]3.1043,[3]3.6463,[4]3.2092,[5]2.6847,[6]2.4791,[7]2.3112,[8]2.2502,[9]2.2858,[10]2.2690,[11]2.2693,[12]2.2585,
Final estimate: PPL = 2.2585 +/- 0.06534

perplexity q8_0 (slow convert.py from hf): 96.86 seconds per pass - ETA 4.83 minutes
[1]2.6191,[2]3.1045,[3]3.6551,[4]3.2302,[5]2.6990,[6]2.4908,[7]2.3167,[8]2.2541,[9]2.2877,[10]2.2682,[11]2.2685,[12]2.2584,
Final estimate: PPL = 2.2584 +/- 0.06514

perplexity regular iq4_xs (no imatrix): 91.53 seconds per pass
[1]2.6966,[2]3.1749,[3]3.6972,[4]3.2577,[5]2.7905,[6]2.6097,[7]2.4536,[8]2.4001,[9]2.4469,[10]2.4219,[11]2.4366,[12]2.4367,
Final estimate: PPL = 2.4367 +/- 0.07218

perplexity regular q4_km (no imatrix): 108.59 seconds per pass
[1]2.6100,[2]3.1304,[3]3.6897,[4]3.3500,[5]2.8118,[6]2.5992,[7]2.4349,[8]2.3816,[9]2.4174,[10]2.3959,[11]2.3988,[12]2.3976,
Final estimate: PPL = 2.3976 +/- 0.07111

perplexity EdgeQuant iq4-ns (no imatrix): 84.45 seconds per pass - FILESIZE 77258 MB
[1]2.7195,[2]3.1821,[3]3.7177,[4]3.3017,[5]2.8012,[6]2.6034,[7]2.4318,[8]2.3747,[9]2.4160,[10]2.3931,[11]2.4023,[12]2.4013,
Final estimate: PPL = 2.4013 +/- 0.07116

perplexity EdgeQuant iq4-ns (WITH imatrix, mix4ns.gguf): 82.76 seconds per pass - FILESIZE 75626 MB
[1]2.7175,[2]3.1616,[3]3.6859,[4]3.3012,[5]2.7780,[6]2.5755,[7]2.4093,[8]2.3523,[9]2.3885,[10]2.3757,[11]2.3859,[12]2.3843,
Final estimate: PPL = 2.3843 +/- 0.07036
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/tNd_wHwdbgRGx28ZgBHGq.png)

Command to run these was:

```
./main -m mix4ns.gguf -n 256 -t 48 --temp 0.5 --color -p "How to build a city on mars via shipping through aldrin cycler orbits?"
```