Beinsezii committed on
Commit 2f8fcab
1 Parent(s): 31b68fb

Update README.md

Files changed (1)
  1. README.md +12 -12
README.md CHANGED
@@ -6,14 +6,14 @@ Quantizations for [Gryphe/MythoMax-L2-13b](https://huggingface.co/Gryphe/MythoMa
 
  Quant|Mem 4k|Mem 4k8|Δp|test wiki|test pippa
  ----|--|--|--|--|--
- [4k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8_pippa)|18GB||-0.0183|5.7781|4.4221
- [4k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5_pippa)|13GB||-0.0197|6.0141|4.4252
- [2k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8_pippa)|18GB||-0.0246?|5.7762|4.4238
- [2k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb6_b5_pippa)|13GB||-0.0121|6.0823|4.4363
- [4k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8)|18GB|||5.7459|4.4247
+ [4k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8_pippa)|17.2GB|15.7GB|-0.0183|5.7781|4.4221
+ [4k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5_pippa)|12.6GB|11.0GB|-0.0197|6.0141|4.4252
+ [2k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8_pippa)|17.2GB|15.7GB|-0.0246?|5.7762|4.4238
+ [2k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb6_b5_pippa)|12.6GB|11.0GB|-0.0121|6.0823|4.4363
+ [4k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8)|17.2GB|15.7GB||5.7459|4.4247
  [4k_hb6_b6](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b6)|15GB
- [4k_hb6_b5](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5)|13GB|||5.7699|4.4514
- [2k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8)|18GB|||5.7572|4.4242
+ [4k_hb6_b5](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5)|12.6GB|11.0GB||5.7699|4.4514
+ [2k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8)|17.2GB|15.7GB||5.7572|4.4242
  [2k_hb6_b4.125](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb6_b4.125)
 
 
@@ -33,9 +33,7 @@ Additional analysis:
  - 4k 5bit pippa is substantially worse than 2k 5bit pippa relative to 4k 8bit pippa and 2k 8bit pippa. Larger calibration lengths (or possibly at least more rows) are likely strongly preferable for aggressive quants
  - Dropping from 8 to 5 bits results in a 4% increase in wiki perplexity on the pippa calibrated quants while only resulting in 0.2% for the wiki calibrated quants
  - Both calibrations increase <0.1% on 8 to 5 bits when tested against pippa chat.
- With this rather superficial methodology, wikitext with 4k settings seems like the safest general purpose quant speed/quality tradeoff.
- However, real world usage would probably favor picking datasets at least tangentially related to the tune;
- all pippa calibrated sets performed fairly better on pippa test than their wiki calibrated counterparts.
+
 
  Detractors:
  - Real world usage may produce more visible differences than a hundredth of a percent on a small test
@@ -43,12 +41,14 @@ Detractors:
  - I have not tested increasing row count on the final quant for the same measurement
  - It's unclear if or how Δp should be used
 
- Given a (current) lack of real-world testing, I can only conclude that wikitext 2k is a safe default, but it should be worth considering other datasets for hyper specialized chat/story/etc models.
+ With this rather superficial methodology, wikitext with 4k settings seems like the safest general purpose quant speed/quality tradeoff.
+ However, real world usage would probably favor picking datasets more related to the tune;
+ all pippa calibrated sets performed moderately better on the pippa test than their wiki calibrated counterparts.
 
  All quantizations were calibrated with [wikitext-2](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) unless otherwise specified
 
  You can run a model calibrated at 2k with a 4k context or vice versa. The actual difference between 2k and 4k calibrations appears to be very small.
 
- VRAM estimates are performed with an extremely long chatlog in [oobabooga webui](https://github.com/oobabooga/text-generation-webui) on a 7900 XTX using [nvtop](https://github.com/Syllo/nvtop) to monitor **pytorch usage only**, rounded up. Systems with lots of extra background processes may use more. Additionally, NVIDIA based systems with [flash attention 2](https://github.com/Dao-AILab/flash-attention) **will use less VRAM** than otherwise estimated.
+ VRAM estimates are performed with an extremely long chatlog in [oobabooga webui](https://github.com/oobabooga/text-generation-webui) on a 7900 XTX using [nvtop](https://github.com/Syllo/nvtop) to monitor **pytorch usage only**. Systems with lots of extra background processes may use more. Additionally, NVIDIA based systems with [flash attention 2](https://github.com/Dao-AILab/flash-attention) **will use less VRAM** than otherwise estimated.
 
  The measurement files are provided in the main branch so you can [make your own quants](https://github.com/turboderp/exllamav2/blob/master/doc/convert.md) at other bit depths without going through the 2-3 hours of measuring.
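
For anyone reusing those measurement files, a command along these lines is the general shape of the workflow with exllamav2's `convert.py`. The paths and the `-b 5.0`/`-hb 6` values are placeholders for illustration, not the exact settings behind any branch in the table, and flag names can shift between exllamav2 versions, so defer to the linked convert.md.

```bash
# Rough sketch, not the exact command used for these branches.
# -i  : FP16 source model directory (placeholder path)
# -o  : working/scratch directory for temporary files
# -m  : the measurement.json from this repo's main branch (skips the measuring pass)
# -cf : output directory for the finished quantized model
# -b  : target bits per weight, -hb : bits for the output head layer
# Flag names follow exllamav2's convert.py and may differ between versions.
python convert.py \
  -i /models/MythoMax-L2-13b \
  -o /tmp/exl2-work \
  -m measurement.json \
  -cf /models/MythoMax-L2-13b-EXL2-5bit \
  -b 5.0 \
  -hb 6
```

Passing an existing measurement file with `-m` is what avoids the 2-3 hour measuring step; only the final quantization pass runs.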
 
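The `test wiki` and `test pippa` columns in the table are perplexity measurements against held-out data. As a rough sketch only (not the exact methodology behind those numbers), exllamav2's `test_inference.py` perplexity evaluation can produce a comparable figure; the model and dataset paths below are placeholders, and the `-ed` flag may differ between exllamav2 versions.

```bash
# Rough sketch of a perplexity check against a test set; not the exact
# settings used for the table above. -m points at a downloaded quant branch,
# -ed at a parquet/text test split such as wikitext-2 (placeholder paths).
# Flags may vary between exllamav2 versions.
python test_inference.py \
  -m /models/MythoMax-L2-13b-EXL2-4k_hb6_b5 \
  -ed /datasets/wikitext-2-v1/test.parquet
```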