Update README.md

Quantizations for [Gryphe/MythoMax-L2-13b](https://huggingface.co/Gryphe/MythoMax-L2-13b)
Quant|Mem 4k|Mem 4k8|Δp|test wiki|test pippa
----|--|--|--|--|--
[4k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8_pippa)|17.2GB|15.7GB|-0.0183|5.7781|4.4221
[4k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5_pippa)|12.6GB|11.0GB|-0.0197|6.0141|4.4252
[2k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8_pippa)|17.2GB|15.7GB|-0.0246?|5.7762|4.4238
[2k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb6_b5_pippa)|12.6GB|11.0GB|-0.0121|6.0823|4.4363
[4k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8)|17.2GB|15.7GB||5.7459|4.4247
[4k_hb6_b6](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b6)|15GB
[4k_hb6_b5](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5)|12.6GB|11.0GB||5.7699|4.4514
[2k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8)|17.2GB|15.7GB||5.7572|4.4242
[2k_hb6_b4.125](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb6_b4.125)

Additional analysis:
- 4k 5bit pippa is substantially worse than 2k 5bit pippa relative to 4k 8bit pippa and 2k 8bit pippa. Larger calibration lengths (or possibly at least more rows) are likely strongly preferable for aggressive quants
- Dropping from 8 to 5 bits results in a 4% increase in wiki perplexity on the pippa calibrated quants, but only 0.2% on the wiki calibrated ones
- Both calibrations increase <0.1% from 8 to 5 bits when tested against pippa chat

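Two of the deltas above can be recomputed directly from the perplexity table; a minimal sketch using the 4k pippa-calibrated rows (the numbers are copied from the table, the helper function is just for illustration):

```python
# Recompute the 8 bit -> 5 bit perplexity deltas quoted above
# from the 4k pippa-calibrated rows of the table.

def pct_increase(base: float, quant: float) -> float:
    """Relative perplexity increase in percent, going from base to quant."""
    return (quant - base) / base * 100

# wiki test: 4k_hb8_b8_pippa -> 4k_hb6_b5_pippa
wiki = pct_increase(5.7781, 6.0141)

# pippa test: 4k_hb8_b8_pippa -> 4k_hb6_b5_pippa
pippa = pct_increase(4.4221, 4.4252)

print(f"wiki test:  +{wiki:.2f}%")   # ~4%, as noted above
print(f"pippa test: +{pippa:.2f}%")  # well under 0.1%
```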
Detractors:
- Real world usage may produce more visible differences than a hundredth of a percent on a small test
- I have not tested increasing row count on the final quant for the same measurement
- It's unclear if or how Δp should be used

With this rather superficial methodology, wikitext with 4k settings seems like the safest general purpose quant speed/quality tradeoff. However, real world usage would probably favor picking datasets more related to the tune; all pippa calibrated sets performed somewhat better on the pippa test than their wiki calibrated counterparts.

All quantizations were calibrated with [wikitext-2](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) unless otherwise specified.

You can run a model calibrated at 2k with a 4k context or vice versa. The actual difference between 2k and 4k calibrations appears to be very small.

VRAM estimates are performed with an extremely long chatlog in [oobabooga webui](https://github.com/oobabooga/text-generation-webui) on a 7900 XTX, using [nvtop](https://github.com/Syllo/nvtop) to monitor **pytorch usage only**. Systems with lots of extra background processes may use more. Additionally, NVIDIA based systems with [flash attention 2](https://github.com/Dao-AILab/flash-attention) **will use less VRAM** than otherwise estimated.

The measurement files are provided in the main branch so you can [make your own quants](https://github.com/turboderp/exllamav2/blob/master/doc/convert.md) at other bit depths without going through the 2-3 hours of measuring.
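A sketch of that workflow, reusing a provided measurement file: the flag names below follow exllamav2's convert.md and may differ between versions (check `python convert.py --help`), and all paths are placeholders.

```shell
# Requantize at a custom bitrate, skipping the ~2-3 hour measurement pass
# by pointing -m at a measurement.json from this repo's main branch.
#   -i  original FP16 model directory
#   -o  scratch/working directory
#   -m  previously saved measurement file
#   -cf output folder for the finished quant
#   -b  target average bits per weight
#   -hb bits for the output head layer
python convert.py -i /path/to/MythoMax-L2-13b -o /path/to/work \
    -m /path/to/measurement.json -cf /path/to/MythoMax-exl2 \
    -b 4.65 -hb 6
```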