---
language:
- en
---

Quantizations for [Gryphe/MythoMax-L2-13b](https://huggingface.co/Gryphe/MythoMax-L2-13b) in the [EXL2 format](https://github.com/turboderp/exllamav2)

Quant|Mem 4k|Mem 4k8|Δp|test wiki|test pippa
----|--|--|--|--|--
[4k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8_pippa)|17.2GB|15.7GB|-0.0183|5.7781|4.4221
[4k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5_pippa)|12.6GB|11.0GB|-0.0197|6.0141|4.4252
[2k_hb8_b8_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8_pippa)|17.2GB|15.7GB|-0.0246?|5.7762|4.4238
[2k_hb6_b5_pippa](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb6_b5_pippa)|12.6GB|11.0GB|-0.0121|6.0823|4.4363
[4k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb8_b8)|17.2GB|15.7GB||5.7459|4.4247
[4k_hb6_b6](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b6)|15GB||||
[4k_hb6_b5](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/4k_hb6_b5)|12.6GB|11.0GB||5.7699|4.4514
[2k_hb8_b8](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb8_b8)|17.2GB|15.7GB||5.7572|4.4242
[2k_hb6_b4.125](https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2/tree/2k_hb6_b4.125)|||||

Breaking down the names:
- **4k** is calibrated with a 4096 token length as opposed to the default 2048
- **hb8** is a head layer bit depth of 8 bits
- **b8** is an average of 8.0 bits per weight across the model
- **pippa** is calibrated with [pippa-llama2-chat](https://huggingface.co/datasets/jasonkstevens/pippa-llama2-chat) instead of wikitext

Additional analysis:
- **Δp**: (quant_perplexity - base_perplexity) as reported in stdout during quant creation. Unsure if useful
- **test wiki**: perplexity reported by `test_inference.py` against wikitext with a 4096 token length
- **test pippa**: perplexity reported by `test_inference.py` against pippa chat with a 4096 token length

(Possible) Takeaways:
- The difference between the 2k and 4k 8-bit calibrations is about 0.01% averaged across both tests for the pippa calibration and 0.1% for the wiki calibration. Wiki 4k was also measured with 4x the row count, which likely accounts for part of the variance.
- Relative to their 8-bit counterparts, 2k 5-bit pippa degrades substantially more than 4k 5-bit pippa. Larger calibration lengths (or possibly just more rows) are likely strongly preferable for aggressive quants.
- Dropping from 8 to 5 bits results in a 4% increase in wiki perplexity for the pippa-calibrated quants but only 0.2% for the wiki-calibrated ones.
- Both calibrations increase by less than 0.1% from 8 to 5 bits when tested against pippa chat.

Detractors:
- Real world usage may produce more visible differences than a hundredth of a percent on a small test
- I have not tested using more rows versus using a greater calibration length
- I have not tested increasing the row count of the final quant for the same measurement
- It's unclear if or how Δp should be used

With this rather superficial methodology, wikitext with 4k settings seems like the safest general-purpose speed/quality tradeoff. However, real world usage would probably favor calibration datasets more closely related to the tune; all pippa-calibrated quants performed somewhat better on the pippa test than their wiki-calibrated counterparts.

All quantizations were calibrated with [wikitext-2](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) unless otherwise specified.

You can run a model calibrated at 2k with a 4k context or vice versa; the actual difference between 2k and 4k calibrations appears to be very small.
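For reference, the "test wiki" / "test pippa" numbers above are ordinary perplexity: the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal PyTorch sketch of that definition (the general formula only, not the actual `test_inference.py` implementation):

```python
import math

import torch
import torch.nn.functional as F


def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean negative log-likelihood of the target tokens).

    logits:  [num_tokens, vocab_size] model outputs for each position
    targets: [num_tokens] the token that actually followed each position
    """
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return math.exp(nll.item())


# Toy example: uniform logits over a 32000-token vocab give a
# perplexity equal to the vocab size.
vocab = 32000
logits = torch.zeros(10, vocab)
targets = torch.randint(0, vocab, (10,))
print(perplexity(logits, targets))  # ~32000.0
```

Lower is better.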
VRAM estimates were taken with an extremely long chat log in [oobabooga webui](https://github.com/oobabooga/text-generation-webui) on a 7900 XTX, using [nvtop](https://github.com/Syllo/nvtop) to monitor **PyTorch usage only**. Systems with lots of extra background processes may use more. Additionally, NVIDIA-based systems with [flash attention 2](https://github.com/Dao-AILab/flash-attention) **will use less VRAM** than estimated here.

The measurement files are provided in the main branch so you can [make your own quants](https://github.com/turboderp/exllamav2/blob/master/doc/convert.md) at other bit depths without going through the 2-3 hours of measuring.
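If you want to load one of these quants directly with the exllamav2 Python API instead of through a frontend, something along these lines should work. This is a sketch adapted from the exllamav2 example scripts; class and method names may differ between library versions, and the model path is a placeholder for a local clone of one of the quant branches.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Placeholder path: a local clone of one of the quant branches, e.g. 4k_hb6_b5
model_dir = "/models/MythoMax-L2-13b-EXL2-4k_hb6_b5"

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
model.load()  # optionally pass a gpu_split list for multi-GPU setups

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("The sky above the port was", settings, 64))
```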