---
language:
- en
---
Quantizations of [Sao10K/Stheno-1.8-L2-13B](https://huggingface.co/Sao10K/Stheno-1.8-L2-13B) in the [EXL2 format](https://github.com/turboderp/exllamav2).

Quant|VRAM @ 4k ctx|VRAM @ 4k ctx, 8-bit cache
-----|-------------|--------------------------
[4k_h8_b8](https://huggingface.co/Beinsezii/Stheno-1.8-L2-13B-EXL2/tree/4k_h8_b8)|17.2GB|15.7GB
[4k_h6_b5](https://huggingface.co/Beinsezii/Stheno-1.8-L2-13B-EXL2/tree/4k_h6_b5)|12.6GB|11.0GB

Breaking down the names:
 - **4k**: calibrated at a context length of 4096 tokens instead of the default 2048
 - **h8**: the head (output) layer is quantized to 8 bits
 - **b8**: the model weights average 8.0 bits each

All quantizations were calibrated with [wikitext-2](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/train) unless otherwise specified.
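
As a rough sketch, the `4k_h8_b8` quant corresponds to a convert.py invocation along these lines. The paths and the parquet filename are placeholders, and the flags come from the exllamav2 convert doc linked above, so verify them against your checkout:

```bash
# How the name maps onto convert.py flags (paths are hypothetical):
#   -l 4096 : 4096-token calibration rows (default 2048) -> "4k"
#   -hb 8   : 8-bit head layer                           -> "h8"
#   -b 8.0  : 8.0 bits per weight on average             -> "b8"
#   -c      : a local parquet copy of wikitext-2 for calibration
python convert.py \
    -i /models/Stheno-1.8-L2-13B \
    -o /tmp/exl2-work \
    -cf /models/Stheno-1.8-L2-13B-4k_h8_b8 \
    -c wikitext-2-v1_train.parquet \
    -l 4096 \
    -hb 8 \
    -b 8.0
```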

Memory estimates were taken with an extremely long chat log in [oobabooga webui](https://github.com/oobabooga/text-generation-webui) on a 7900 XTX, using [nvtop](https://github.com/Syllo/nvtop) to monitor **PyTorch usage only**. Systems with many background processes may use more. NVIDIA-based systems with [flash attention 2](https://github.com/Dao-AILab/flash-attention) **will use less VRAM** than estimated.

The measurement files are provided in the main branch so you can [make your own quants](https://github.com/turboderp/exllamav2/blob/master/doc/convert.md) at other bit depths without repeating the 2-3 hour measurement pass.
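
For example, reusing the prebuilt measurement file to produce a hypothetical 6.0-bpw quant could look like the following. Paths, the parquet filename, and the output name are placeholders:

```bash
# -m loads the measurement.json from this repo's main branch,
# skipping the measurement pass; only the quantization pass runs.
python convert.py \
    -i /models/Stheno-1.8-L2-13B \
    -o /tmp/exl2-work \
    -cf /models/Stheno-1.8-L2-13B-4k_h6_b6 \
    -c wikitext-2-v1_train.parquet \
    -m measurement.json \
    -l 4096 \
    -hb 6 \
    -b 6.0
```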