---
language:
- en
---
Quantizations for [PygmalionAI/mythalion-13b](https://huggingface.co/PygmalionAI/mythalion-13b) in the [EXL2 format](https://github.com/turboderp/exllamav2).

Quant|VRAM estimate|Additional
---|---|---
[4k_hb8_b8](https://huggingface.co/Beinsezii/Mythalion-13b-EXL2/tree/4k_hb8_b8)|18GB|Recommended!
[4k_hb6_b6](https://huggingface.co/Beinsezii/Mythalion-13b-EXL2/tree/4k_hb6_b6)|15GB|
[4k_hb6_b5](https://huggingface.co/Beinsezii/Mythalion-13b-EXL2/tree/4k_hb6_b5)|13GB|Should fit on 12GB cards with 2k context
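
Each quant lives on its own branch, so grab the branch you want and point exllamav2 at the resulting folder. Below is a minimal loading sketch, assuming the `huggingface_hub` package and exllamav2's Python API; class and method names follow exllamav2's bundled example scripts and may differ between versions.

```python
# Sketch only: download one quant branch and run it through exllamav2's basic generator.
# Assumes `pip install huggingface_hub exllamav2`; API names may vary between exllamav2 releases.
from huggingface_hub import snapshot_download
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Each quant is a branch (revision) of this repo.
model_dir = snapshot_download("Beinsezii/Mythalion-13b-EXL2", revision="4k_hb6_b6")

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()
config.max_seq_len = 4096  # inference context is independent of the calibration context

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Once upon a time,", settings, 200))
```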

Breaking down the names:
- **4k** is calibrated with 4096 context @ 82 rows (the maximum for wikitext), as opposed to the default 2048 context @ 100 rows.
- **hb8** is a head (output layer) bit depth of 8 bits.
- **b8** is an average of 8.0 bits per model weight.

All quantizations were calibrated with [wikitext-2](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test).

You can run a model calibrated at 2k with a 4k context, or vice versa; in practice the difference between the 2k and 4k calibrations appears to be very small.

VRAM estimates were measured with an extremely long chat log in [oobabooga webui](https://github.com/oobabooga/text-generation-webui) on a 7900 XTX, using [nvtop](https://github.com/Syllo/nvtop) to monitor **PyTorch usage only**, rounded up. Systems with lots of extra background processes may use more. Additionally, NVIDIA-based systems with [flash attention 2](https://github.com/Dao-AILab/flash-attention) **will use less VRAM** than otherwise estimated.

The measurement files are provided in the main branch so you can [make your own quants](https://github.com/turboderp/exllamav2/blob/master/doc/convert.md) at other bit depths without going through the 2-3 hours of measuring.
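
As a rough sketch of what that looks like, assuming placeholder paths and the `convert.py` flags described in the conversion guide linked above (flag names may change between exllamav2 versions, so check `convert.md` for the current interface):

```python
# Sketch only: quantize to a new average bit depth, reusing a downloaded measurement file.
# All paths are placeholders; flag names follow exllamav2's convert.md and may differ by version.
import subprocess

subprocess.run(
    [
        "python", "convert.py",                   # run from an exllamav2 checkout
        "-i", "/models/mythalion-13b",            # original fp16 PygmalionAI/mythalion-13b
        "-o", "/tmp/exl2-work",                   # scratch / working directory
        "-cf", "/models/mythalion-13b-exl2-4.5",  # where the finished quant is written
        "-m", "/models/measurement.json",         # measurement file taken from this repo's main branch
        "-b", "4.5",                              # target average bits per weight
        "-hb", "6",                               # head layer bits
    ],
    check=True,
)
```

Supplying the measurement file skips the measuring pass; depending on your exllamav2 version you may still need the calibration dataset options (`-c`, `-l`, `-r`) for the quantization pass itself, as described in `convert.md`.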