
Amethyst 13B Mistral - EXL2 - 8bpw, hb8

Description

  • 8 bits per weight.
  • 8 bits for the lm_head (output) layer of the model, instead of the typical 6.
  • Works fine with 24 GB VRAM and without flash attention v2 under Windows.
  • For me, it runs at about 64% of the 4-bit GPTQ speed.

I converted the model using the convert.py script from the exllamav2 repo:
https://github.com/turboderp/exllamav2
Its documentation:
https://github.com/turboderp/exllamav2/blob/master/doc/convert.md

Measuring the model took 51 minutes; converting it took 18 minutes.
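For reference, a conversion like this one can be reproduced with a command along the following lines. The paths are placeholders, and the flag names are as documented in the convert.md linked above at the time of writing:

```shell
# -i:  input FP16 model directory
# -o:  working directory (holds the measurement results and temp files)
# -cf: output directory for the finished EXL2 model
# -c:  calibration dataset (the WikiText-2-v1 test parquet)
# -b:  target bits per weight
# -hb: bits for the lm_head (output) layer
python convert.py \
    -i /path/to/Amethyst-13B-Mistral \
    -o /path/to/workdir \
    -cf /path/to/Amethyst-13B-Mistral-8bpw-hb8-exl2 \
    -c /path/to/wikitext-2-v1_test.parquet \
    -b 8.0 \
    -hb 8
```

The measurement pass is saved in the working directory, so an interrupted conversion can be resumed without re-measuring.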

I used the WikiText-2-v1 dataset for calibration:
https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet
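If you want the same calibration file, it can be fetched directly; swapping `blob` for `resolve` in the URL above yields the raw parquet:

```shell
# Download the WikiText-2-v1 test split parquet used for calibration
wget "https://huggingface.co/datasets/wikitext/resolve/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet" \
    -O wikitext-2-v1_test.parquet
```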
