An ExLlamaV3 build of
Qwen/Qwen2.5-0.5B-Instructat 4.5 bits per weight: the quality-leaning sweet spot: comfortable on a single 24 GB consumer GPU, effectively indistinguishable from FP16 on most reasoning tasks. See Quants for sibling repos at other bit‑widths or browse the collection.
Quants
Inference
| Loader | Use it for |
|---|---|
| TabbyAPI | OpenAI‑compatible HTTP server. Drop‑in for OpenAI clients. |
| text‑generation‑webui | Local chat UI. Pick the ExLlamaV3 loader from the model dropdown. |
| ExLlamaV3 | Direct Python API for embedding the model in your own code or pipeline. |
VRAM at 4.5 bpw: weights on disk + ~2 GB context overhead. Comfortable on a single 24 GB card with room for ~16k tokens of context; fits a 16 GB card with a reduced context window.
Download
pip install -U huggingface_hub
hf download \
blockblockblock/Qwen2.5-0.5B-Instruct-exl3-4.5bpw \
--local-dir ./Qwen2.5-0.5B-Instruct-exl3-4.5bpw
Quantization recipe (advanced, embedded in quantization_config.json)
| Setting | Value |
|---|---|
| Format | EXL3 |
| Bits per weight | 4.5 |
| Head bits | 8 |
| Calibration rows | 250 |
| Codebook | MCG |
| Out‑scales | always |
| Parallel mode | enabled |
Loaded automatically by every ExLlamaV3 loader; reproduced here for searchability.
License & use
Use and license follow the base model. Quantization adds no additional restrictions. Refer to the upstream repository for terms, citation, and safety documentation.
Quantized with BlockQuant · convention
{org}/{model}-exl3-{bpw}bpw
- Downloads last month
- 118
Model tree for blockblockblock/Qwen2.5-0.5B-Instruct-exl3-4.5bpw
Collection including blockblockblock/Qwen2.5-0.5B-Instruct-exl3-4.5bpw
Collection
EXL3 quants of Qwen2.5-0.5B-Instruct, produced by BlockQuant. • 3 items • Updated