Gemma-3-R1984-27B EXL3
EXL3 quants of VIDraft/Gemma-3-R1984-27B using exllamav3 v0.0.34
KL Divergence vs VRAM
Reference: the 6.0bpw quant. Lower KLD means output distributions closer to the reference. Measured on wikitext-2 (20 rows, 2048-token context); the full curve is in kld_plot.png on the main branch.
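For context, a minimal sketch of how a mean token-level KLD like this is computed; the actual figures below come from exllamav3's measurement tooling, so this is illustrative only:

```python
import torch
import torch.nn.functional as F

def mean_token_kld(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Mean KL(ref || quant) across token positions.

    Both inputs are [tokens, vocab] logits from the reference (6.0bpw)
    and quantized models run on the same inputs.
    """
    ref_lp = F.log_softmax(ref_logits.float(), dim=-1)
    quant_lp = F.log_softmax(quant_logits.float(), dim=-1)
    # KL(P || Q) = sum_i P_i * (log P_i - log Q_i), then average over positions
    kld = (ref_lp.exp() * (ref_lp - quant_lp)).sum(dim=-1)
    return kld.mean().item()
```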
Quants
| Branch | BPW | Head bits | VRAM (GB) | KLD | Type |
|---|---|---|---|---|---|
| 2.0bpw_H6 | 2.0 | 6 | 7.0 | 0.450 | base |
| 2.50bpw_H6 | 2.50 | 6 | 8.5 | 0.389 | optimized |
| 3.0bpw_H6 | 3.0 | 6 | 9.9 | 0.110 | base |
| 3.35bpw_H6 | 3.35 | 6 | 11.0 | 0.088 | optimized |
| 3.49bpw_H6 | 3.49 | 6 | 11.5 | 0.075 | optimized |
| 3.65bpw_H6 | 3.65 | 6 | 12.2 | 0.065 | optimized |
| 4.0bpw_H6 | 4.0 | 6 | 12.9 | 0.039 | base |
| 5.0bpw_H6 | 5.0 | 6 | 15.9 | 0.015 | base |
| 6.0bpw_H6 | 6.0 | 6 | 19.0 | ref | base |
| 7.0bpw_H6 | 7.0 | 6 | ~22 | - | base |
| 8.0bpw_H6 | 8.0 | 6 | ~29 | - | base |
Optimized variants use KLD-guided per-tensor bitrate mixing plus a recompile that pins attention tensors to 5 bpw. Base quants are direct conversions. KLD was not measured for 7.0/8.0bpw (they exceed 32 GB of VRAM).
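To read the table quickly, a small sketch (data copied from above) that picks the lowest-KLD measured quant fitting a given VRAM budget; whether the VRAM column includes the KV cache is not stated here, so leave headroom for cache and activations:

```python
# (branch, VRAM in GB, KLD) for the measured quants above; 6.0bpw is the reference.
QUANTS = [
    ("2.0bpw_H6", 7.0, 0.450), ("2.50bpw_H6", 8.5, 0.389),
    ("3.0bpw_H6", 9.9, 0.110), ("3.35bpw_H6", 11.0, 0.088),
    ("3.49bpw_H6", 11.5, 0.075), ("3.65bpw_H6", 12.2, 0.065),
    ("4.0bpw_H6", 12.9, 0.039), ("5.0bpw_H6", 15.9, 0.015),
    ("6.0bpw_H6", 19.0, 0.0),
]

def best_branch(vram_budget_gb: float) -> str | None:
    """Lowest-KLD branch whose weights fit the budget (None if nothing fits)."""
    fitting = [q for q in QUANTS if q[1] <= vram_budget_gb]
    return min(fitting, key=lambda q: q[2])[0] if fitting else None

print(best_branch(12.0))  # -> "3.49bpw_H6"
```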
Download
```
pip install -U "huggingface_hub[cli]"
```

Download a specific quant (each quant lives on its own branch):

```
huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 --revision "4.0bpw_H6" --local-dir ./
```
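The same download from Python, using huggingface_hub's snapshot_download (the quant branch goes in revision; the local_dir name here is just an example):

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="WeReCooking/Gemma-3-R1984-27B-EXL3",
    revision="4.0bpw_H6",  # each quant lives on its own branch
    local_dir="./Gemma-3-R1984-27B-EXL3-4.0bpw_H6",
)
```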
EXL3 quants run with TabbyAPI or any exllamav3-compatible backend.
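For TabbyAPI, pointing the server at the downloaded folder looks roughly like the fragment below; the key names follow TabbyAPI's sample config.yml and are assumptions here, so check the sample config shipped with your version:

```yaml
# Hypothetical TabbyAPI config.yml fragment (key names assumed from the
# project's sample config; verify against your installed version).
model:
  model_dir: ./models
  model_name: Gemma-3-R1984-27B-EXL3-4.0bpw_H6
```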
Build Details
- Base quants: `convert.py -b <bpw>` for each of 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
- KLD measurement: `measure.py -r <ref> -ms 128 -i <2.0bpw> <8.0bpw>`
- Optimized (2.50, 3.35, 3.49, 3.65): `optimize.py -i <lo> <hi> -m measurement.json -b <target>`, then `recompile.py -or override.yaml` with `*.self_attn.*` pinned to 5 bpw (see the sketch after this list)
- Note: Gemma-3 is dense (no MoE), so `*.shared_experts.*` overrides do not apply. Only optimized variants are recompiled; base quants stay at their exact bpw.
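A guess at the shape of the override.yaml used above; the only fact from this card is the `*.self_attn.*` -> 5 bpw mapping, and the surrounding schema is an assumption, so consult exllamav3's docs for the real format:

```yaml
# Hypothetical override.yaml: pin all attention tensors to 5 bpw on recompile.
# The pattern-to-bpw mapping is from this card; the schema itself is assumed.
"*.self_attn.*": 5
```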
Docs: exllamav3 convert.md
Files
- `main` branch: measurement.json (KLD map) + kld_plot.png
- Each bpw branch: quantized model shards + config + tokenizer