Gemma-3-R1984-27B EXL3

EXL3 quants of VIDraft/Gemma-3-R1984-27B using exllamav3 v0.0.34

KL Divergence vs VRAM

(KLD vs VRAM plot: see kld_plot.png on the main branch)

Reference: 6.0bpw. Lower KLD = closer to reference quality. Measured on wikitext-2 (20 rows, 2048 ctx).
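
For context, the KLD figure is the standard Kullback–Leibler divergence of the quantized model's next-token distribution q from the reference model's distribution p, averaged over the evaluation tokens (the exact averaging performed by measure.py is not shown in this card):

$$
D_{\mathrm{KL}}(p \parallel q) = \sum_{t \in V} p(t)\,\log\frac{p(t)}{q(t)}
$$

where V is the tokenizer vocabulary.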

Quants

| Branch | BPW | Head bits | VRAM (GB) | KLD | Type |
|---|---|---|---|---|---|
| 2.0bpw_H6 | 2.0 | 6 | 7.0 | 0.450 | base |
| 2.50bpw_H6 | 2.50 | 6 | 8.5 | 0.389 | optimized |
| 3.0bpw_H6 | 3.0 | 6 | 9.9 | 0.110 | base |
| 3.35bpw_H6 | 3.35 | 6 | 11.0 | 0.088 | optimized |
| 3.49bpw_H6 | 3.49 | 6 | 11.5 | 0.075 | optimized |
| 3.65bpw_H6 | 3.65 | 6 | 12.2 | 0.065 | optimized |
| 4.0bpw_H6 | 4.0 | 6 | 12.9 | 0.039 | base |
| 5.0bpw_H6 | 5.0 | 6 | 15.9 | 0.015 | base |
| 6.0bpw_H6 | 6.0 | 6 | 19.0 | ref | base |
| 7.0bpw_H6 | 7.0 | 6 | ~22 | - | base |
| 8.0bpw_H6 | 8.0 | 6 | ~29 | - | base |

Optimized variants use KLD-guided tensor mixing plus an attention-at-5bpw recompile; base variants are direct conversions at the stated bitrate. KLD was not measured for 7.0/8.0bpw because they exceed 32 GB of VRAM.

Download

Download commands

Install the CLI:

```bash
pip install -U "huggingface_hub[cli]"
```

Download a specific quant (the --revision value is a branch name from the table above):

```bash
huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 --revision "4.0bpw_H6" --local-dir ./
```

EXL3 quants run with TabbyAPI or any exllamav3-compatible backend.
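
As one possible setup (a sketch, not TabbyAPI's documented layout): download a branch into the backend's models directory and select it by folder name in the server config.

```bash
# Download one quant into a models/ directory. The directory layout and the idea
# of selecting the model by folder name in TabbyAPI's config are assumptions --
# follow your backend's own setup docs.
huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 \
    --revision "4.0bpw_H6" \
    --local-dir ./models/Gemma-3-R1984-27B-EXL3-4.0bpw_H6
```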

Build Details

How these were made

Base quants: `convert.py -b <bpw>` (2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
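
A sketch of that loop over the base bitrates. Only `-b <bpw>` is given above; the -i (source model) and -o (output directory) flags and all paths are assumptions, so check exllamav3's convert.md for the exact interface.

```bash
# Hypothetical base-quant loop; -i/-o flags and paths are placeholders,
# only -b <bpw> comes from the build notes above.
SRC=./Gemma-3-R1984-27B    # original full-precision model (placeholder path)
for bpw in 2.0 3.0 4.0 5.0 6.0 7.0 8.0; do
    python convert.py -i "$SRC" -o "./Gemma-3-R1984-27B-${bpw}bpw_H6" -b "$bpw"
done
```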

KLD measurement: `measure.py -r <ref> -ms 128 -i <2.0bpw> <8.0bpw>`
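
The same step with the placeholders filled in, purely for illustration; the directory names are hypothetical and the flag meanings are read off the command above rather than from the exllamav3 docs.

```bash
# Hypothetical paths: -r = reference quant, -ms = measurement rows,
# -i = low/high quants, as in the command above.
python measure.py \
    -r ./Gemma-3-R1984-27B-6.0bpw_H6 \
    -ms 128 \
    -i ./Gemma-3-R1984-27B-2.0bpw_H6 ./Gemma-3-R1984-27B-8.0bpw_H6
```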

Optimized quants (2.50, 3.35, 3.49, 3.65): `optimize.py -i <lo> <hi> -m measurement.json -b <target>`, then `recompile.py -or override.yaml` with `*.self_attn.*` -> 5bpw

Note: Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable. Only the optimized variants are recompiled; base quants stay at their exact bpw.
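
A minimal sketch of the override-and-recompile step. The YAML schema shown is an assumption (only the `*.self_attn.*` -> 5bpw rule is stated above), and recompile.py also needs its input/output paths, which are omitted here; see the exllamav3 docs for the real format.

```bash
# Assumed override.yaml layout -- only the "*.self_attn.* -> 5bpw" mapping
# comes from the notes above; verify the schema against the exllamav3 docs.
cat > override.yaml <<'EOF'
"*.self_attn.*": 5
EOF

# Recompile the optimized quant with attention tensors forced to 5 bpw
# (input/output arguments omitted).
python recompile.py -or override.yaml
```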

Docs: exllamav3 convert.md

Files

main branch: measurement.json (KLD map) + kld_plot.png

Each bpw branch: quantized model shards + config + tokenizer
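
If you only want the KLD data, the main-branch files can be fetched on their own (filenames as listed above):

```bash
# Fetch just the KLD map and plot from the main branch -- no model weights.
huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 \
    measurement.json kld_plot.png --local-dir ./
```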
