Gemma-3-R1984-27B EXL3

EXL3 quants of VIDraft/Gemma-3-R1984-27B using exllamav3 v0.0.34

KL Divergence vs VRAM

(KLD vs VRAM plot: see kld_plot.png on the main branch)

Reference: 6.0bpw. Lower KLD = closer to reference quality. Measured on wikitext-2 (20 rows, 2048 ctx).
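
For context, the KLD figure is the standard Kullback–Leibler divergence of the quantized model's next-token distribution q from the reference model's distribution p, averaged over the evaluation tokens (the exact averaging performed by measure.py is not shown in this card):

$$
D_{\mathrm{KL}}(p \parallel q) = \sum_{t \in V} p(t)\,\log\frac{p(t)}{q(t)}
$$

where V is the tokenizer vocabulary.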

Quants

| Branch | BPW | Head bits | VRAM (GB) | KLD | Type |
|---|---|---|---|---|---|
| 2.0bpw_H6 | 2.0 | 6 | 7.0 | 0.450 | base |
| 2.50bpw_H6 | 2.50 | 6 | 8.5 | 0.389 | optimized |
| 3.0bpw_H6 | 3.0 | 6 | 9.9 | 0.110 | base |
| 3.35bpw_H6 | 3.35 | 6 | 11.0 | 0.088 | optimized |
| 3.49bpw_H6 | 3.49 | 6 | 11.5 | 0.075 | optimized |
| 3.65bpw_H6 | 3.65 | 6 | 12.2 | 0.065 | optimized |
| 4.0bpw_H6 | 4.0 | 6 | 12.9 | 0.039 | base |
| 5.0bpw_H6 | 5.0 | 6 | 15.9 | 0.015 | base |
| 6.0bpw_H6 | 6.0 | 6 | 19.0 | ref | base |
| 7.0bpw_H6 | 7.0 | 6 | ~22 | - | base |
| 8.0bpw_H6 | 8.0 | 6 | ~29 | - | base |

Optimized variants use KLD-guided tensor mixing plus an attention-at-5bpw recompile; base variants are direct conversions at the stated bitrate. KLD was not measured for 7.0/8.0bpw because they exceed 32 GB of VRAM.

Download

Download commands

Install the CLI:

```bash
pip install -U "huggingface_hub[cli]"
```

Download a specific quant (the --revision value is a branch name from the table above):

```bash
huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 --revision "4.0bpw_H6" --local-dir ./
```

EXL3 quants run with TabbyAPI or any exllamav3-compatible backend.
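
As one possible setup (a sketch, not TabbyAPI's documented layout): download a branch into the backend's models directory and select it by folder name in the server config.

```bash
# Download one quant into a models/ directory. The directory layout and the idea
# of selecting the model by folder name in TabbyAPI's config are assumptions --
# follow your backend's own setup docs.
huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 \
    --revision "4.0bpw_H6" \
    --local-dir ./models/Gemma-3-R1984-27B-EXL3-4.0bpw_H6
```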

Build Details

How these were made

Base quants: `convert.py -b <bpw>` (2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
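
A sketch of that loop over the base bitrates. Only `-b <bpw>` is given above; the -i (source model) and -o (output directory) flags and all paths are assumptions, so check exllamav3's convert.md for the exact interface.

```bash
# Hypothetical base-quant loop; -i/-o flags and paths are placeholders,
# only -b <bpw> comes from the build notes above.
SRC=./Gemma-3-R1984-27B    # original full-precision model (placeholder path)
for bpw in 2.0 3.0 4.0 5.0 6.0 7.0 8.0; do
    python convert.py -i "$SRC" -o "./Gemma-3-R1984-27B-${bpw}bpw_H6" -b "$bpw"
done
```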

KLD measurement: `measure.py -r <ref> -ms 128 -i <2.0bpw> <8.0bpw>`
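
The same step with the placeholders filled in, purely for illustration; the directory names are hypothetical and the flag meanings are read off the command above rather than from the exllamav3 docs.

```bash
# Hypothetical paths: -r = reference quant, -ms = measurement rows,
# -i = low/high quants, as in the command above.
python measure.py \
    -r ./Gemma-3-R1984-27B-6.0bpw_H6 \
    -ms 128 \
    -i ./Gemma-3-R1984-27B-2.0bpw_H6 ./Gemma-3-R1984-27B-8.0bpw_H6
```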

Optimized quants (2.50, 3.35, 3.49, 3.65): `optimize.py -i <lo> <hi> -m measurement.json -b <target>`, then `recompile.py -or override.yaml` with `*.self_attn.*` -> 5bpw

Note: Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable. Only the optimized variants are recompiled; base quants stay at their exact bpw.
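
A minimal sketch of the override-and-recompile step. The YAML schema shown is an assumption (only the `*.self_attn.*` -> 5bpw rule is stated above), and recompile.py also needs its input/output paths, which are omitted here; see the exllamav3 docs for the real format.

```bash
# Assumed override.yaml layout -- only the "*.self_attn.* -> 5bpw" mapping
# comes from the notes above; verify the schema against the exllamav3 docs.
cat > override.yaml <<'EOF'
"*.self_attn.*": 5
EOF

# Recompile the optimized quant with attention tensors forced to 5 bpw
# (input/output arguments omitted).
python recompile.py -or override.yaml
```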

Docs: exllamav3 convert.md

Files

main branch: measurement.json (KLD map) + kld_plot.png

Each bpw branch: quantized model shards + config + tokenizer
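
If you only want the KLD data, the main-branch files can be fetched on their own (filenames as listed above):

```bash
# Fetch just the KLD map and plot from the main branch -- no model weights.
huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 \
    measurement.json kld_plot.png --local-dir ./
```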
