gemma-4-26B-A4B-it — Cactus CQ (calibrated)

2-, 3-, and 4-bit quantizations of google/gemma-4-26B-A4B-it in the Cactus .weights format for on-device (ARM) inference.

Method

  • CQ (Cactus codebook quantization): per-group Hadamard rotation, AWQ-style activation scaling (2-/3-bit only), and routing-aware per-expert GPTQ, in which every MoE expert is calibrated with the Hessian of its own routed tokens.
  • Embeddings: CQ4 (orthogonal). Norms / router / biases: FP16.
  • Calibration: ~2M tokens of WildChat + AceCode trajectories generated by the model with thinking enabled.
  • 4-bit uses plain RTN (round-to-nearest, no GPTQ/AWQ): at 4 bits the activation-scaling/GPTQ calibration is net-harmful (a known AWQ effect at higher bit-widths), so RTN is the best-performing 4-bit variant and keeps quality monotonic in bit-width.
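
As a rough illustration of the "per-group Hadamard rotation + RTN" step named above, the sketch below rotates each weight group with an orthonormal Hadamard matrix, quantizes it with a uniform min/max grid, and rotates back. It is not the Cactus implementation: the uniform grid stands in for the actual codebook, the group size is arbitrary, AWQ scaling and GPTQ are omitted, and the function names (hadamard, rtn_quantize, rotate_and_quantize) are hypothetical.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def rtn_quantize(groups: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest on a uniform min/max grid, one grid per row; returns dequantized values."""
    levels = 2**bits - 1
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # Guard against constant rows, where the range would be zero.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((groups - lo) / scale), 0, levels)
    return q * scale + lo


def rotate_and_quantize(weights: np.ndarray, group_size: int, bits: int) -> np.ndarray:
    """Rotate each group, quantize in the rotated basis, then undo the rotation."""
    if weights.size % group_size:
        raise ValueError("weights.size must be a multiple of group_size")
    h = hadamard(group_size)
    groups = weights.reshape(-1, group_size)
    rotated = groups @ h                 # spreads outliers across the group
    dequantized = rtn_quantize(rotated, bits)
    return (dequantized @ h.T).reshape(weights.shape)  # h is orthonormal, so h.T inverts it


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((16, 64))
    for bits in (2, 3, 4):
        err = np.mean((rotate_and_quantize(w, 32, bits) - w) ** 2)
        print(f"{bits}-bit reconstruction MSE: {err:.5f}")
```

Because the rotation is orthogonal, quantization error measured in the rotated basis equals the error after rotating back, which is why it can be applied per group without any extra correction step.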

Quality — held-out completion perplexity (56k answer tokens, ±~2 PPL noise)

Variant                     PPL
bf16 baseline               7.25
2-bit RTN (uncalibrated)    33,827
2-bit calibrated            6.81
3-bit RTN (uncalibrated)    32.56
3-bit calibrated            6.32
4-bit (RTN)                 6.00
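
For readers who want to reproduce a number of this kind, the sketch below shows one common way to compute completion perplexity: exponentiate the mean negative log-likelihood over answer tokens only, with prompt tokens masked out. This is an assumption about what "completion perplexity" means here, not a description of the evaluation script used for the table; the function name and inputs are hypothetical.

```python
import math
from typing import Sequence


def completion_perplexity(token_logprobs: Sequence[float],
                          answer_mask: Sequence[bool]) -> float:
    """Perplexity over answer tokens only.

    token_logprobs: natural-log probability the model assigned to each target token.
    answer_mask:    True where the token belongs to the answer (completion),
                    False for prompt tokens, which are excluded from the score.
    """
    if len(token_logprobs) != len(answer_mask):
        raise ValueError("token_logprobs and answer_mask must have the same length")
    scored = [lp for lp, keep in zip(token_logprobs, answer_mask) if keep]
    if not scored:
        raise ValueError("no answer tokens to score")
    mean_nll = -sum(scored) / len(scored)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Two prompt tokens (masked out) followed by three answer tokens.
    logprobs = [math.log(0.9), math.log(0.8),
                math.log(0.25), math.log(0.25), math.log(0.25)]
    mask = [False, False, True, True, True]
    print(completion_perplexity(logprobs, mask))
```

To aggregate over a whole evaluation set, the log-probabilities of all answer tokens would be pooled before averaging, rather than averaging per-example perplexities.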

Files

  • weights/gemma-4-26b-a4b-it-cq2.zip — 2-bit calibrated (~2.36 bits/weight overall)
  • weights/gemma-4-26b-a4b-it-cq3.zip — 3-bit calibrated
  • weights/gemma-4-26b-a4b-it-cq4.zip — 4-bit RTN
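
The "~2.36 bits/weight overall" figure for the 2-bit file is above 2, which is consistent with the embeddings being stored at 4 bits and the norms, router, and biases in FP16. The sketch below shows how such an overall average can be computed. The per-group metadata term and all numbers in the usage example are made up for illustration and are not taken from the actual file layout.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TensorGroup:
    """One class of tensors stored at a common precision (hypothetical accounting unit)."""
    num_weights: int
    bits: float                       # payload bits per weight
    group_size: int = 0               # 0 means no per-group metadata
    metadata_bits_per_group: int = 0  # e.g. a scale stored with each group (assumed)


def effective_bits_per_weight(parts: Iterable[TensorGroup]) -> float:
    """Total stored bits divided by total weight count."""
    total_bits = 0.0
    total_weights = 0
    for p in parts:
        total_bits += p.num_weights * p.bits
        if p.group_size:
            total_bits += math.ceil(p.num_weights / p.group_size) * p.metadata_bits_per_group
        total_weights += p.num_weights
    if total_weights == 0:
        raise ValueError("no weights given")
    return total_bits / total_weights


if __name__ == "__main__":
    # Illustrative split only; these are NOT the real parameter counts.
    parts = [
        TensorGroup(num_weights=9_000, bits=2, group_size=64, metadata_bits_per_group=16),
        TensorGroup(num_weights=900, bits=4, group_size=64, metadata_bits_per_group=16),
        TensorGroup(num_weights=100, bits=16),
    ]
    print(f"{effective_bits_per_weight(parts):.2f} bits/weight")
```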

Runs on-device via the Cactus runtime (ARM).
