diffusiongemma-26B-A4B-it-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, with no PyTorch and no cloud dependency.

OptIQ data-driven mixed-precision quant of Google's DiffusionGemma-26B-A4B-it, a block/masked-diffusion LLM (image-text-to-text), the first diffusion model in the OptIQ lineup.

Instead of quantizing every layer uniformly to 4-bit, OptIQ measures each layer's quantization sensitivity (KL divergence on the denoising-canvas logits) and spends its 8-bit budget where it helps most. At the same ~4.66 bpw as the standard published 4-bit quant, OptIQ shifts the 8-bit budget away from the dense MLP (where the hand-coded recipe puts it) and onto early-layer attention and routers, which the measurement shows are more sensitive.
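
The sensitivity measurement can be pictured with a small sketch. This is not mlx-optiq's implementation: the helper names (log_softmax, mean_kl) and the random arrays standing in for reference and quantized logits are assumptions for illustration only. It computes the mean per-position KL divergence between two sets of logits, the kind of score described above.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(ref_logits, test_logits):
    """Mean KL(ref || test) over positions; inputs have shape (positions, vocab)."""
    ref_logp = log_softmax(ref_logits)
    test_logp = log_softmax(test_logits)
    kl_per_position = (np.exp(ref_logp) * (ref_logp - test_logp)).sum(axis=-1)
    return float(kl_per_position.mean())

# Hypothetical stand-ins: random "reference" logits and two perturbed copies
rng = np.random.default_rng(0)
ref = rng.normal(size=(256, 1000))  # 256 canvas positions, toy vocabulary of 1000
small_noise = ref + 0.01 * rng.normal(size=ref.shape)
large_noise = ref + 0.50 * rng.normal(size=ref.shape)

kl_small = mean_kl(ref, small_noise)
kl_large = mean_kl(ref, large_noise)
print(f"KL, small perturbation: {kl_small:.6f}")
print(f"KL, large perturbation: {kl_large:.6f}")
```

A layer whose quantization moves the logits more produces a larger KL value, which is what makes it a candidate for the 8-bit budget.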

⚠️ Requires mlx-optiq ≥ 0.2.3. DiffusionGemma is not loadable by stock mlx-lm/mlx-vlm; OptIQ ships a vendored, dependency-free decoder for it.

Capability Score

Full 6-metric OptIQ Capability Score (optiq eval --task all --score), vs the published -4bit (mlx-vlm's hand-coded recipe) at equal bpw:

Benchmark	OptIQ-4bit	published-4bit	Δ
MMLU (1000, 5-shot)	47.4	44.5	+2.9
GSM8K (1000)	91.8	91.7	+0.1
IFEval (strict)	69.1	68.9	+0.2
BFCL v3	68.5	68.5	+0.0
HumanEval (pass@1)	75.6	74.4	+1.2
HashHop	7.0	11.0	−4.0
Capability Score	59.90	59.84	+0.07
Disk	14.0 GB	14.5 GB	−0.5 GB

OptIQ matches or beats the hand-tuned recipe on 5 of 6 benchmarks, with clear wins on the non-saturated ones (MMLU +2.9, HumanEval +1.2), while being 0.5 GB smaller. (HashHop is near the floor for both, 7.0 vs 11.0: the fixed 256-token canvas can't do 12k-context retrieval, so the −4.0 is noise on near-zero scores.)
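
As a quick consistency check, the Capability Score row is close to the unweighted mean of the six benchmark rows. The sketch below assumes that is how the score is aggregated (the formula is not stated here) and uses the rounded values from the table, so the last digit can differ slightly from the reported figures.

```python
# Rounded benchmark scores copied from the table above
optiq = {"MMLU": 47.4, "GSM8K": 91.8, "IFEval": 69.1,
         "BFCL v3": 68.5, "HumanEval": 75.6, "HashHop": 7.0}
published = {"MMLU": 44.5, "GSM8K": 91.7, "IFEval": 68.9,
             "BFCL v3": 68.5, "HumanEval": 74.4, "HashHop": 11.0}

def unweighted_mean(scores):
    """Assumed aggregation: plain average over the six benchmarks."""
    return sum(scores.values()) / len(scores)

optiq_score = unweighted_mean(optiq)
published_score = unweighted_mean(published)

# Count benchmarks where OptIQ matches or beats the published quant
wins_or_ties = sum(optiq[name] >= published[name] for name in optiq)

print(f"OptIQ mean:     {optiq_score:.2f}")
print(f"published mean: {published_score:.2f}")
print(f"delta:          {optiq_score - published_score:+.2f}")
print(f"matches or beats on {wins_or_ties} of {len(optiq)}")
```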

Usage

from optiq.vlm.diffusion_gemma import load, generate

model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")

# text
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))

# image + text
from PIL import Image
print(generate(model, tokenizer, "What is in this image?", images=[Image.open("photo.jpg")]))

Best inference config

DiffusionGemma decodes by iteratively un-masking a fixed 256-token canvas. The sampler choice dominates speed:

Sampler	Code workload	Prose workload
entropy-bound (model default)	12.7 tok/s	1.8 tok/s
confidence-threshold (OptIQ default)	58 tok/s	9 tok/s

OptIQ defaults to confidence-threshold (generate(..., sampler="confidence-threshold")), which is 4.6–5× faster than the model's default entropy-bound sampler (58 vs 12.7 tok/s on code, 9 vs 1.8 tok/s on prose) with no quality loss. On code, throughput is comparable to the autoregressive Gemma-4 26B-A4B (~60 tok/s); on prose it is slower, since diffusion's strength is structured, parallel-friendly output.
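
To give a feel for why the sampler affects speed, here is a toy sketch of generic confidence-threshold un-masking. It is not OptIQ's sampler: the MASK sentinel, the random toy_denoiser standing in for the model, the 0.5 threshold, and the fallback rule are all assumptions for illustration. The idea it shows is that every masked position whose top probability clears the threshold is committed in the same step, so the canvas fills in far fewer steps than one token at a time.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for a still-masked canvas position

def toy_denoiser(canvas, rng, vocab=50):
    """Stand-in for the model: random per-position probabilities over a toy vocab."""
    logits = rng.normal(size=(len(canvas), vocab)) * 3.0
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

def confidence_threshold_decode(canvas_len=256, threshold=0.5, seed=0):
    """Fill the canvas, committing all masked positions above the threshold each step."""
    rng = np.random.default_rng(seed)
    canvas = np.full(canvas_len, MASK)
    steps = 0
    while (canvas == MASK).any():
        probs = toy_denoiser(canvas, rng)
        best_token = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)
        masked = canvas == MASK
        accept = masked & (confidence >= threshold)
        if not accept.any():
            # Guarantee progress: commit the single most confident masked position
            idx = np.where(masked)[0]
            accept[idx[confidence[idx].argmax()]] = True
        canvas[accept] = best_token[accept]
        steps += 1
    return canvas, steps

canvas, steps = confidence_threshold_decode()
print(f"filled {len(canvas)} positions in {steps} denoising steps")
```

In a real model the confidences depend on the canvas contents, so the step count varies with the prompt and the type of output.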

LoRA fine-tuning

OptIQ ships a diffusion-native LoRA trainer (the model's denoising objective, not autoregressive cross-entropy):

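from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora
model_path = "mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit"  # assumption: same identifier that load() accepts
# data/train.jsonl holds one {prompt, completion} object per line
train_diffusion_lora(model_path, "data/", "adapter/", rank=8)
# reload the model with the trained adapter applied
model, tok = load_diffusion_lora(model_path, "adapter/")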
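
The trainer call above reads data/train.jsonl with prompt and completion fields. A minimal sketch for writing such a file with the standard library follows. The example records are made up, and anything beyond the two field names (for instance, whether a validation split is expected) is not specified here.

```python
import json
from pathlib import Path

# Made-up training examples; only the two field names come from the text above
records = [
    {"prompt": "Translate to French: good morning", "completion": "bonjour"},
    {"prompt": "Write a Python expression that reverses a string s.", "completion": "s[::-1]"},
]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
train_path = data_dir / "train.jsonl"

# JSON Lines: one JSON object per line
with train_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm every line parses and carries both fields
with train_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(f"wrote {len(loaded)} examples to {train_path}")
```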

Feature support

OptIQ feature	DiffusionGemma
Mixed-precision quant	✅
Text + image generation	✅
LoRA fine-tuning	✅ (diffusion-native denoising loss)
MTP / speculative / assistant draft	N/A (diffusion is not autoregressive; parallel canvas un-masking is the native analog)
KV-cache quant	N/A (fixed 256-token canvas; the cache holds only the prompt)

How it was made

optiq convert measured per-layer KL sensitivity on the masked-diffusion forward pass (uniform 4-bit reference, candidate bits {4, 8}), ran the greedy-knapsack allocator at the published recipe's 8-bit budget, and quantized the model via the OptIQ pipeline. The 27-layer SigLIP vision tower is kept and quantized alongside the language tower.
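
A greedy knapsack allocation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the optiq allocator: the Layer fields, the layer names, the sensitivity numbers, and the budget are all hypothetical. Each layer starts at 4 bits, and layers are upgraded to 8 bits in order of KL reduction per extra bit spent until the budget runs out.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    params: int     # number of weights in the layer
    kl_gain: float  # hypothetical: KL reduction from upgrading this layer 4 -> 8 bits

def upgrade_cost(layer, low=4, high=8):
    """Extra bits needed to store this layer at `high` instead of `low` bits."""
    return layer.params * (high - low)

def greedy_allocate(layers, extra_bit_budget):
    """Upgrade layers by best KL gain per extra bit while they still fit the budget."""
    bits = {layer.name: 4 for layer in layers}
    remaining = extra_bit_budget
    ranked = sorted(layers, key=lambda l: l.kl_gain / upgrade_cost(l), reverse=True)
    for layer in ranked:
        cost = upgrade_cost(layer)
        if cost <= remaining:
            bits[layer.name] = 8
            remaining -= cost
    return bits

# Hypothetical layers and sensitivities, chosen only to exercise the allocator
layers = [
    Layer("layers.0.attn", 1_000_000, kl_gain=0.076),
    Layer("layers.0.router", 50_000, kl_gain=0.029),
    Layer("layers.0.dense_mlp", 4_000_000, kl_gain=0.045),
    Layer("layers.20.attn", 1_000_000, kl_gain=0.008),
]
budget = 5_000_000  # extra bits available beyond uniform 4-bit
allocation = greedy_allocate(layers, budget)
for name, width in allocation.items():
    print(f"{name}: {width}-bit")
```

With these made-up numbers the small router and the early attention layer win the budget, while the large dense MLP has the lowest gain per bit among layers that would fit and stays at 4 bits.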

Built with OptIQ. Vendored DiffusionGemma decoder derived from mlx-vlm (MIT).

Quantize your own

This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune
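
To relate a --target-bpw value to how much of the model ends up at 8 bits, here is a back-of-the-envelope sketch. It assumes every weight is group-quantized with a group size of 64 and one 16-bit scale plus one 16-bit bias per group, and it ignores tensors kept in higher precision, so real fractions will differ. The function names are hypothetical.

```python
def effective_bpw(bits, group_size=64, overhead_bits=32):
    """Bits per weight including an assumed 16-bit scale and 16-bit bias per group."""
    return bits + overhead_bits / group_size

def fraction_at_8bit(target_bpw, low=4, high=8):
    """Share of weights that must be 8-bit for a 4/8 mix to average target_bpw."""
    low_bpw, high_bpw = effective_bpw(low), effective_bpw(high)
    return (target_bpw - low_bpw) / (high_bpw - low_bpw)

for target in (4.66, 5.0):
    share = fraction_at_8bit(target)
    print(f"target {target} bpw -> about {share:.1%} of weights at 8-bit")
```

Under these assumptions, uniform 4-bit already costs 4.5 bpw, so a ~4.66 bpw mix leaves only a small share of weights for 8-bit, which is why where that share is spent matters.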

Model size: 26B params
Tensor types: BF16, U32
Format: Safetensors (MLX)
Base model: google/diffusiongemma-26B-A4B-it