diffusiongemma-26B-A4B-it-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, with no PyTorch and no cloud.

An OptIQ data-driven mixed-precision quant of Google's DiffusionGemma-26B-A4B-it, a block/masked-diffusion LLM (image-text-to-text) and the first diffusion model in the OptIQ lineup.

Instead of uniform 4-bit, OptIQ measures each layer's quantization sensitivity (KL divergence on the denoising-canvas logits) and spends its 8-bit budget where it helps most. At the same ~4.66 bpw as the standard published 4-bit quant, OptIQ shifts the 8-bit budget away from the dense MLP (where the hand-coded recipe puts it) and onto early-layer attention and routers (which the measurement shows are more sensitive).
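
As a rough illustration of the sensitivity measurement, the sketch below computes the KL divergence between a reference model's logits and the logits obtained after perturbing one layer, averaged over canvas positions. The function names and the toy logits are hypothetical, and this is not mlx-optiq's implementation; it only shows the kind of quantity being compared.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) for a single canvas position."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def layer_sensitivity(ref_canvas, test_canvas):
    """Mean KL over all canvas positions; a higher value means a more sensitive layer."""
    kls = [kl_divergence(r, t) for r, t in zip(ref_canvas, test_canvas)]
    return sum(kls) / len(kls)

# Toy data (made-up numbers): 2 canvas positions, vocabulary of 3.
reference = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
mild_perturbation = [[2.0, 0.6, -1.0], [0.1, 0.2, 2.9]]
strong_perturbation = [[1.0, 1.5, -1.0], [1.0, 0.2, 2.0]]

print("mild:  ", layer_sensitivity(reference, mild_perturbation))
print("strong:", layer_sensitivity(reference, strong_perturbation))
```

A layer whose quantization produces the larger value would be the better candidate for 8-bit precision.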

⚠️ Requires mlx-optiq ≥ 0.2.3. DiffusionGemma is not loadable by stock mlx-lm/mlx-vlm; OptIQ ships a vendored, dependency-free decoder for it.
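
A minimal sketch for checking the installed version before loading, using only the standard library. It assumes the distribution is registered under the name `mlx-optiq` (the name used with `pip install`); the helper functions are illustrative, and the simple parser ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (0, 2, 3)

def parse_version(text):
    """Parse the leading numeric components, e.g. '0.2.3' -> (0, 2, 3)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum=MIN_VERSION):
    """True if the installed version string is at least the minimum tuple."""
    return parse_version(installed) >= minimum

try:
    installed = version("mlx-optiq")  # assumed distribution name
    status = "ok" if meets_minimum(installed) else "too old, upgrade"
    print(f"mlx-optiq {installed}: {status}")
except PackageNotFoundError:
    print("mlx-optiq is not installed")
```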

Capability Score

Full 6-metric OptIQ Capability Score (optiq eval --task all --score), vs the published -4bit (mlx-vlm's hand-coded recipe) at equal bpw:

| Benchmark | OptIQ-4bit | published-4bit | Δ |
|---|---|---|---|
| MMLU (1000, 5-shot) | 47.4 | 44.5 | +2.9 |
| GSM8K (1000) | 91.8 | 91.7 | +0.1 |
| IFEval (strict) | 69.1 | 68.9 | +0.2 |
| BFCL v3 | 68.5 | 68.5 | +0.0 |
| HumanEval (pass@1) | 75.6 | 74.4 | +1.2 |
| HashHop | 7.0 | 11.0 | −4.0 |
| Capability Score | 59.90 | 59.84 | +0.07 |
| Disk | 14.0 GB | 14.5 GB | −0.5 GB |

OptIQ matches or beats the hand-tuned recipe on 5 of 6 benchmarks, with clear wins on the non-saturated ones (MMLU +2.9, HumanEval +1.2), while being 0.5 GB smaller. (HashHop is near the floor for both, 7.0 vs 11.0: the fixed 256-token canvas can't do 12k-context retrieval, so the −4.0 is most likely noise on very low scores.)
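
The card does not state how the Capability Score aggregates the six metrics. One assumption that reproduces the OptIQ figure is an unweighted mean, which the sketch below checks along with the 5-of-6 count. Under this assumption the published column comes out at 59.83 rather than 59.84, which may simply reflect rounding in the per-benchmark numbers shown.

```python
# Per-benchmark scores copied from the table above.
optiq = {
    "MMLU": 47.4, "GSM8K": 91.8, "IFEval": 69.1,
    "BFCL v3": 68.5, "HumanEval": 75.6, "HashHop": 7.0,
}
published = {
    "MMLU": 44.5, "GSM8K": 91.7, "IFEval": 68.9,
    "BFCL v3": 68.5, "HumanEval": 74.4, "HashHop": 11.0,
}

def unweighted_mean(scores):
    """Assumed aggregation: plain average of the six metrics."""
    return sum(scores.values()) / len(scores)

def wins_or_ties(a, b):
    """Number of benchmarks where a matches or beats b."""
    return sum(1 for name in a if a[name] >= b[name])

print(f"OptIQ mean:     {unweighted_mean(optiq):.2f}")
print(f"published mean: {unweighted_mean(published):.2f}")
print(f"OptIQ matches or beats published on {wins_or_ties(optiq, published)} of {len(optiq)}")
```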

Usage

from optiq.vlm.diffusion_gemma import load, generate

model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")

# text
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))

# image + text
from PIL import Image
print(generate(model, tokenizer, "What is in this image?", images=[Image.open("photo.jpg")]))

Best inference config

DiffusionGemma decodes by iteratively un-masking a fixed 256-token canvas. The sampler choice dominates speed:

| Sampler | Code | Prose |
|---|---|---|
| entropy-bound (model default) | 12.7 tok/s | 1.8 tok/s |
| confidence-threshold (OptIQ default) | 58 tok/s | 9 tok/s |

OptIQ defaults to confidence-threshold (generate(..., sampler="confidence-threshold")), 4.6–5× faster than the model's default, with no quality loss. On code it's comparable to the autoregressive Gemma-4 26B-A4B (~60 tok/s); on prose it's slower (diffusion's strength is structured/parallel-friendly output).
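
To make the un-masking loop concrete, here is a toy sketch of confidence-threshold decoding over a fixed canvas. The scorer `toy_predict`, the threshold value, and the fallback rule are hypothetical stand-ins, not the sampler shipped in mlx-optiq. Each step commits every masked position whose prediction clears the threshold, and commits the single most confident position if none does, so the loop always terminates.

```python
import random

MASK = None  # placeholder for a still-masked canvas position

def toy_predict(canvas, position, rng):
    """Hypothetical stand-in for the model: returns (token, confidence)."""
    # Confidence has a random part plus a part that grows as the canvas fills in.
    filled = sum(tok is not MASK for tok in canvas) / len(canvas)
    confidence = min(1.0, 0.5 * rng.random() + 0.5 * filled + 0.3)
    return f"tok{position}", confidence

def unmask(canvas_size, threshold, seed=0):
    """Iteratively fill a fully masked canvas; returns (canvas, steps)."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_size
    steps = 0
    while any(tok is MASK for tok in canvas):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        preds = {i: toy_predict(canvas, i, rng) for i in masked}
        # Commit all positions that clear the threshold in parallel.
        accepted = [i for i in masked if preds[i][1] >= threshold]
        if not accepted:
            # Fallback: commit the single most confident position.
            accepted = [max(masked, key=lambda i: preds[i][1])]
        for i in accepted:
            canvas[i] = preds[i][0]
        steps += 1
    return canvas, steps

canvas, steps = unmask(canvas_size=256, threshold=0.7)
print(f"filled {len(canvas)} positions in {steps} steps")
```

In this toy, a threshold that is never met degenerates to one token per step, while a lower threshold commits many positions per step.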

LoRA fine-tuning

OptIQ ships a diffusion-native LoRA trainer (the model's denoising objective, not autoregressive cross-entropy):

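from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora
# Assumed here: model_path is the same id passed to load() in the Usage section.
model_path = "mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit"
train_diffusion_lora(model_path, "data/", "adapter/", rank=8)   # data/train.jsonl: {prompt, completion}
model, tok = load_diffusion_lora(model_path, "adapter/")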
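
The trainer call above expects `data/train.jsonl` with `prompt` and `completion` fields. A minimal sketch for writing such a file with the standard library, assuming one JSON object per line (the usual JSONL convention) and using made-up example records:

```python
import json
from pathlib import Path

# Made-up training records with the two fields named in the trainer comment.
examples = [
    {"prompt": "Translate to French: good morning", "completion": "bonjour"},
    {
        "prompt": "Write a haiku about Apple Silicon.",
        "completion": "Quiet silicon\nunified memory hums\nmodels run at home",
    },
]

def write_jsonl(records, path):
    """Write one JSON object per line, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return path

out = write_jsonl(examples, "data/train.jsonl")
print(f"wrote {len(examples)} records to {out}")
```

Newlines inside a completion are escaped by `json.dumps`, so each record still occupies exactly one line.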

Feature support

| OptIQ feature | DiffusionGemma |
|---|---|
| Mixed-precision quant | ✅ |
| Text + image generation | ✅ |
| LoRA fine-tuning | ✅ (diffusion-native denoising loss) |
| MTP / speculative / assistant draft | N/A (diffusion is not autoregressive; parallel canvas un-masking is the native analog) |
| KV-cache quant | N/A (fixed 256-token canvas; the cache holds only the prompt) |

How it was made

optiq convert measured per-layer KL sensitivity on the masked-diffusion forward (uniform-4 reference, candidate bits {4,8}), ran the greedy-knapsack allocator at the published recipe's 8-bit budget, and quantized via the OptIQ pipeline. The 27-layer SigLIP vision tower is kept and quantized alongside the language tower.
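
A minimal sketch of a greedy knapsack allocation of the kind described above, under stated assumptions: each layer has a measured KL reduction from upgrading 4-bit to 8-bit and a cost in extra bits, and layers are upgraded in order of gain per extra bit until the budget runs out. The layer names, gains, and sizes are invented, and mlx-optiq's allocator may differ in its details.

```python
def allocate_bits(layers, budget_bits, low=4, high=8):
    """Greedy allocation.

    layers: list of (name, num_weights, kl_gain_if_upgraded)
    budget_bits: total extra bits available for upgrades from low to high
    """
    bits = {name: low for name, _, _ in layers}
    # Rank layers by KL gain per extra bit spent, best first.
    ranked = sorted(
        layers,
        key=lambda layer: layer[2] / (layer[1] * (high - low)),
        reverse=True,
    )
    remaining = budget_bits
    for name, num_weights, gain in ranked:
        cost = num_weights * (high - low)
        if gain > 0 and cost <= remaining:
            bits[name] = high
            remaining -= cost
    return bits

# Invented example: (name, number of weights, KL reduction if moved to 8-bit).
layers = [
    ("layers.0.attn.q_proj", 1_000_000, 0.040),
    ("layers.0.router", 50_000, 0.010),
    ("layers.0.mlp.dense", 4_000_000, 0.020),
    ("layers.20.attn.q_proj", 1_000_000, 0.002),
]
budget = 5_000_000  # extra bits available

for name, width in allocate_bits(layers, budget).items():
    print(f"{name}: {width}-bit")
```

With these invented numbers the small router and the early attention projection are upgraded first, and the large dense MLP no longer fits in the remaining budget.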

Built with OptIQ. Vendored DiffusionGemma decoder derived from mlx-vlm (MIT).

Quantize your own

This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune