diffusiongemma-26B-A4B-it-OptiQ-4bit
Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, no PyTorch and no cloud.
This is an OptIQ data-driven mixed-precision quantization of Google's DiffusionGemma-26B-A4B-it, a block/masked-diffusion LLM (image-text-to-text) and the first diffusion model in the OptIQ lineup.
Instead of uniform 4-bit, OptIQ measures each layer's quantization sensitivity (KL on the denoising-canvas logits) and spends an 8-bit budget where it helps most. At the same ~4.66 bpw as the standard published 4-bit, OptIQ shifts the 8-bit budget from the dense-MLP (where the hand-coded recipe puts it) onto early-layer attention + routers (which the measurement shows are more sensitive).
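As a rough illustration of what a KL-based sensitivity measurement can look like, the sketch below compares a reference canvas of logits against a perturbed one and averages the per-position KL divergence. It is a minimal sketch, not OptIQ's code: the helper names, the toy logits, and the choice to average over positions are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for one canvas position's vocabulary logits."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def layer_sensitivity(ref_canvas, test_canvas):
    """Mean per-position KL across a canvas (hypothetical aggregation)."""
    kls = [kl_divergence(r, t) for r, t in zip(ref_canvas, test_canvas)]
    return sum(kls) / len(kls)

# Toy canvas: 3 positions, vocabulary of 4 (made-up numbers).
reference = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.2, 0.3], [1.0, 1.0, 1.0, 1.0]]
perturbed_small = [[2.0, 0.6, 0.1, -1.0], [0.0, 1.4, 0.2, 0.3], [1.0, 1.0, 1.1, 1.0]]
perturbed_large = [[0.5, 2.0, 0.1, -1.0], [1.5, 0.0, 0.2, 0.3], [1.0, 3.0, 1.0, 1.0]]

print("small perturbation:", layer_sensitivity(reference, perturbed_small))
print("large perturbation:", layer_sensitivity(reference, perturbed_large))
```

A layer whose change in precision moves the logits more gets a larger score, which marks it as a better candidate for the 8-bit budget.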
⚠️ Requires mlx-optiq ≥ 0.2.3. DiffusionGemma is not loadable by stock mlx-lm / mlx-vlm; OptIQ ships a vendored, dependency-free decoder for it.
Capability Score
Full 6-metric OptIQ Capability Score (optiq eval --task all --score), vs the published -4bit (mlx-vlm's hand-coded recipe) at equal bpw:
| Benchmark | OptIQ-4bit | published-4bit | Δ |
|---|---|---|---|
| MMLU (1000, 5-shot) | 47.4 | 44.5 | +2.9 |
| GSM8K (1000) | 91.8 | 91.7 | +0.1 |
| IFEval (strict) | 69.1 | 68.9 | +0.2 |
| BFCL v3 | 68.5 | 68.5 | +0.0 |
| HumanEval (pass@1) | 75.6 | 74.4 | +1.2 |
| HashHop | 7.0 | 11.0 | −4.0 |
| Capability Score | 59.90 | 59.84 | +0.07 |
| Disk | 14.0 GB | 14.5 GB | −0.5 GB |
OptIQ matches or beats the hand-tuned recipe on 5 of 6 benchmarks, with clear wins on the non-saturated ones (MMLU +2.9, HumanEval +1.2), while being 0.5 GB smaller. (HashHop is ~0 for both: the fixed 256-token canvas can't do 12k-context retrieval; the −4.0 is noise on near-zero scores.)
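The table's bottom line can be sanity-checked from the rounded per-benchmark figures. The sketch below assumes the Capability Score is an unweighted mean of the six benchmarks; the card does not state the formula, so treat that as an assumption. Under it, the rounded inputs give 59.90 and 59.83, which agrees with the table's 59.90 and 59.84 to within rounding of the inputs.

```python
# Per-benchmark scores copied from the table above.
optiq = {"MMLU": 47.4, "GSM8K": 91.8, "IFEval": 69.1,
         "BFCL v3": 68.5, "HumanEval": 75.6, "HashHop": 7.0}
published = {"MMLU": 44.5, "GSM8K": 91.7, "IFEval": 68.9,
             "BFCL v3": 68.5, "HumanEval": 74.4, "HashHop": 11.0}

def mean_score(scores):
    """Unweighted mean (assumed aggregation, not confirmed by the card)."""
    return sum(scores.values()) / len(scores)

optiq_score = mean_score(optiq)
published_score = mean_score(published)
delta = optiq_score - published_score

# Benchmarks where OptIQ matches or beats the published quant.
at_least_equal = [name for name in optiq if optiq[name] >= published[name]]

print(f"OptIQ {optiq_score:.2f}, published {published_score:.2f}, delta {delta:+.2f}")
print(f"matches or beats on {len(at_least_equal)} of {len(optiq)} benchmarks")
```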
Usage
from optiq.vlm.diffusion_gemma import load, generate
model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")
# text
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))
# image + text
from PIL import Image
print(generate(model, tokenizer, "What is in this image?", images=[Image.open("photo.jpg")]))
Best inference config
DiffusionGemma decodes by iteratively un-masking a fixed 256-token canvas. The sampler choice dominates speed:
| sampler | code | prose |
|---|---|---|
| entropy-bound (model default) | 12.7 tok/s | 1.8 tok/s |
| confidence-threshold (OptIQ default) | 58 tok/s | 9 tok/s |
OptIQ defaults to confidence-threshold (generate(..., sampler="confidence-threshold")), 4.6–5× faster than the model's default, with no quality loss. On code it's comparable to the autoregressive Gemma-4 26B-A4B (~60 tok/s); on prose it's slower (diffusion's strength is structured/parallel-friendly output).
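To make the idea of confidence-thresholded un-masking concrete, here is a toy sketch of the general technique: on each pass, every masked position whose prediction clears a threshold is committed at once, and at least one position is always committed so decoding terminates. This is not OptIQ's sampler. The toy_denoiser stand-in returns random confidences, whereas a real model would supply something like the maximum softmax probability per position, and the 0.9 threshold is an arbitrary illustrative value.

```python
import random

MASK = None  # marker for a still-masked canvas position

def toy_denoiser(canvas, rng):
    """Stand-in for the model: {position: (token, confidence)} for masked slots."""
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(canvas) if t is MASK}

def confidence_threshold_decode(length, threshold=0.9, seed=0):
    """Un-mask a fixed-length canvas, committing confident positions in parallel."""
    rng = random.Random(seed)
    canvas = [MASK] * length
    steps = 0
    while any(t is MASK for t in canvas):
        preds = toy_denoiser(canvas, rng)
        accepted = [i for i, (_, conf) in preds.items() if conf >= threshold]
        if not accepted:
            # Commit the single most confident position to guarantee progress.
            accepted = [max(preds, key=lambda i: preds[i][1])]
        for i in accepted:
            canvas[i] = preds[i][0]
        steps += 1
    return canvas, steps

canvas, steps = confidence_threshold_decode(length=256, threshold=0.9)
print(f"filled {len(canvas)} positions in {steps} denoising passes")
```

The number of passes, not the number of tokens, drives latency here, which is why output the model is confident about in bulk (such as code) benefits most.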
LoRA fine-tuning
OptIQ ships a diffusion-native LoRA trainer (the model's denoising objective, not autoregressive cross-entropy):
from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora
train_diffusion_lora(model_path, "data/", "adapter/", rank=8) # data/train.jsonl: {prompt, completion}
model, tok = load_diffusion_lora(model_path, "adapter/")
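The comment above indicates that the trainer reads data/train.jsonl with prompt and completion fields. A minimal sketch for producing such a file with the standard library follows. The write_jsonl helper and the example rows are hypothetical, and any further fields or files the trainer may accept are not covered here.

```python
import json
from pathlib import Path

def write_jsonl(path, rows):
    """Write one JSON object per line; each row must carry 'prompt' and 'completion'."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            if not {"prompt", "completion"} <= set(row):
                raise ValueError(f"row is missing required keys: {sorted(row)}")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path

# Placeholder training pairs (illustrative only).
rows = [
    {"prompt": "Write a haiku about Apple Silicon.",
     "completion": "Quiet silicon / unified memory hums / tokens bloom at once"},
    {"prompt": "Reverse the string 'mlx'.", "completion": "xlm"},
]

train_file = write_jsonl("data/train.jsonl", rows)
print(train_file.read_text(encoding="utf-8"))
```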
Feature support
| OptIQ feature | DiffusionGemma |
|---|---|
| Mixed-precision quant | ✅ |
| Text + image generation | ✅ |
| LoRA fine-tuning | ✅ (diffusion-native denoising loss) |
| MTP / speculative / assistant draft | N/A (diffusion is not autoregressive; parallel canvas un-masking is the native analog) |
| KV-cache quant | N/A (fixed 256-token canvas; the cache holds only the prompt) |
How it was made
optiq convert measured per-layer KL sensitivity on the masked-diffusion forward (uniform-4 reference, candidate bits {4,8}), ran the greedy-knapsack allocator at the published recipe's 8-bit budget, and quantized via the OptIQ pipeline. The 27-layer SigLIP vision tower is kept and quantized alongside the language tower.
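The allocation step can be pictured as a greedy knapsack: rank layers by sensitivity gained per extra bit spent, then upgrade from 4 to 8 bits until the budget runs out. The sketch below is a generic version of that heuristic under stated assumptions, not the OptIQ allocator. The Layer fields, layer names, parameter counts, KL gains, and budget are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    n_params: int
    kl_gain: float  # hypothetical KL reduction from moving this layer 4 -> 8 bit

def greedy_allocate(layers, budget_bits, low=4, high=8):
    """Upgrade layers to `high` bits in order of KL gain per extra bit spent."""
    bits = {layer.name: low for layer in layers}
    remaining = budget_bits
    ranked = sorted(
        layers,
        key=lambda layer: layer.kl_gain / (layer.n_params * (high - low)),
        reverse=True,
    )
    for layer in ranked:
        cost = layer.n_params * (high - low)  # extra bits needed for the upgrade
        if cost <= remaining:
            bits[layer.name] = high
            remaining -= cost
    return bits

layers = [
    Layer("layers.0.attn", 1_000_000, 0.040),
    Layer("layers.0.router", 50_000, 0.010),
    Layer("layers.0.mlp", 8_000_000, 0.050),
    Layer("layers.20.attn", 1_000_000, 0.002),
]
allocation = greedy_allocate(layers, budget_bits=5_000_000)
for name, b in allocation.items():
    print(f"{name}: {b}-bit")
```

In this toy setup the small router and the early attention layer win the budget because they buy the most KL reduction per bit, while the large MLP stays at 4 bits despite having the largest absolute gain.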
Built with OptIQ. Vendored DiffusionGemma decoder derived from mlx-vlm (MIT).
Quantize your own
This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:
pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab # full local workbench: chat, compare, quantize, fine-tune
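To get a feel for what a target bpw implies, the sketch below does the back-of-envelope arithmetic for a 4/8-bit mix. It assumes MLX-style group quantization with a group size of 64 and a 16-bit scale plus 16-bit bias per group (0.5 bits of overhead per weight), and that every weight is quantized. Real checkpoints also hold unquantized tensors, so these fractions are only indicative.

```python
def bits_per_weight(bits, group_size=64, overhead_bits=32):
    """Effective bits per weight including per-group scale and bias (assumed 16 bits each)."""
    return bits + overhead_bits / group_size

def mixed_bpw(frac_high, low=4, high=8):
    """Average bpw when a fraction `frac_high` of weights uses `high` bits."""
    return (1 - frac_high) * bits_per_weight(low) + frac_high * bits_per_weight(high)

def frac_high_for_target(target_bpw, low=4, high=8):
    """Fraction of weights that must be `high`-bit to reach `target_bpw`."""
    lo_bpw, hi_bpw = bits_per_weight(low), bits_per_weight(high)
    return (target_bpw - lo_bpw) / (hi_bpw - lo_bpw)

for target in (4.66, 5.0):
    frac = frac_high_for_target(target)
    print(f"target {target} bpw -> about {frac:.1%} of weights at 8-bit")
```

Under these assumptions, the ~4.66 bpw of this quant corresponds to roughly 4% of weights at 8-bit, and --target-bpw 5.0 to roughly 12.5%.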
Base model: google/diffusiongemma-26B-A4B-it