diffusiongemma-26B-A4B-it-mini-g32 — for 16 GB Macs

9.79 GB TurboQuant build of google/diffusiongemma-26B-A4B-it, sized to run on a 16 GB Apple Silicon Mac (Mac mini class), produced with TurboQuant-MLX. Sibling of the higher-fidelity tq3-g32 build, which is 13.8 GB on disk and needs ~18 GB peak memory, so it runs out of memory on 16 GB machines.

Precision layout

| Component | Precision |
|---|---|
| MoE experts, layers 0,1,2,27,28,29 (protected) | tq3, group 32 |
| MoE experts, layers 3–26 | tq2, group 32 |
| Attention q/k/v/o | tq3, group 32 |
| Embeddings, dense per-layer MLP, vision tower | 8-bit affine (g64) |
| Routers, self-conditioning, norms | bf16 |

Why the protection: raw 2-bit experts break arithmetic on this model (17×23 → "3"). Keeping the first/last three layers' experts at 3-bit restores it for +0.2 GB — measured: 17×23 = 391 ✓, multi-step chains correct, in-context recall exact, prose with only minor artifacts.
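The protection scheme above can be sketched as a per-layer bit plan. This is an illustrative calculation only, not the converter's code: the layer count of 30 is an assumption inferred from the layer indices 0 to 29, the function and constant names are hypothetical, and the average counts nominal weight bits without the per-group scale overhead.

```python
# Illustrative sketch: nominal expert bit width per layer for the mini build.
# NUM_LAYERS = 30 is an assumption inferred from layer indices 0..29.
NUM_LAYERS = 30
PROTECTED_LAYERS = {0, 1, 2, 27, 28, 29}


def expert_bits(layer: int, base_bits: int = 2, protect_bits: int = 3) -> int:
    """Return the nominal expert bit width for one layer."""
    return protect_bits if layer in PROTECTED_LAYERS else base_bits


# Map every layer index to its expert bit width.
bit_plan = {layer: expert_bits(layer) for layer in range(NUM_LAYERS)}

# Average nominal bits per expert weight, assuming equally sized layers
# and ignoring group scales and other quantization metadata.
average_bits = sum(bit_plan.values()) / NUM_LAYERS

print(bit_plan)
print(f"average nominal expert bits: {average_bits:.2f}")
```

Under these assumptions, 6 of the 30 layers carry 3-bit experts and the remaining 24 carry 2-bit experts, so the nominal average stays close to 2 bits.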

Measured (Apple Silicon)

| | this repo (mini) | tq3-g32 | mlx-community 4-bit |
|---|---|---|---|
| Size on disk | 9.79 GB | 13.8 GB | 15.0 GB |
| Peak memory (--max-tokens 120) | ~12.4 GB | ~18 GB | ~19 GB |
| Math / recall probes | pass | pass | pass |

Verified on a 16 GB Mac mini (after the wired-limit bump below): 96 tokens at ~1.5 tok/s, 12.9 GB peak, coherent output. Speed knob: --max-denoising-steps 24 is ~2x faster at a mild quality cost.
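To put the throughput figure in wall-clock terms, here is a rough estimate, assuming throughput stays constant over the run and that the ~2x figure for --max-denoising-steps 24 applies directly to tokens per second. The function name is hypothetical and the result is an approximation, not a measurement.

```python
# Rough wall-clock estimate from the figures quoted above.
# Assumes constant throughput; real runs also pay model load time.
def generation_seconds(tokens: int, tok_per_s: float, speedup: float = 1.0) -> float:
    """Estimated generation time in seconds."""
    return tokens / (tok_per_s * speedup)


baseline_s = generation_seconds(96, 1.5)             # default denoising steps
faster_s = generation_seconds(96, 1.5, speedup=2.0)  # assumed ~2x speedup

print(f"default: ~{baseline_s:.0f} s, with the speed knob: ~{faster_s:.0f} s")
```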

On a 16 GB Mac the default Metal working-set limit is 12.1 GB, and the mini build peaks just above it (12.4 GB at --max-tokens 120). Raise the Metal wired limit first (verified working on a 16 GB Mac mini; the setting resets on reboot):

sudo sysctl iogpu.wired_limit_mb=13824

Alternatively keep the canvas (and its activations) smaller with --max-tokens 96.
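The headroom arithmetic behind the wired-limit bump can be checked with a short sketch. It assumes 1024 MB per GB when converting the sysctl value and treats the quoted GB figures as directly comparable to it, which may differ slightly from how the tools report memory. The names are hypothetical.

```python
# Sketch: compare the measured peaks quoted above against the Metal limits.
# Assumption: 1024 MB per GB for the sysctl value.
DEFAULT_LIMIT_GB = 12.1   # default working-set limit on a 16 GB Mac
RAISED_LIMIT_MB = 13824   # value passed to iogpu.wired_limit_mb

raised_limit_gb = RAISED_LIMIT_MB / 1024


def fits(peak_gb: float, limit_gb: float) -> bool:
    """True if the measured peak stays under the given limit."""
    return peak_gb < limit_gb


# 12.4 GB at --max-tokens 120; 12.9 GB in the verified 96-token run.
for peak_gb in (12.4, 12.9):
    print(
        f"peak {peak_gb} GB: "
        f"default limit ok={fits(peak_gb, DEFAULT_LIMIT_GB)}, "
        f"raised limit ok={fits(peak_gb, raised_limit_gb)}"
    )
```

Both quoted peaks exceed the default limit and fall under the raised one, which is why the bump comes before the first run.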

Requirements

pip install "turboquant-mlx-full[vlm]>=0.8.0"

Quick Start

python -m turboquant_mlx.generate_vlm \
    --model manjunathshiva/diffusiongemma-26B-A4B-it-mini-g32 \
    --prompt "Write a short paragraph about the ocean." \
    --max-tokens 120 --temp 0.0

Optional speed knob: --max-denoising-steps 24 (~2x faster, mild quality cost — quantized diffusion needs more denoising iterations than bf16).
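For scripted runs, the same invocation can be assembled from Python. This is a minimal sketch that uses only the command-line flags shown above and does not rely on any Python API of turboquant_mlx. The helper name build_command is hypothetical.

```python
# Sketch: build the Quick Start command line from Python.
# Only flags that appear in this README are used.
import shlex
import sys
from typing import List, Optional

MODEL = "manjunathshiva/diffusiongemma-26B-A4B-it-mini-g32"


def build_command(
    prompt: str,
    max_tokens: int = 120,
    temp: float = 0.0,
    max_denoising_steps: Optional[int] = None,
) -> List[str]:
    """Return the argv list for turboquant_mlx.generate_vlm."""
    cmd = [
        sys.executable, "-m", "turboquant_mlx.generate_vlm",
        "--model", MODEL,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--temp", str(temp),
    ]
    if max_denoising_steps is not None:
        # Optional speed knob described above.
        cmd += ["--max-denoising-steps", str(max_denoising_steps)]
    return cmd


command = build_command(
    "Write a short paragraph about the ocean.", max_denoising_steps=24
)
# Print the command; pass the list to subprocess.run(command, check=True) to execute it.
print(shlex.join(command))
```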

Reproducing the conversion

python -m turboquant_mlx.convert_vlm \
    --hf-path google/diffusiongemma-26B-A4B-it \
    --mlx-path ./diffusiongemma-26B-A4B-it-mini-g32 \
    --bits 2 --attn-bits 3 -g 32 \
    --protect-expert-layers 0,1,2,27,28,29 --protect-bits 3 \
    --quantize-extras

License

Apache 2.0, subject to the Gemma license terms (same as the base model). Quantization tooling: TurboQuant-MLX.

Copyright 2026 Manjunath Janardhan.

Citation

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  year={2025},
  eprint={2504.19874},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.19874}
}