diffusiongemma-26B-A4B-it-mini-g32 — for 16 GB Macs

A 9.79 GB TurboQuant build of google/diffusiongemma-26B-A4B-it, produced with TurboQuant-MLX and sized to run on a 16 GB Apple Silicon Mac (Mac mini class). It is a sibling of the higher-fidelity tq3-g32 build, which is 13.8 GB on disk, needs ~18 GB peak memory, and runs out of memory on 16 GB machines.

Precision layout

Component	Precision
MoE experts, layers 0,1,2,27,28,29 (protected)	tq3, group 32
MoE experts, layers 3–26	tq2, group 32
Attention q/k/v/o	tq3, group 32
Embeddings, dense per-layer MLP, vision tower	8-bit affine (g64)
Routers, self-conditioning, norms	bf16

Why the protection: with every expert at 2 bits, this model fails basic arithmetic (17×23 → "3"). Keeping the experts in the first and last three layers at 3 bits restores it at a cost of +0.2 GB. Measured with the protected layout: 17×23 = 391 ✓, multi-step chains correct, in-context recall exact, prose with only minor artifacts.
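The expert layout above can be summarized as a small lookup. This is a minimal sketch, not TurboQuant-MLX code: `NUM_LAYERS = 30` is an assumption inferred from the layer indices 0 to 29 in the table, and the function names are hypothetical.

```python
NUM_LAYERS = 30   # assumption: inferred from layer indices 0..29 in the table
PROTECT_EDGE = 3  # first and last three layers keep 3-bit experts


def protected_layers(num_layers: int = NUM_LAYERS, edge: int = PROTECT_EDGE) -> list[int]:
    """Return the layer indices whose MoE experts stay at 3 bits."""
    return list(range(edge)) + list(range(num_layers - edge, num_layers))


def expert_bits(layer: int, num_layers: int = NUM_LAYERS, edge: int = PROTECT_EDGE) -> int:
    """Bit width of the MoE experts in a given layer under this layout."""
    if not 0 <= layer < num_layers:
        raise ValueError(f"layer {layer} is outside 0..{num_layers - 1}")
    return 3 if layer in protected_layers(num_layers, edge) else 2


# The comma-separated form matches the list of protected layers in the table.
print(",".join(str(i) for i in protected_layers()))
print([expert_bits(i) for i in range(NUM_LAYERS)])
```

With the defaults, six layers keep 3-bit experts and the remaining 24 use 2-bit experts, which is why the protection adds only a small amount to the file size.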

Measured (Apple Silicon)

Metric	this repo (mini)	tq3-g32	mlx-community 4-bit
Size on disk	9.79 GB	13.8 GB	15.0 GB
Peak memory (`--max-tokens 120`)	~12.4 GB	~18 GB	~19 GB
Math / recall probes	pass	pass	pass

Verified on a 16 GB Mac mini (after the wired-limit bump below): 96 tokens at ~1.5 tok/s, 12.9 GB peak, coherent output. Speed knob: --max-denoising-steps 24 is ~2x faster at a mild quality cost.

On a 16 GB Mac the default Metal working-set limit is ~12.1 GB, and the mini build peaks just above it (~12.4 GB at --max-tokens 120), so raise the Metal wired limit first (verified working on a 16 GB Mac mini; the setting resets on reboot):

sudo sysctl iogpu.wired_limit_mb=13824

Alternatively, keep the canvas (and its activations) smaller with --max-tokens 96.
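The headroom arithmetic behind the wired-limit bump can be checked with a few lines. This is a rough sketch using only the figures quoted in this card. It assumes 1 GB = 1024 MB when converting the sysctl value, and the helper names are hypothetical.

```python
def wired_limit_gb(limit_mb: int) -> float:
    """Convert an iogpu.wired_limit_mb value to GB (assumes 1 GB = 1024 MB)."""
    return limit_mb / 1024


def fits(peak_gb: float, limit_gb: float) -> bool:
    """True if the measured peak stays at or below the limit."""
    return peak_gb <= limit_gb


DEFAULT_LIMIT_GB = 12.1                  # approximate default quoted above
RAISED_LIMIT_GB = wired_limit_gb(13824)  # value passed to sysctl above

# Peak figures are the measurements quoted in this card.
for label, peak_gb in [("--max-tokens 120", 12.4), ("verified 96-token run", 12.9)]:
    print(
        f"{label}: peak {peak_gb} GB, "
        f"fits default limit: {fits(peak_gb, DEFAULT_LIMIT_GB)}, "
        f"fits raised limit: {fits(peak_gb, RAISED_LIMIT_GB)}"
    )
```

Under these assumptions the raised limit works out to 13.5 GB, which covers both measured peaks, while the default limit covers neither.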

Requirements

pip install "turboquant-mlx-full[vlm]>=0.8.0"

Quick Start

python -m turboquant_mlx.generate_vlm \
    --model manjunathshiva/diffusiongemma-26B-A4B-it-mini-g32 \
    --prompt "Write a short paragraph about the ocean." \
    --max-tokens 120 --temp 0.0

Optional speed knob: --max-denoising-steps 24 (~2x faster, mild quality cost — quantized diffusion needs more denoising iterations than bf16).
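For scripted runs, the same command line can be assembled in Python. This is a sketch that only builds and prints the command using the flags shown above. `build_generate_cmd` is a hypothetical helper, not part of TurboQuant-MLX.

```python
import shlex

MODEL = "manjunathshiva/diffusiongemma-26B-A4B-it-mini-g32"


def build_generate_cmd(prompt, max_tokens=120, temp=0.0, max_denoising_steps=None):
    """Build the generate_vlm argument list from the flags used in Quick Start."""
    cmd = [
        "python", "-m", "turboquant_mlx.generate_vlm",
        "--model", MODEL,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--temp", str(temp),
    ]
    if max_denoising_steps is not None:
        # Optional speed knob described above.
        cmd += ["--max-denoising-steps", str(max_denoising_steps)]
    return cmd


cmd = build_generate_cmd(
    "Write a short paragraph about the ocean.",
    max_tokens=96,
    max_denoising_steps=24,
)
print(shlex.join(cmd))  # shell-quoted form, safe to copy into a terminal
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` on a machine with the package from Requirements installed.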

Reproducing the conversion

python -m turboquant_mlx.convert_vlm \
    --hf-path google/diffusiongemma-26B-A4B-it \
    --mlx-path ./diffusiongemma-26B-A4B-it-mini-g32 \
    --bits 2 --attn-bits 3 -g 32 \
    --protect-expert-layers 0,1,2,27,28,29 --protect-bits 3 \
    --quantize-extras

License

Apache 2.0, subject to the Gemma license terms (same as the base model). Quantization tooling: TurboQuant-MLX.

Citation

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  year={2025},
  eprint={2504.19874},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.19874}
}
