Atomic Chat · DiffusionGemma 26B-A4B (GGUF)

GGUF quantizations of google/diffusiongemma-26B-A4B-it, self-quantized by Atomic Chat from Google's original weights.

This is a discrete diffusion language model: instead of generating one token at a time, it denoises a block of tokens (a "canvas") in parallel, using block-autoregressive multi-canvas sampling. It is also a sparse mixture-of-experts (MoE) model with 25.2B total parameters, of which 3.8B are active (8 of 128 experts).
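
As a rough illustration of the sparsity figures above, the arithmetic below compares the share of experts that are active with the share of parameters that are active. It uses only the numbers quoted in this card. The reading given in the comments (that non-expert weights such as attention and embeddings are always active) is a general property of MoE models and an assumption here, not something this card states.

```shell
# Figures quoted in this card.
total_b=25.2        # total parameters, in billions
active_b=3.8        # active parameters, in billions
experts_total=128
experts_active=8

# Share of experts in use vs. share of parameters in use, as percentages.
expert_share=$(awk -v a="$experts_active" -v t="$experts_total" \
    'BEGIN { printf "%.2f", 100 * a / t }')
param_share=$(awk -v a="$active_b" -v t="$total_b" \
    'BEGIN { printf "%.1f", 100 * a / t }')

# Assumption: the parameter share exceeds the expert share because
# non-expert weights are used on every pass.
echo "experts active:    ${expert_share}%"
echo "parameters active: ${param_share}%"
```

Note that sparsity reduces compute, not file size: all 25.2B parameters still have to be stored in the GGUF file, which is why the sizes below track the total parameter count rather than the active one.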

These files run only with the DiffusionGemma build of llama.cpp, through the dedicated llama-diffusion-cli runner. The standard llama-cli and llama-server, as well as Ollama, LM Studio and Jan, cannot run them yet. Diffusion support lives in an open draft PR (ggml-org/llama.cpp#24423) that has not been merged to master.

Quants

| Quant | Size | Notes |
|---|---|---|
| Q4_K_M | ~16.8 GB | Recommended default. Best size / quality balance. |
| Q5_K_M | ~19.1 GB | Higher quality. |
| Q6_K | ~22.7 GB | Near lossless. |
| Q8_0 | ~26.9 GB | Effectively lossless, reference quality. |

Quantized without an importance matrix (imatrix tooling does not yet cover diffusion decoding), matching the upstream approach.
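
To make the size column above easier to act on, here is a minimal sketch that picks the largest quant fitting a given memory budget. The sizes are copied from the table. The `pick_quant` helper and its 2 GB default headroom for context and compute buffers are hypothetical, chosen only for illustration, and real memory use depends on context length and backend.

```shell
# Approximate file sizes in GB, copied from the table above (largest first).
quant_sizes="Q8_0:26.9 Q6_K:22.7 Q5_K_M:19.1 Q4_K_M:16.8"

# pick_quant BUDGET_GB [HEADROOM_GB]
# Prints the largest quant whose size plus headroom fits the budget,
# or "none" (with exit status 1) if even Q4_K_M does not fit.
# HEADROOM_GB defaults to 2, which is an assumption, not a measured value.
pick_quant() {
    budget=$1
    headroom=${2:-2}
    for entry in $quant_sizes; do
        name=${entry%%:*}
        size=${entry##*:}
        # awk handles the floating-point comparison; exit 0 means "fits".
        if awk -v b="$budget" -v h="$headroom" -v s="$size" \
            'BEGIN { exit !(s + h <= b) }'; then
            echo "$name"
            return 0
        fi
    done
    echo "none"
    return 1
}

pick_quant 24            # prints Q5_K_M
pick_quant 16 || true    # prints none
```

With partial offload through -ngl, a smaller GPU can still be used, so treat "none" as "does not fit entirely" rather than "cannot run".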

How to run

Build the DiffusionGemma branch of llama.cpp:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma
git checkout diffusiongemma
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli

Generate:

./build/bin/llama-diffusion-cli \
    -hf AlexAtomic/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M \
    -p "Explain what a neural network is in two sentences." \
    --diffusion-steps 128 --diffusion-visual

For CPU-only or Metal builds, set -DGGML_CUDA=OFF (or omit the flag). Add -ngl N to offload N layers to the GPU.
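
The backend choice above can be sketched as a small helper that prints the matching configure line. This is an illustration only: `configure_cmd` is a hypothetical name, and the use of -DGGML_METAL is an assumption based on stock llama.cpp build options rather than anything this card confirms for the DiffusionGemma branch.

```shell
# configure_cmd BACKEND
# Prints (does not run) the cmake configure line for the chosen backend.
# GGML_METAL is a stock llama.cpp option; assumed unchanged on this branch.
configure_cmd() {
    case "$1" in
        cuda)  echo "cmake -B build -DGGML_CUDA=ON" ;;
        metal) echo "cmake -B build -DGGML_CUDA=OFF -DGGML_METAL=ON" ;;
        cpu)   echo "cmake -B build -DGGML_CUDA=OFF -DGGML_METAL=OFF" ;;
        *)     echo "unknown backend: $1" >&2; return 1 ;;
    esac
}

configure_cmd cuda
```

The build step (cmake --build ... --target llama-diffusion-cli) is the same for every backend, so only the configure line needs to vary.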

Useful diffusion flags:

  • --diffusion-steps N: number of denoising steps (default 128; fewer steps run faster).
  • --diffusion-eb auto|on|off: entropy-bound decoder tuned for DiffusionGemma.
  • --diffusion-visual: show the canvas filling in progressively.
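
As a sketch of how these flags combine, the helper below assembles a command line from the flags listed above and prints it without running anything, which makes it easy to compare step counts side by side. `diffusion_cmd` is a hypothetical name; the binary path and repo name are taken from the Generate example, and passing --diffusion-eb auto explicitly is a choice made here, not a documented default.

```shell
# diffusion_cmd QUANT STEPS PROMPT
# Prints (does not execute) a llama-diffusion-cli invocation that uses
# only flags mentioned in this card.
diffusion_cmd() {
    quant=$1
    steps=$2
    prompt=$3
    # Reject empty, zero, or non-numeric step counts.
    case "$steps" in
        ''|0|*[!0-9]*) echo "steps must be a positive integer" >&2; return 1 ;;
    esac
    printf '%s -hf %s:%s -p "%s" --diffusion-steps %s --diffusion-eb auto\n' \
        ./build/bin/llama-diffusion-cli \
        AlexAtomic/diffusiongemma-26B-A4B-it-GGUF \
        "$quant" "$prompt" "$steps"
}

# Print the default-step command and a faster, lower-step variant.
for n in 128 64; do
    diffusion_cmd Q4_K_M "$n" "Explain what a neural network is in two sentences."
done
```

Copy a printed line into the shell to run it; append --diffusion-visual to watch the canvas update during denoising.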

Model Overview

| Property | Value |
|---|---|
| Base model | google/diffusiongemma-26B-A4B-it |
| Architecture | diffusion-gemma (DiffusionGemmaForBlockDiffusion) |
| Total parameters | 25.2B |
| Active parameters | 3.8B (8 of 128 experts) |
| Generation | block-autoregressive diffusion (parallel denoising) |
| This repo | GGUF quants for llama-diffusion-cli |

How these were made

  1. Download google/diffusiongemma-26B-A4B-it.
  2. Convert to f16 GGUF with the DiffusionGemma build of llama.cpp.
  3. Verify generation with llama-diffusion-cli.
  4. Quantize the ladder with llama-quantize.
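
The four steps above might look roughly like the dry-run below, which prints each command instead of executing it. The exact commands used for this repo are not recorded here: the local directory, file names, and the convert_hf_to_gguf.py and huggingface-cli invocations are assumptions based on stock llama.cpp and Hugging Face tooling, and may differ on the DiffusionGemma branch.

```shell
SRC_REPO=google/diffusiongemma-26B-A4B-it
SRC_DIR=diffusiongemma-hf                 # hypothetical local directory
F16=diffusiongemma-26B-A4B-it-f16.gguf    # hypothetical file name

# plan: print the pipeline as shell commands, one per line.
plan() {
    # 1. Download the original weights.
    echo "huggingface-cli download $SRC_REPO --local-dir $SRC_DIR"
    # 2. Convert to f16 GGUF (script name as in stock llama.cpp; assumed).
    echo "python convert_hf_to_gguf.py $SRC_DIR --outtype f16 --outfile $F16"
    # 3. Verify generation on the f16 file before quantizing.
    echo "./build/bin/llama-diffusion-cli -m $F16 -p \"Hello\" --diffusion-steps 128"
    # 4. Quantize the ladder, one output file per quant type.
    for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
        echo "./build/bin/llama-quantize $F16 diffusiongemma-26B-A4B-it-$q.gguf $q"
    done
}

plan
```

The build command in "How to run" only builds the llama-diffusion-cli target, so llama-quantize would need to be built as well, for example by adding --target llama-quantize to the cmake --build step.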

License

These weights are derived from Gemma and stay governed by the Gemma Terms of Use. By downloading you agree to those terms. Original model by Google DeepMind. Quantized by Atomic Chat.
