DiffusionGemma 26B-A4B Tool-Selector LoRA (MLX)

A QLoRA adapter for google/diffusiongemma-26B-A4B-it (via the mlx-community 4-bit conversion) that selects which tools a coding agent will need for a task, trained entirely on Apple Silicon with a custom block-diffusion trainer.

To our knowledge, this is the first DiffusionGemma fine-tune trained on Apple Silicon with MLX. Concurrent work fine-tuned DiffusionGemma on CUDA (e.g. micic-mihajlo/diffusiongemma-social-writer-lora and alimpfard/diffusiongemma-ft-grammar, both via Unsloth/Transformers). Before this release, no diffusion-aware trainer existed in the MLX ecosystem: mlx-lm has no support for the architecture, and mlx-vlm (>= 0.6.3) supports it for inference only, because its SFT trainer computes autoregressive next-token loss, which is the wrong objective for a block-diffusion model. The trainer in code/ implements the denoising objective and is released so others can fine-tune DiffusionGemma on a Mac.

Training recipe (verified against primary sources)

The recipe was established from Google's JAX reference (hackable_diffusion adapter in the gemma repo), NVIDIA NeMo Automodel's dLLM recipes, and Unsloth's notebook:

  • Corruption. D3PM-uniform, not mask-based. One t ~ U(0.001, 1) per example; each canvas position is independently replaced with probability t by a token drawn uniformly from the 262,144-token vocab. The tokenizer's <mask> token is never used; this matches inference, which initializes/renoises the canvas with uniform-random tokens.
  • Canvas. Response-relative 256-token grid. Responses end with the <turn|> terminator and are EOS-filled (<eos>, id 1) to the canvas boundary; the fill is supervised and attended, so the model learns termination.
  • Loss. Flat unweighted cross-entropy over ALL canvas positions, corrupted and uncorrupted alike (Google's NoWeightDiscreteLoss; the corrupted-only variant was a documented NeMo bug). No 1/t weighting: that is the absorbing-kernel ELBO weight and does not apply to the uniform kernel. Logits via the tied embedding (embed_tokens.as_linear) with fp32 soft-cap 30.
  • LoRA targets (NeMo parity). r=16, alpha=32 (scale 2.0) on self_attn.{q,k,v,o}_proj (v_proj exists only on the sliding-attention layers) + dense mlp.{gate,up,down}_proj. MoE experts, router, and embeddings frozen. 205 wrapped layers, 18.6M trainable params. QLoRA: the base stays 4-bit quantized (LoRALinear.from_base is QuantizedLinear-aware).
  • Optimizer. AdamW (bias-corrected, torch-parity) lr 1.5e-4 -> cosine to 1.5e-5, 25-step warmup, betas (0.95, 0.99), wd 1e-4, grad-norm clip 1.0, micro-batch 1 x grad-accum 8, 250 steps (~3.5 epochs of the deduplicated training set).
  • v1 simplifications (Unsloth-proven): no co-trained encoder AR loss, no self-conditioning passes.
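
The corruption, canvas, and loss bullets above can be illustrated with a small pure-Python sketch. It is not the trainer in code/ (which runs on MLX tensors); the function names are hypothetical, and the soft-cap form cap * tanh(x / cap) is assumed from the usual Gemma convention rather than taken from this repository.

```python
import math
import random

VOCAB_SIZE = 262_144  # vocabulary size quoted in the recipe
CANVAS_LEN = 256      # response-relative canvas length
EOS_ID = 1            # <eos> id quoted in the recipe


def build_canvas(response_ids, canvas_len=CANVAS_LEN, eos_id=EOS_ID):
    """Pad a tokenized response (already ending in <turn|>) with <eos> to the canvas edge."""
    if len(response_ids) > canvas_len:
        raise ValueError("response is longer than the canvas")
    return list(response_ids) + [eos_id] * (canvas_len - len(response_ids))


def corrupt_uniform(canvas, rng, t_min=0.001, vocab_size=VOCAB_SIZE):
    """Draw one t ~ U(t_min, 1); resample each position uniformly with probability t."""
    t = rng.uniform(t_min, 1.0)
    noisy = [rng.randrange(vocab_size) if rng.random() < t else tok for tok in canvas]
    return noisy, t


def soft_cap(x, cap=30.0):
    """Assumed soft-cap form: squashes a logit smoothly into (-cap, cap)."""
    return cap * math.tanh(x / cap)


def flat_cross_entropy(logits, targets):
    """Unweighted mean cross-entropy over every canvas position (no 1/t factor)."""
    total = 0.0
    for row, target in zip(logits, targets):
        capped = [soft_cap(x) for x in row]
        peak = max(capped)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in capped))
        total += log_z - capped[target]
    return total / len(targets)
```

The targets are always the clean canvas, so positions that the corruption step left untouched still contribute to the loss, as do the <eos> fill positions.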

Results

Tool-selection benchmark: 124 held-out agent-trace samples (mean 2.97 tools/sample, 35 distinct tools), multi-label, greedy decoding (temperature 0), all rows scored by the identical harness on the identical split.

Benchmark hygiene. Our original dataset was 96% train/test contaminated (a trivial train-lookup baseline scored Jaccard 0.959 on it). We rebuilt it before training: exact-duplicate (prompt, response) pairs collapsed (3,935 rows -> 823 distinct), splits made group-aware by task identity so near-duplicate prompts from the same workflow cannot straddle splits, zero pair- and zero prompt-overlap programmatically asserted, split hashes frozen in the dataset manifest. On the clean test set the train-lookup baseline gets 0/124 exact-prompt hits.
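
A minimal sketch of the rebuild steps described above: deduplicate exact pairs, assign whole task groups to a split, then assert zero overlap. The task_id field, the hash-based bucketing, and the split percentages are assumptions for illustration; the actual dataset builder may differ.

```python
import hashlib


def dedupe_pairs(rows):
    """Collapse exact-duplicate (prompt, response) pairs, keeping first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = (row["prompt"], row["response"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def bucket(group_id, test_pct=15, valid_pct=15):
    """Deterministically map a task group to a split via a stable hash."""
    h = int(hashlib.sha256(group_id.encode("utf-8")).hexdigest(), 16) % 100
    if h < test_pct:
        return "test"
    if h < test_pct + valid_pct:
        return "valid"
    return "train"


def group_split(rows, group_key="task_id"):
    """Split deduplicated rows so that no task group straddles two splits."""
    splits = {"train": [], "valid": [], "test": []}
    for row in dedupe_pairs(rows):
        splits[bucket(row[group_key])].append(row)
    return splits


def assert_no_overlap(splits):
    """Fail loudly if any held-out prompt or pair also appears in train."""
    train_prompts = {r["prompt"] for r in splits["train"]}
    train_pairs = {(r["prompt"], r["response"]) for r in splits["train"]}
    for name in ("valid", "test"):
        for r in splits[name]:
            assert r["prompt"] not in train_prompts, f"prompt leak into {name}"
            assert (r["prompt"], r["response"]) not in train_pairs, f"pair leak into {name}"
```

Grouping alone does not rule out the same prompt appearing under two different task groups, which is why the overlap check runs as a separate assertion after the split.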

| Run                                        | Jaccard | Exact set | Precision | Recall | Top-1 |
|--------------------------------------------|---------|-----------|-----------|--------|-------|
| Frequency top-3 floor                      | 0.474   | 0.113     | 0.600     | 0.631  | 0.750 |
| DiffusionGemma zero-shot                   | 0.073   | 0.008     | 0.129     | 0.094  | 0.097 |
| DiffusionGemma + this LoRA                 | 0.447   | 0.105     | 0.566     | 0.607  | 0.750 |
| Gemma-4-26B-A4B-it (AR sibling, zero-shot) | 0.070   | 0.008     | 0.126     | 0.113  | 0.113 |
| Qwen3.6-35B-A3B (zero-shot)                | 0.138   | 0.008     | 0.217     | 0.212  | 0.250 |
| Qwen3.6-27B (zero-shot)                    | 0.197   | 0.008     | 0.323     | 0.280  | 0.331 |

This adapter beats every zero-shot model row on exact tool-name Jaccard (and on the other four metrics), but it does not clear the frequency floor: it ties the floor on top-1 (0.750) and trails it slightly on Jaccard (0.447 vs 0.474), exact set, precision, and recall. That negative is part of the result: on this trace-derived task, a static head-tool prior is brutally strong, and the follow-up paper reports retrieval and head/tail analyses rather than pretending the LoRA is a decisive floor-beater.

Reading the table: the frequency floor always predicts the three most common training tools (Bash, Read, Edit) with no oracle access to the true count; beating it is the minimum bar for the adapter to be interesting. The AR sibling is the cleanest scientific control: the same Gemma-4 26B-A4B backbone without the diffusion decoder, zero-shot on the identical harness (chat template applied per-model via its own tokenizer; identical test rows, parser, and metrics; see code/ar_eval.py). The Qwen3.6 rows test whether a 0.072%-of-weights specialist fine-tune beats newer, larger generalists (the xLAM/ToolACE pattern). Zero-shot rows use each model's own chat template; all rows share decoding budget (96 tokens, greedy).
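
For readers who want to reproduce the style of scoring, here is a hedged sketch of a frequency top-k floor and the five per-sample metrics. It is not code/diffusion_eval.py: the definition of top-1 (first predicted tool is in the gold set) and the per-sample averaging are assumptions.

```python
from collections import Counter


def frequency_top_k(train_label_sets, k=3):
    """Return the k tools that appear in the most training samples."""
    counts = Counter(tool for labels in train_label_sets for tool in set(labels))
    return [tool for tool, _ in counts.most_common(k)]


def score(pred, gold):
    """Per-sample multi-label metrics for an ordered prediction list."""
    pred_set, gold_set = set(pred), set(gold)
    hit = len(pred_set & gold_set)
    union = len(pred_set | gold_set)
    return {
        "jaccard": hit / union if union else 1.0,
        "exact": float(pred_set == gold_set),
        "precision": hit / len(pred_set) if pred_set else 0.0,
        "recall": hit / len(gold_set) if gold_set else 0.0,
        # Assumed top-1 definition: the first predicted tool is a gold tool.
        "top1": float(bool(pred) and pred[0] in gold_set),
    }


def macro_average(preds, golds):
    """Average each metric over samples (assumed aggregation)."""
    rows = [score(p, g) for p, g in zip(preds, golds)]
    return {name: sum(r[name] for r in rows) / len(rows) for name in rows[0]}
```

The floor is then the same static prediction, frequency_top_k(train_labels), scored against every test sample.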

Training resilience note: the published adapter was trained under a hostile GPU watchdog (macOS kills long Metal command buffers on contended/interactive systems). The run survived dozens of process kills via 5-step checkpointing with crash-resume, so the loss/val curves in train_log.jsonl contain resume-boundary artifacts (Adam moments reset on each resume; duplicate step records are resolved keep-last).
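
If you plot the log yourself, the keep-last rule can be applied with a few lines of Python. This sketch assumes each JSONL record carries an integer field named "step"; that field name is a guess at the log schema, not something documented here.

```python
import json


def load_keep_last(lines, step_key="step"):
    """Parse JSONL records and keep only the last record seen for each step."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines left by interrupted writes
        record = json.loads(line)
        latest[record[step_key]] = record  # later duplicates overwrite earlier ones
    return [latest[step] for step in sorted(latest)]
```

Typical use would be load_keep_last(open("train_log.jsonl")), after which the records are in ascending step order with one entry per step.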

Usage

Requires Apple Silicon, mlx-vlm >= 0.6.3 (mlx >= 0.31.2). Inference needs ~17 GB unified memory.

from huggingface_hub import snapshot_download
from mlx_vlm.utils import load
from mlx_vlm import generate

adapter_dir = snapshot_download("Fild/diffusiongemma-26B-A4B-it-tool-selector-lora-mlx")
model, processor = load(
    "mlx-community/diffusiongemma-26B-A4B-it-4bit",
    adapter_path=adapter_dir,
)

# DiffusionGemma chat format, thinking disabled (note the trailing space after the
# system turn: it byte-matches mlx-vlm's apply_chat_template rendering):
prompt = (
    "<bos><|turn>system\n"
    "You select the tools a coding agent will need. "
    "Reply with one '- ToolName' line per tool, chosen from the candidate list. <turn|>\n"
    "<|turn>user\n"
    "Task: find every TODO in the repo and fix the ones in src/\n"
    "Candidates: Bash, Read, Edit, Write, Grep, ...<turn|>\n"
    "<|turn>model\n<|channel>thought\n<channel|>"
)

out = generate(model, processor, prompt, max_tokens=96, temperature=0.0, verbose=False)
print(out.text)
# - Grep
# - Read
# - Edit

(The system/user text above is illustrative; match your own prompt format to your training data.)
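
To turn the generated text into a tool list, something like the following parser may be enough. It is a sketch, not the parser used by the benchmark harness; the decision to cut at <turn|> and to drop names outside the candidate list is an assumption about sensible post-processing.

```python
def parse_tool_lines(text, candidates):
    """Extract '- ToolName' lines, keeping only candidate names in first-seen order."""
    allowed = set(candidates)
    body = text.split("<turn|>", 1)[0]  # ignore anything after the turn terminator
    picked = []
    for line in body.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue
        name = line[2:].strip()
        if name in allowed and name not in picked:
            picked.append(name)
    return picked
```

Preserving order matters if you score top-1 on the first emitted tool.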

To serve as a single fused model (e.g. for servers that do not load adapters), fuse with LoRALinear.fuse(dequantize=False); the result re-quantizes to a ~16 GB 4-bit model.

Training your own (code included)

code/ contains the full pipeline:

  • diffusion_lora_train.py: the block-diffusion QLoRA trainer (the core contribution).
  • diffusion_eval.py: multi-label tool-selection benchmark (Jaccard / exact / precision / recall / top-1).
  • build_diffusiongemma_data.py: renders {messages: [...]} chat JSONL into DiffusionGemma {prompt, response} pairs.
  • overnight_diffusiongemma.sh: eval -> train -> eval chain with crash-resume.

python3 code/diffusion_lora_train.py \
    --model ./diffusiongemma-26B-A4B-it-4bit \
    --data ./data \
    --adapter-path ./adapters/my-adapter \
    --smoke   # 3 forward/backward sanity iters first

Data format: train.jsonl / valid.jsonl with {"prompt": str, "response": str} per line; prompts stored without <bos> (the trainer prepends the id), responses end with <turn|>.
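
Before launching a run, it can help to check your JSONL against the format rules just stated. The validator below is a hypothetical helper, not part of code/; it checks only the three constraints named above (both fields are strings, prompts carry no leading <bos>, responses end with <turn|>).

```python
import json

TURN_END = "<turn|>"
BOS = "<bos>"


def validate_row(row):
    """Return a list of format problems for one {"prompt", "response"} record."""
    problems = []
    for key in ("prompt", "response"):
        if not isinstance(row.get(key), str):
            problems.append(f"{key} must be a string")
    if problems:
        return problems
    if row["prompt"].startswith(BOS):
        problems.append("prompt must not start with <bos>")
    if not row["response"].endswith(TURN_END):
        problems.append("response must end with <turn|>")
    return problems


def validate_jsonl(lines):
    """Map 1-based line numbers to their problems; an empty dict means all rows pass."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        problems = validate_row(json.loads(line))
        if problems:
            report[number] = problems
    return report
```

For example, validate_jsonl(open("data/train.jsonl")) returns {} when every row conforms.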

Hardware and operational notes (Apple Silicon)

Trained on a Mac Studio M2 Max, 64 GB unified memory. Use --grad-checkpoint: in our runs it was faster than the unchunked backward (per-layer recomputation suits this MoE) and drops peak training memory from 28.6 GB to 17.5 GB at micro-batch 1 (prompt <= 1920 tokens + 256-token canvas).

  • macOS GPU watchdog. Metal kills long-running training command buffers (kIOGPUCommandBufferCallbackErrorImpactingInteractivity); in our experience this happens even with an idle console, and buffer caps alone did not always prevent it. The robust configuration is --grad-checkpoint (per-layer dispatch chunks) plus export MLX_MAX_OPS_PER_BUFFER=2 MLX_MAX_MB_PER_BUFFER=10. Kills can still occur intermittently, so the included chain script implements checkpoint crash-resume (--resume-file/--start-step; the LR schedule is offset, Adam moments are lost, which is acceptable for SFT).
  • Memory discipline. Gradients are materialized every micro-step (mx.eval inside the accum loop) so memory stays at single-step peak instead of stacking lazy backward graphs. mx.set_cache_limit defaults to 2 GB. Do not run a second model copy alongside training on a 64 GB box.
  • Detaching over SSH. nohup can fail over some SSH transports ("can't detach from console"); use bash -c 'cmd < /dev/null > log 2>&1 & disown'.
  • Diffusion eval recompilation. mlx-vlm's diffusion generation engine mx.compiles its decoder graphs per prompt shape. On an eval set with varied prompt lengths this can recompile on nearly every sample (minutes per sample of pure CPU). Set MLX_DISABLE_COMPILE=1 for benchmark sweeps over varied-length prompts; training in this repo is unaffected (the trainer does not use mx.compile).
  • bash 3.2 footgun. macOS ships bash 3.2, where expanding an empty array under set -u is a fatal "unbound variable" error; use the guarded ${ARR[@]+"${ARR[@]}"} idiom in retry scripts (the included chain script does).
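
The watchdog workaround above amounts to "set the buffer caps, then relaunch until the trainer exits cleanly". The included chain script does this in shell; the Python sketch below shows the same loop shape under stated assumptions. The build_cmd callback is hypothetical: it is where a caller would add the resume flags for the latest checkpoint, whose exact arguments are not modelled here.

```python
import os
import subprocess


def watchdog_env(base=None):
    """Copy the environment and add the two MLX buffer caps quoted above."""
    env = dict(os.environ if base is None else base)
    env["MLX_MAX_OPS_PER_BUFFER"] = "2"
    env["MLX_MAX_MB_PER_BUFFER"] = "10"
    return env


def run_until_done(build_cmd, max_attempts=50, run=subprocess.run):
    """Relaunch the command until it exits 0; return the number of attempts used.

    build_cmd(attempt) must return the argv list for that attempt, so the caller
    can point each relaunch at the most recent checkpoint.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(build_cmd(attempt), env=watchdog_env())
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"command still failing after {max_attempts} attempts")
```

Passing the runner as a parameter keeps the loop testable without launching a real training process.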

Limitations

  • Task-specific. This adapter does one thing: pick tool names from a candidate list for agent traces formatted like its training data. It is not a general assistant tune and degrades the base model elsewhere.
  • Private training data. The agent-trace dataset (3,148 train / 393 valid / 394 test; these counts sum to the 3,935 rows that existed before deduplication) is derived from private development sessions and is not released. The data-builder script documents the exact format so you can reproduce on your own traces.
  • Single-machine validation. Trained and evaluated on one M2 Max; no multi-seed runs, no other hardware tested.
  • v1 recipe. Omits Google/NeMo's co-trained encoder AR loss and two-pass self-conditioning; quality on harder generative tasks may benefit from adding them (v2 candidates).
  • Quantized-base adapter. Weights were trained against the 4-bit quantized base and should be applied to (or fused into) that conversion, not the bf16 original.

Acknowledgements and license

Apache-2.0 (matching the base model; see the Gemma 4 license note). Base model by Google DeepMind; 4-bit MLX conversion by mlx-community; recipe reconstructed from Google's JAX hackable_diffusion reference, NVIDIA NeMo Automodel, and Unsloth's notebook; built on MLX / mlx-vlm. This repository states its modifications: a LoRA adapter fine-tuned on a private agent-trace tool-selection dataset.

DiffusionGemma is a trademark-adjacent model name of Google; this community fine-tune is not affiliated with or endorsed by Google.

Authors

Rud Lord and the KnowledgeOS Agents, KnowledgeOS Collective (a human–agent collaboration).

Citation

@misc{knowledgeos2026floor,
  title  = {Beating the Floor Is Hard: A Leakage-Controlled Study of On-Device Tool
            Selection with Block-Diffusion and Autoregressive LoRA on Apple Silicon},
  author = {Rud Lord and the KnowledgeOS Agents},
  year   = {2026},
  note   = {KnowledgeOS Collective}
}