Runs on 2×RTX 4090 via PR #24423 - notes

#2
by rob-x-ai - opened

Running these with the DiffusionGemma branch (PR #24423, llama-diffusion-cli) on 2× 4090:

  • Q4_K_M fits one 4090; Q8_0 splits cleanly across both with -ngl 99.
  • Entropy-bound sampler stops early: short replies finish in 13 of 48 steps (2.5 s). --diffusion-visual is great.
  • CLI-only for now (no llama-server), and the bare CLI leaks the raw <|channel>thought tokens into output.
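
For anyone wondering how a run can end at step 13 of 48, here is a rough sketch of the general idea behind an entropy-bounded stop: keep denoising until the mean per-token entropy drops below a threshold. This is only an illustration and not the code in PR #24423. The names `token_entropy`, `run_with_entropy_bound`, `step_fn` and the `threshold` value are all hypothetical, and the real sampler may use a different criterion.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def run_with_entropy_bound(step_fn, max_steps=48, threshold=0.05):
    """Run up to max_steps denoising steps and return how many were used.

    step_fn is a hypothetical callback: given the 1-based step index, it
    returns one probability distribution per sequence position.
    The loop stops as soon as the mean entropy is at or below threshold.
    """
    for step in range(1, max_steps + 1):
        dists = step_fn(step)
        mean_entropy = sum(token_entropy(d) for d in dists) / len(dists)
        if mean_entropy <= threshold:
            return step  # model is confident enough, stop early
    return max_steps  # budget exhausted without reaching the bound


def toy_step(step):
    """Toy stand-in for a model: 4 positions that become one-hot at step 3."""
    p = min(1.0, step / 3)
    return [[p, 1.0 - p] for _ in range(4)]


if __name__ == "__main__":
    used = run_with_entropy_bound(toy_step)
    print(f"stopped after {used} of 48 steps")
```

With a stop rule like this, short or easy replies converge in few steps, while harder ones run closer to the full step budget.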

Build + a small launcher: https://github.com/kroonen-ai/diffusiongemma

Any timeline for llama-server support?

Unsloth AI org

Working on it!
