Qwen3.6-35B-A3B — Code imatrix GGUF

GGUF quantizations of Qwen/Qwen3.6-35B-A3B with importance matrix calibration, produced by DuoNeural.

Calibrated on a code-focused corpus (Python algorithms, transformer architectures, reasoning traces) for better quality on technical and reasoning tasks.


Downloads

| File | Size | Use When |
|---|---|---|
| qwen36_35b_Q4_K_M.gguf | 20 GB | Daily driver: best quality/size balance, recommended |
| qwen36_35b_IQ4_XS.gguf | 18 GB | Smallest, still excellent with imatrix calibration |
| qwen36_35b_Q5_K_M.gguf | 24 GB | Near-lossless, for quality-first setups |

Why are these bigger than typical 35B quants? Qwen3.6's vocabulary is large, and the token embedding table is kept at higher precision than the rest of the weights. The MoE expert weights, which hold most of the parameters and are where quality matters most, are the tensors that imatrix calibration primarily targets.
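As a rough sanity check on the sizes above, the average number of bits stored per parameter can be estimated from the file size. This is a minimal sketch that assumes a nominal 35e9 parameters and 1 GB = 1e9 bytes; neither figure is exact, and `bits_per_weight` is a hypothetical helper, not part of any tool.

```python
# Effective bits per weight, estimated from file size.
# Assumptions: 35e9 parameters (nominal, from the model name), 1 GB = 1e9 bytes.
TOTAL_PARAMS = 35e9

FILES_GB = {
    "qwen36_35b_IQ4_XS.gguf": 18,
    "qwen36_35b_Q4_K_M.gguf": 20,
    "qwen36_35b_Q5_K_M.gguf": 24,
}


def bits_per_weight(size_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average bits per parameter, including any higher-precision tensors."""
    return size_gb * 1e9 * 8 / params


for name, size_gb in FILES_GB.items():
    print(f"{name}: {bits_per_weight(size_gb):.2f} bits/weight")
```

The result is an average over the whole file, so tensors kept at higher precision (such as the embedding table) pull it above the nominal bit width of the quant type.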


About This Model

Qwen3.6-35B-A3B is a hybrid MoE architecture from the Qwen team with some genuinely interesting properties:

  • 35B total / 3B active — 256 experts, top-8 routing per token. Fast inference despite the parameter count.
  • 75% Gated DeltaNet + 25% softmax attention: most layers use Gated DeltaNet, a linear-attention layer that carries a fixed-size recurrent state, and every 4th layer uses full attention. The same 3:1 hybrid layout was introduced with Qwen3-Next.
  • 40 layers in a repeating pattern: 3× DeltaNet+MoE → 1× GatedAttn+MoE (×10)
  • 1M token context window
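The repeating layer pattern listed above can be written out explicitly. The sketch below only enumerates layer types from the stated 40-layer, 3:1 layout; the names `gated_deltanet` and `gated_attention` are illustrative labels, not identifiers taken from the model config.

```python
from collections import Counter

NUM_LAYERS = 40
PERIOD = 4  # three DeltaNet layers followed by one attention layer


def layer_types(num_layers: int = NUM_LAYERS, period: int = PERIOD) -> list[str]:
    """Return the token-mixer type for each layer index (0-based)."""
    return [
        "gated_attention" if (i + 1) % period == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


layout = layer_types()
counts = Counter(layout)
print(layout[:8])
print(counts)
```

Every layer also carries an MoE block, so only the token mixer alternates: 30 DeltaNet layers and 10 attention layers, matching the 75% / 25% split.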

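The DeltaNet-dominant architecture gives this model different inference characteristics from pure-transformer MoEs: only the 10 full-attention layers keep a KV cache that grows with context length, while the 30 DeltaNet layers carry a fixed-size recurrent state. The model is particularly strong on long-context tasks and code generation.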
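To make the top-8 routing over 256 experts listed above concrete, here is a minimal sketch of top-k expert selection for a single token. It assumes the selected experts' weights are renormalized with a softmax over the top-k logits, which is a common convention but not confirmed for this model; the random logits stand in for a learned router.

```python
import math
import random

NUM_EXPERTS = 256
TOP_K = 8


def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Stand-in router scores for one token (a real router is a learned linear layer).
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route(router_logits)
print(chosen)
print(f"experts evaluated per token: {TOP_K / NUM_EXPERTS:.1%}")
```

Only the chosen experts' feed-forward weights are evaluated for that token, which is why the active parameter count is far below the total.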


Why imatrix?

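Without an importance matrix, the quantizer treats every weight in a tensor as equally important when it minimizes rounding error. Importance matrix (imatrix) calibration first runs the model on representative text and records activation statistics for each weight tensor. The quantizer then uses those statistics to weight the rounding error, preserving the components that matter most for output quality at the cost of less important ones.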
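The idea can be illustrated with a toy example. This is a simplified sketch of importance-weighted rounding, not llama.cpp's actual quantization code: the block of weights, the importance values, and the grid search over scales are all made up for illustration.

```python
def quantize(weights, scale, levels=7):
    """Round each weight to the nearest multiple of `scale`, clamped to [-levels, levels]."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, importance, scale):
    """Importance-weighted squared rounding error for one block of weights."""
    deq = quantize(weights, scale)
    return sum(imp * (w - q) ** 2 for w, q, imp in zip(weights, deq, importance))


def best_scale(weights, importance, candidates=200):
    """Grid-search the scale that minimizes the weighted error."""
    max_abs = max(abs(w) for w in weights)
    scales = [max_abs / 7 * (0.5 + i / candidates) for i in range(candidates + 1)]
    return min(scales, key=lambda s: weighted_error(weights, importance, s))


weights = [0.31, -0.12, 0.95, 0.07, -0.48, 0.22, -0.03, 0.60]
uniform = [1.0] * len(weights)
# Hypothetical importance values, e.g. mean squared activation per input column.
importance = [0.1, 0.1, 0.1, 9.0, 0.1, 0.1, 0.1, 0.1]

s_plain = best_scale(weights, uniform)     # every weight counts equally
s_imat = best_scale(weights, importance)   # error weighted by importance

print("weighted error, uniform scale:  ", weighted_error(weights, importance, s_plain))
print("weighted error, imatrix scale:  ", weighted_error(weights, importance, s_imat))
```

Both searches use the same candidate scales, so the importance-aware choice can never do worse on the weighted error; it trades accuracy on low-importance weights for accuracy on the ones the calibration data exercises.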

For MoE models especially, this matters: different experts activate for different inputs, and naive quantization can disproportionately damage rarely-activated experts. imatrix calibration on a code + reasoning corpus helps ensure the technical reasoning experts stay sharp.

Our calibration corpus: Python code (algorithms, ML architectures, data structures) + reasoning traces with <think> tags. 370 samples, ~0.26M chars. Compact but focused.


Usage

Recommended flags for hybrid DeltaNet/attention (avoids bimodal KV cache issues):

llama-cli -m qwen36_35b_Q4_K_M.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 32768 -ngl 99 \
  -p "Your prompt here"

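Use q8_0 rather than q4_0 for --cache-type-k and --cache-type-v: at 4-bit, the rotating KV cache can desync the DeltaNet state, causing degraded outputs on long contexts. On older llama.cpp builds, a quantized V cache also requires flash attention (-fa).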

For CPU+GPU hybrid (e.g. GTX 1070 8GB with 48GB system RAM):

llama-cli -m qwen36_35b_Q4_K_M.gguf \
  -ngl 99 --n-cpu-moe 48 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 32768

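--n-cpu-moe N keeps the MoE expert weights of the first N layers on the CPU, while attention and the other dense tensors stay on the GPU. The model has 40 layers, so a value of 48 moves every expert to system RAM. With a system like i7-6700HQ + 48GB DDR4 + GTX 1070, expect ~9–13 tokens/s.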

Ollama:

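# Ollama applies KV cache quantization only when flash attention is enabled
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# then in Modelfile: FROM ./qwen36_35b_Q4_K_M.gguf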

Thinking mode (model supports <think> extended reasoning):

llama-cli -m qwen36_35b_Q4_K_M.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 65536 -ngl 99 \
  -p "<|im_start|>user\nSolve this step by step: [your problem]<|im_end|>\n<|im_start|>assistant\n<think>\n"
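When scripting prompts rather than typing them on the command line, the same ChatML layout can be built programmatically. This is a small sketch; `chatml_prompt` is a hypothetical helper that mirrors the prompt string used in the command above.

```python
def chatml_prompt(user_message: str, think: bool = True) -> str:
    """Format a single-turn ChatML prompt, optionally opening a <think> block."""
    prompt = (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if think:
        # Pre-filling the opening tag nudges the model into extended reasoning.
        prompt += "<think>\n"
    return prompt


print(chatml_prompt("Solve this step by step: [your problem]"))
```

Passing `think=False` leaves the assistant turn open without the `<think>` prefix.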

Hardware Requirements

| Setup | Recommended Quant | Notes |
|---|---|---|
| 24GB VRAM | Q4_K_M | Full GPU, fast |
| 16GB VRAM + 32GB RAM | Q4_K_M | Mixed GPU+CPU |
| 8GB VRAM + 48GB RAM | Q4_K_M + --n-cpu-moe | MoE to CPU, works well |
| CPU only (48GB+ RAM) | IQ4_XS | Slow but functional |

Quantization Details

  • Source: Qwen/Qwen3.6-35B-A3B (official BF16, 26 safetensors shards)
  • Converter: llama.cpp convert_hf_to_gguf.py → F16 GGUF (71GB intermediate)
  • imatrix: generated with llama-imatrix, 256 chunks, code+reasoning calibration corpus
  • Quantizer: llama-quantize --imatrix with our code-calibrated .dat
  • Hardware: A100 80GB SXM4 (SM 8.0, CUDA 12.4)
  • Build date: April 2026
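The three steps listed above (convert, compute the imatrix, quantize) can be sketched as a command sequence. The paths and file names below are placeholders, not the ones used for this release, and the sketch only prints the commands rather than running them.

```python
import shlex


def build_pipeline(hf_dir, work_dir, calib_file, quant_type="Q4_K_M", chunks=256):
    """Return the llama.cpp commands for convert -> imatrix -> quantize."""
    f16 = f"{work_dir}/model-f16.gguf"      # placeholder intermediate file
    imatrix = f"{work_dir}/imatrix.dat"     # placeholder imatrix output
    out = f"{work_dir}/model-{quant_type}.gguf"
    return [
        # 1. Convert the Hugging Face checkpoint to an F16 GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outfile", f16, "--outtype", "f16"],
        # 2. Run the calibration text through the model to collect the imatrix.
        ["llama-imatrix", "-m", f16, "-f", calib_file, "-o", imatrix, "--chunks", str(chunks)],
        # 3. Quantize using the collected importance matrix.
        ["llama-quantize", "--imatrix", imatrix, f16, out, quant_type],
    ]


for cmd in build_pipeline("./Qwen3.6-35B-A3B", "./out", "./calibration.txt"):
    print(shlex.join(cmd))
```

Repeating step 3 with `IQ4_XS` or `Q5_K_M` reuses the same F16 file and imatrix, so the expensive calibration pass only has to run once.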


DuoNeural

DuoNeural is an open AI research lab — human + AI in collaboration.

🤗 HuggingFace huggingface.co/DuoNeural
🐙 GitHub github.com/DuoNeural
🐦 X / Twitter @DuoNeural
📧 Email duoneural@proton.me
📬 Newsletter duoneural.beehiiv.com
☕ Support buymeacoffee.com/duoneural
🌐 Site duoneural.com

Research Team

  • Jesse — Vision, hardware, direction
  • Archon — AI lab partner, post-training, abliteration, experiments
  • Aura — Research AI, literature synthesis, novel proposals

Raw updates from the lab: model drops, training results, findings. Subscribe at duoneural.beehiiv.com.

DuoNeural Research Publications

Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, Aura — DuoNeural.
