🔬 Kleiner — Qwopus-3.6-35B-A3B-Coder Mixed q2_K + imatrix

Black Mesa mixed-quant series · the coder/engineer.

A CPU-offload-aware mixed-precision GGUF of Qwopus-3.6-35B-A3B-Coder, an agentic-coding fine-tune of Qwen3.6-35B-A3B (Opus-flavored, thinking-off). It runs the full 256K context at ~74 tok/s decode on a single dual-GPU desktop with 18 GB of combined VRAM (8 GB + 10 GB) by quantizing the CPU-offloaded expert layers to Q2_K while keeping the GPU-resident tensors at Q4_K.

TL;DR: a 19 GB file that runs a strong coding model at full 256K context, at small-context speed — the fast daily driver for agentic/tool-use coding.


What it is

Jackrong's Qwopus-3.6-35B-A3B-Coder is Qwen3.6-35B-A3B (hybrid qwen35moe: gated attention + gated-delta-net SSM, 256 experts, 8+1 active, ~3B active/token) fine-tuned for agentic coding — repository tasks, debugging traces, tool schemas, multi-turn feedback — with thinking-off behavior to cut token waste in agent loops. It reports SWE-bench 62.4% (thinking off) and, per its card, beats Ornith-1.0 on legit-request compliance and multi-turn orchestration.

This build applies the mixed q2_K + imatrix quantization so you get that coding model at the fast 256K profile on limited VRAM.

Recipe

  • Base: Qwopus-3.6-35B-A3B-Coder (qwen35moe, 40 blocks + 1 nextn/MTP layer).
  • Source → output: requantized from the Q8_0 using an importance matrix computed on that same Q8_0 (~61K tokens); Q2_K on the offloaded expert layers, Q4_K on the GPU-resident tensors, Q6_K on the output.
  • Mixed layout:
    • ffn_*_exps on blocks 13–26 → Q2_K (42 tensors, the CPU-offloaded set)
    • everything else → Q4_K · output-class → Q6_K
  • 4.88 bpw effective, ~19 GB, 256K native context.
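
The 42-tensor figure can be sanity-checked with a small shell sketch. It assumes the expert tensors are named blk.N.ffn_{up,gate,down}_exps.weight across blocks 0–39 (the names are generated synthetically here, not read from the GGUF) and reuses the block-range regex from the -ot override in the run command.

```shell
# Pattern copied from the -ot override, without the trailing "=CPU".
pattern='blk\.(1[3-9]|2[0-6])\.ffn_(up|gate|down)_exps\.weight'

# Print synthetic expert-tensor names for 40 blocks (0..39).
# Assumption: this naming mirrors the real GGUF; nothing is read from the file.
list_expert_tensors() {
  b=0
  while [ "$b" -lt 40 ]; do
    for p in up gate down; do
      printf 'blk.%d.ffn_%s_exps.weight\n' "$b" "$p"
    done
    b=$((b + 1))
  done
}

# Count the names the pattern selects, and record the first and last match.
matched=$(list_expert_tensors | grep -Ec "$pattern")
first=$(list_expert_tensors | grep -E "$pattern" | head -n 1)
last=$(list_expert_tensors | grep -E "$pattern" | tail -n 1)

echo "matched=$matched first=$first last=$last"
```

Under those assumptions the pattern selects 14 blocks × 3 projections = 42 names, from blk.13.ffn_up_exps.weight through blk.26.ffn_down_exps.weight.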

See the Gordon (base Qwen3.6) card for the full rationale on why offloaded-layer byte count (not file size) drives decode speed.

Benchmarks

RTX 3060 Ti (8 GB) + RTX 3080 (10 GB), Ryzen 5950X, 46 GB DDR4-2733, ik_llama.cpp, q4_0 KV, flash-attn on:

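| Metric  | This mixed q2_K             | Qwopus Q8_0          |
|---------|-----------------------------|----------------------|
| Decode  | ~74 tok/s @256K             | ~32 tok/s @64K       |
| Context | 262144                      | 65536                |
| VRAM    | ~16.8 GB                    | 15.6 GB (+29 GB RAM) |
| Output  | clean code, --reasoning off | clean                |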

How to run (ik_llama.cpp)

The -ot override is required: it pins the Q2_K expert tensors of blocks 13–26 to the CPU. --reasoning off matches the model's thinking-off design (snappy agent loops, direct code).

./llama-server \
  -m Qwopus3.6-35B-A3B-Coder-mixed-q2k.gguf \
  --jinja --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn on \
  --ctx-size 262144 --parallel 1 --n-gpu-layers 99 \
  -ot 'blk\.(1[3-9]|2[0-6])\.ffn_(up|gate|down)_exps\.weight=CPU' \
  --tensor-split 44,56 --ubatch-size 256 \
  --reasoning off --reasoning-budget 0 \
  --no-mmap --threads 8 --no-warmup --port 8000
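
The --tensor-split 44,56 values in the command above line up with each card's share of the 18 GB total VRAM (8 GB and 10 GB). A minimal sketch of that arithmetic follows, assuming a proportional split is the intent (the card does not say how the values were chosen) and that the 8 GB card is device 0; the variable names are hypothetical.

```shell
# VRAM per card in GB, taken from the benchmark hardware line.
# Assumption: device 0 is the 8 GB card and device 1 is the 10 GB card.
vram_gpu0=8    # RTX 3060 Ti
vram_gpu1=10   # RTX 3080

# Percentage share of total VRAM per card, rounded to the nearest integer.
split=$(awk -v a="$vram_gpu0" -v b="$vram_gpu1" \
  'BEGIN { t = a + b; printf "%.0f,%.0f", 100 * a / t, 100 * b / t }')

echo "--tensor-split $split"
```

For a different pair of cards, substitute their VRAM sizes. The values must follow the device order the runtime reports, which this sketch does not detect.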

A note on the MTP head

The base has a built-in MTP (multi-token-prediction) layer (blk.40.nextn.*) that is claimed to give 1.4–2.2× faster generation. That speedup needs vLLM/SGLang; in llama.cpp/ik_llama the MTP head is ignored and the model runs as a normal A3B (ik_llama's MTP support is gated to the gemma4 arch, not qwen35moe). The MTP tensors are carried in this quant but unused, which is harmless.

Intended use & limitations

  • Target: local agentic/tool-use coding at full 256K on ~16–18 GB VRAM, at usable speed.
  • The Q2_K expert layers are the quality floor; for maximum fidelity use the Q8_0.
  • Inherits the capabilities and biases of the base Qwopus-Coder. Pure quantization — no fine-tuning or alignment changes.

Provenance

  • Original base: Qwen3.6-35B-A3B by Qwen (Apache-2.0).
  • Coding fine-tune: Jackrong (Qwopus-3.6-35B-A3B-Coder).
  • Mixed quantization + imatrix + tuning: xero0000, June 2026.

Released under the base model's Apache-2.0 license (quantization does not change the license).
