🧊 Jackrong/Qwopus3.6-27B-Coder 2-bit

imatrix + MTP
📦 8.89 / 9.74 / 9.96 GiB IQ2_XS / IQ2_M / Q2_K_S ⚡ MTP bundled (Q8) · 1.26× · 79.9% accept · @n=1 🏗️ llama.cpp 32782998 🏅 KLD 0.053 · top_p 83.2%

🧊 What this is

Three aggressively compressed (under 3.2 bits per weight) quantizations of Jackrong/Qwopus3.6-27B-Coder, each calibrated with a hybrid importance matrix from real usage logs + wiki text, and each shipping the model's own Multi-Token-Prediction (MTP) draft head bundled in at Q8_0 for built-in speculative decoding. The imatrix spends the 2-bit codebook's precision where the model is most sensitive; the MTP head — kept near-lossless at Q8 while the trunk goes 2-bit — drafts the next token for a ~1.26× decode speedup at 79.9% acceptance, no separate draft model required. Plain GGUF, no custom runtime.

📉 ~5× smaller on disk: 8.9–10.0 GiB (incl. the bundled MTP head) vs 50.9 GiB for FP16. Tuned for English + Python agentic-coding workloads (see calibration scope below).
⚡ 1.26× faster decode: built-in MTP speculative decoding reaches 22.9 vs 18.1 tok/s on Metal (IQ2_M, n-max=1) at 79.9% draft acceptance.

🧰 1. Files & comparison

Three imatrix-calibrated quants, each with the MTP head bundled at Q8_0. Plain Q2_K (no imatrix) is the no-calibration anchor. FP16 reference: 50.90 GiB (not included; fetch from Jackrong/Qwopus3.6-27B-Coder).

FP16 (reference) Q2_K (plain) IQ2_XS (hybrid) IQ2_M (hybrid) Q2_K_S (hybrid)
File n/a Q2_K.gguf IQ2_XS.gguf IQ2_M.gguf Q2_K_S.gguf
Quant FP16 Q2_K IQ2_XS IQ2_M Q2_K_S
Quality ⭐⭐⭐ ⭐⭐
Technique none (reference) plain (no imatrix) hybrid imatrix hybrid imatrix hybrid imatrix
Size (GiB) 50.90 10.40 8.89 9.74 9.96
BPW 16.000 3.269 2.794 3.062 3.133
PPL (general) 6.4826 5.5835 9.8866 8.5961 8.0091
KLD med (general) 0.00000 0.1154 0.0950 0.0535 0.0566
top_p (general) 100.00% 79.29% 78.87% 83.23% 83.32%
⚠️ Caveat. Sub-3.2-bpw quants of a 27B model. Strong for their size, but not a substitute for FP16 / Q4_K_M / Q5_K_M when you have the VRAM. Use them when memory is the binding constraint.
📋 Calibration scope — English & Python, agentic coding. The importance matrix (and the windowed packing that shaped it) was calibrated on real agentic-coding sessions that are overwhelmingly English-language and Python-centric, captured from Claude Code, opencode, and qwen code. At 2 bits the codebook's precision is spent where those logs put it: English prompts and Python-flavored tool use (read / edit / bash / grep / write, etc.). Expect weaker fidelity on other natural languages, non-Python ecosystems, and non-coding / general-chat workloads.
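
The BPW row can be cross-checked from the file sizes alone. The sketch below assumes the FP16 reference is exactly 16 bits per weight and that every file holds the same tensors, so it ignores GGUF metadata and any difference in how the MTP head is stored. The function names are illustrative, not part of any tool. Differences in the last digit come from the sizes being rounded to two decimals.

```python
# Sizes in GiB copied from the comparison table above.
FP16_GIB = 50.90
SIZES_GIB = {"Q2_K": 10.40, "IQ2_XS": 8.89, "IQ2_M": 9.74, "Q2_K_S": 9.96}

def bits_per_weight(size_gib, fp16_gib=FP16_GIB):
    """Effective bits per weight, ASSUMING the FP16 file is exactly 16 bpw."""
    return 16.0 * size_gib / fp16_gib

def compression_ratio(size_gib, fp16_gib=FP16_GIB):
    """How many times smaller the quant is than the FP16 reference."""
    return fp16_gib / size_gib

for name, size in SIZES_GIB.items():
    print(f"{name:7s} {bits_per_weight(size):.3f} bpw  "
          f"{compression_ratio(size):.2f}x smaller than FP16")
```

The three imatrix quants land between roughly 5.1× and 5.7× smaller than FP16, which is where the "~5×" headline figure comes from.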

SWE-rebench Results

The agentic coding capabilities of each quant were evaluated on 10 real-world coding issues from nebius/SWE-rebench, using the OpenAI Agents SDK pointed at a local llama-server. For each issue, the agent gets the problem statement and a live bash tool that shells into a dedicated Docker container with the repo pre-checked out at the failing commit. It iterates by reading files, running tests, and editing code until it produces a git diff or hits the step limit. The patch is then graded by running the repo's FAIL_TO_PASS test suite inside the container, so pass/fail reflects real execution, not fuzzy matching. We also tried mini SWE-Agent, but it did not resolve issues adequately despite having a similar patch rate.

Metric Q2_K IQ2_XS IQ2_M Q2_K_S Q5_K_M
File Q2_K.gguf IQ2_XS.gguf IQ2_M.gguf Q2_K_S.gguf Q5_K_M.gguf
Technique none imatrix imatrix imatrix none
Size (GiB) 10.40 8.89 9.74 9.96 19.50
Repetitions 3 3 3 3 3
Issues 10 10 10 10 10
Patch Rate 88±12% 70±10% 100% 93±6% 100%
Pass Rate 30±10% 27±6% 63±6% 57±6% 57±6%
Max Turns 27±15% 57±25% 13±15% 10±17% 0%
Mean Steps 58.5±7.6 73.1±15.1 51.6±8.3 46.7±8.1 38.6±1.3
Mean Tokens 1,335K±253K 1,779K±137K 784K±260K 922K±195K 588K±57K
Tool Error Rate 14.6±6.4% 9.5±3.6% 12.6±1.8% 8.9±1.5% 12.1±0.2%
Mean Wall 415±98s 558±182s 381±66s 425±259s 307±34s

Sampling Parameters: temperature=0.25, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_tokens=32768, ctx=131072, thinking=true, mtp=true, mtp_draft_n_max=2. Tested on an RTX 4060 Ti (16 GB).

Definitions:

  • patched (Patch Rate) - share of the 10 issues for which the agent produced a patch, even if the patch did not resolve the issue
  • resolved (Pass Rate) - share of the 10 issues whose patch passed all FAIL_TO_PASS tests
  • max_turns (Max Turns) - share of the 10 issues that hit the 100-step cap without resolving
  • mean_steps (Mean Steps) - average number of agentic steps taken (shelling into Docker, reading files, and editing code all count as steps)
  • mean_tokens (Mean Tokens) - average number of tokens generated across the entire agentic episode
  • tool_err_rate (Tool Error Rate) - how often the agent produced an invalid shell command that couldn't be executed (syntax errors, wrong file paths, etc.)
  • mean_wall (Mean Wall) - average wall-clock time per episode (capped at 2 hours for those that hit the step limit)

Overall, the IQ2_M quant achieves a 63% pass rate on this agentic coding benchmark, which is strong for a 2-bit model and on par with Q5_K_M (57±6%) within run-to-run noise. IQ2_M needs roughly a third more steps and tokens than Q5_K_M (51.6 vs 38.6 steps, 784K vs 588K tokens), but those extra iterations appear productive: they help it self-correct and resolve issues rather than loop.

The high patch rate across all quants suggests that even the weaker ones can generate plausible patches, but their lower pass rates and higher max-turn rates indicate that many of those patches do not resolve the issue. A high mean token count combined with a high max-turn rate usually means the agent is stuck in a loop. Notably, Q5_K_M never hits the 100-step cap on these issues. We recommend running these quants with a repetition penalty above 1.0 to break them out of loops.

Because sampling introduces variation, each quant is run for 3 repetitions, and the table reports the mean ± standard deviation across those runs.
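
As an illustration of how a "mean ± standard deviation" cell can be produced from three repetitions, here is a minimal sketch. The per-repetition counts are hypothetical (per-run results are not published here), and the use of the sample standard deviation is an assumption. The actual aggregation script may differ.

```python
from statistics import mean, stdev

def summarize(rates_pct):
    """Return (mean, sample standard deviation) of per-repetition rates in percent."""
    return mean(rates_pct), stdev(rates_pct)

# HYPOTHETICAL per-repetition resolved counts out of 10 issues (not published data).
resolved_per_run = [6, 6, 7]
rates = [100.0 * n / 10 for n in resolved_per_run]

m, s = summarize(rates)
print(f"Pass Rate {m:.0f}±{s:.0f}%")
```

With only 3 repetitions of 10 issues, one extra resolved issue in a single run moves the rate by 10 points, so differences of a few points between quants should be read with that granularity in mind.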


🔬 2. How they were made

🧮 2.1 Hybrid importance matrix

At 2-bit the quantizer must decide where to spend its limited precision. An importance matrix measures, per input channel, how much that channel drives each layer's output on a calibration corpus, and tells llama-quantize to preserve the high-impact channels. This release uses a hybrid imatrix blending activation energy E[a²] with weight-column energy ‖W[:, c]‖² · E[a²], collected at ctx=4096. Linear-attention / SSM tensors (this is a Qwen3.6 hybrid architecture) pass through with raw E[a²]. The output is a standard GGUF with no runtime overhead.
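
To make the two energy terms concrete, the sketch below computes E[a²] and ‖W[:, c]‖² · E[a²] per input channel on a toy layer and blends them. The blend rule (mean-normalize each term, then mix with a weight `alpha`) and every function name here are assumptions made for illustration. The release does not state how the two terms are combined, so this is not the Quant-Tuner implementation.

```python
def activation_energy(activations):
    """E[a^2] per input channel; `activations` is a list of rows, one per token."""
    n_tokens = len(activations)
    n_channels = len(activations[0])
    return [sum(row[c] ** 2 for row in activations) / n_tokens
            for c in range(n_channels)]

def column_sq_norms(weight):
    """||W[:, c]||^2 per input channel; `weight` is a list of output rows."""
    n_channels = len(weight[0])
    return [sum(row[c] ** 2 for row in weight) for c in range(n_channels)]

def normalize(values):
    """Rescale to mean 1 so the two terms are comparable before blending."""
    avg = sum(values) / len(values)
    return list(values) if avg == 0 else [v / avg for v in values]

def hybrid_importance(activations, weight, alpha=0.5):
    # alpha is a HYPOTHETICAL blend weight, not a documented setting.
    act = activation_energy(activations)
    weighted = [n * e for n, e in zip(column_sq_norms(weight), act)]
    act_n, weighted_n = normalize(act), normalize(weighted)
    return [(1 - alpha) * a + alpha * w for a, w in zip(act_n, weighted_n)]

# Toy layer: 3 input channels, 2 output rows, 2 calibration tokens.
acts = [[1.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
W = [[1.0, 5.0, 1.0], [0.0, 5.0, 1.0]]
print(hybrid_importance(acts, W))
```

In the toy data, channel 1 has the largest weights but never activates, so both terms give it zero importance. Under this sketch, setting `alpha=0` reduces to the raw (mean-normalized) E[a²] term, which mirrors the pass-through described for the linear-attention / SSM tensors.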

⚡ 2.2 Bundled MTP (multi-token prediction)

Qwopus3.6 ships a trained MTP draft head (one nextn layer, blk.64) that predicts the next token from the trunk's hidden state. llama.cpp runs it as built-in speculative decoding (--spec-type draft-mtp): the head drafts, the trunk verifies in parallel, and accepted drafts skip a full decode step.

We keep the MTP head near-lossless at Q8_0 while the trunk goes 2-bit — the head is tiny relative to the model, and a 2-bit draft head would draft poorly. Measured on Metal (IQ2_M, n-max=1, holdout prompts):

Config | Decode tok/s | Draft acceptance
MTP on (n-max=1) | 22.9 ± 0.7 | 79.9%
baseline (off) | 18.1 ± 1.7 | n/a

→ 1.26× speedup on Metal. Qwen3.6 exposes one nextn layer, so --spec-draft-n-max 1 is optimal (higher values don't help). GPU bandwidth matters — the upstream Qwen3.6 figure is ~1.66× on an RTX 5090. See MTP/README.md for details.
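
One way to read the gap between 79.9% acceptance and a 1.26× speedup is a toy cost model, sketched below. It assumes each drafted token is accepted independently with the measured acceptance rate and that drafting stops at the first rejection. The "overhead" it derives is simply whatever makes the model match the measurement. This is an illustrative assumption, not how llama.cpp accounts for time. The rounded table values give about 1.27× where the text quotes 1.26×.

```python
def expected_tokens_per_step(accept_rate, n_draft=1):
    """Expected tokens emitted per verify step when each drafted token is
    accepted independently and drafting stops at the first rejection."""
    return sum(accept_rate ** k for k in range(n_draft + 1))

def implied_overhead(accept_rate, measured_speedup, n_draft=1):
    """Relative extra cost of one draft+verify step vs one plain decode step
    that would explain the measured speedup under this toy model."""
    return expected_tokens_per_step(accept_rate, n_draft) / measured_speedup - 1.0

measured = 22.9 / 18.1                    # tok/s with MTP on / off, from the table
ideal = expected_tokens_per_step(0.799)   # upper bound if drafting were free
print(f"measured {measured:.2f}x, zero-overhead bound {ideal:.2f}x, "
      f"implied step overhead {implied_overhead(0.799, measured):.0%}")
```

Under these assumptions, a draft-plus-verify step costs roughly 40% more than a plain decode step on this Metal setup, which is consistent with the note that memory bandwidth matters for the achievable speedup.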

📚 2.3 Calibration & evaluation data

Calibration and every eval corpus are disjoint by construction — the tool-call eval is the held-out 10% of sessions, windowed exactly like calibration but never seen by it — so §1 measures generalization, not fit. All shipped under calibration_data/.

Corpus | Source | Used for
Calibration | ~500k tokens of usage-log text (windowed) + all of wiki.test.raw | hybrid imatrix collection
Eval: tools (in-distribution) | held-out logtrain session slice (10%), windowed like calibration but disjoint from it | §1 tools columns (PPL · KLD · top_p)
Eval: general | combined_en_tiny (broad English) from the same eaddario dataset | §1 gen columns (PPL · KLD · top_p)
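
The "disjoint by construction" property comes from splitting at the session level before windowing, so no window from a held-out session can reach the calibration set. A minimal sketch of that idea follows. The function names, the seed, the window size, and the non-overlapping windows are all hypothetical choices for illustration, not the actual pipeline.

```python
import random

def split_sessions(session_ids, holdout_frac=0.10, seed=0):
    """Split at the session level so no session feeds both calibration and eval."""
    ids = sorted(session_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_frac))
    return ids[n_holdout:], ids[:n_holdout]   # (calibration, holdout)

def window(tokens, size):
    """Cut one session's token list into consecutive, non-overlapping windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# HYPOTHETICAL toy data: 20 sessions of 10 integer "tokens" each.
sessions = {f"s{i:02d}": list(range(i * 10, i * 10 + 10)) for i in range(20)}

calib_ids, eval_ids = split_sessions(sessions)
calib_windows = [w for sid in calib_ids for w in window(sessions[sid], 4)]
eval_windows = [w for sid in eval_ids for w in window(sessions[sid], 4)]

# No session appears on both sides, so the windowed corpora cannot overlap.
assert not set(calib_ids) & set(eval_ids)
print(len(calib_ids), len(eval_ids), len(calib_windows), len(eval_windows))
```

Splitting after windowing instead would let adjacent windows from one session land on both sides, which would make the tool-call eval measure fit instead of generalization.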

🚀 3. Usage

Quick start with Ollama

Each quant is exposed as a tag (the filename's quant suffix):

ollama run hf.co/pearsonkyle/Qwopus3.6-27B-Coder-2bit-MTP-GGUF:IQ2_M
# also: :Q2_K_S  ·  :IQ2_XS  ·  :Q2_K

Building llama.cpp from source (GPU)

apt-get update && apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON   # -DGGML_CUDA=OFF for CPU/Metal
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/

MTP needs a recent llama.cpp: --spec-type draft-mtp support was merged in 2026-06. Build from current master.

Running the server with MTP speculative decoding

 ./llama-server \
    --model Qwopus3.6-27B-Coder-IQ2_M.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 999 \
    --spec-type draft-mtp \
    --spec-draft-n-max 1 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --host 0.0.0.0 --port 1234

Drop --spec-type draft-mtp --spec-draft-n-max 1 to run without MTP.

Querying via the OpenAI-compatible API

import json, urllib.request

def ask(content, max_tokens=256):
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        # Coder variant emits <think> reasoning. Set enable_thinking False
        # (or raise max_tokens) so the answer lands in "content".
        "chat_template_kwargs": {"enable_thinking": False},
    }
    req = urllib.request.Request("http://127.0.0.1:1234/v1/chat/completions",
                                 json.dumps(body).encode(),
                                 {"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"]

print(ask("Write a Python function that reverses a linked list."))
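
For agentic use, the request body can also carry the sampling settings listed in the SWE-rebench section, plus a repetition penalty above 1.0 as recommended there. The sketch below only builds the body. The `build_body` helper and the 1.05 value are hypothetical, and the `top_k`, `min_p`, and `repeat_penalty` field names are an assumption about what llama-server accepts on this endpoint, so check them against your build's server documentation.

```python
def build_body(content, repeat_penalty=1.05, thinking=True):
    """Chat-completions body mirroring the benchmark's sampling settings."""
    return {
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.25,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 0.0,
        # ASSUMPTION: the server reads the repetition penalty from "repeat_penalty".
        # 1.05 is an illustrative value; the benchmark itself used 1.0.
        "repeat_penalty": repeat_penalty,
        # Large budget so <think> reasoning does not crowd out the answer.
        "max_tokens": 32768,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

body = build_body("Fix the failing test in utils/date.py.")
print(sorted(body))
```

The resulting dictionary can be passed through `json.dumps(...)` in place of `body` in the request above.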

🪪 4. License & attribution

  • Inherits its license from the base model Jackrong/Qwopus3.6-27B-Coder. Confirm the exact terms and update the frontmatter license: before publishing.
  • Base weights: Jackrong/Qwopus3.6-27B-Coder (full finetune of Qwen3.6-27B, ships its own MTP head).
  • Calibration + quantization performed locally with Quant-Tuner; vendored llama.cpp at commit 32782998.
  • Calibration data (usage logs) scraped using LogMiner.