🧊 Jackrong/Qwopus3.6-27B-Coder 2-bit

imatrix + MTP

📦 8.89 / 9.74 / 9.96 GiB IQ2_XS / IQ2_M / Q2_K_S ⚡ MTP bundled (Q8) · 1.26× · 79.9% accept · @n=1 🏗️ llama.cpp 32782998 🏅 KLD 0.053 · top_p 83.2%

🧊 What this is

Three aggressively compressed (under 3.2 bits per weight) quantizations of Jackrong/Qwopus3.6-27B-Coder, each calibrated with a hybrid importance matrix from real usage logs + wiki text, and each shipping the model's own Multi-Token-Prediction (MTP) draft head bundled in at Q8_0 for built-in speculative decoding. The imatrix spends the 2-bit codebook's precision where the model is most sensitive; the MTP head — kept near-lossless at Q8 while the trunk goes 2-bit — drafts the next token for a ~1.26× decode speedup at 79.9% acceptance, no separate draft model required. Plain GGUF, no custom runtime.

📉 ~5× smaller on disk: 8.9–10.0 GiB (including the bundled MTP head) vs 50.9 GiB for FP16. Tuned for English + Python agentic-coding workloads (see the calibration scope below).

⚡ 1.26× faster decode: built-in MTP speculative decoding reaches 22.9 vs 18.1 tok/s on Metal (IQ2_M, n-max=1) at 79.9% draft acceptance.

🧰 1. Files & comparison

Three imatrix-calibrated quants, each with the MTP head bundled at Q8_0. Plain Q2_K (no imatrix) is the no-calibration anchor. FP16 reference: 50.90 GiB (not included; fetch from Jackrong/Qwopus3.6-27B-Coder).

| | FP16 (reference) | Q2_K (plain) | IQ2_XS (hybrid) | IQ2_M (hybrid) | Q2_K_S (hybrid) |
|---|---|---|---|---|---|
| File | n/a | Q2_K.gguf | IQ2_XS.gguf | IQ2_M.gguf | Q2_K_S.gguf |
| Quant | FP16 | Q2_K | IQ2_XS | IQ2_M | Q2_K_S |
| Quality | | ❌ | ❌ | ⭐⭐⭐ | ⭐⭐ |
| Technique | none (reference) | plain (no imatrix) | hybrid imatrix | hybrid imatrix | hybrid imatrix |
| Size (GiB) | 50.90 | 10.40 | 8.89 | 9.74 | 9.96 |
| BPW | 16.000 | 3.269 | 2.794 | 3.062 | 3.133 |
| PPL (general) | 6.4826 | 5.5835 | 9.8866 | 8.5961 | 8.0091 |
| KLD med (general) | 0.00000 | 0.1154 | 0.0950 | 0.0535 | 0.0566 |
| top_p (general) | 100.00% | 79.29% | 78.87% | 83.23% | 83.32% |
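The Size and BPW rows are tied together by simple arithmetic. The sketch below assumes BPW is derived as file bits divided by parameter count, so that scaling the FP16 row by the file-size ratio recovers it; this reproduces the table to within rounding of the published sizes, but it is an inference from the numbers, not a documented formula. The function names are illustrative.

```python
FP16_SIZE_GIB = 50.90  # FP16 reference size from the table
FP16_BPW = 16.0        # FP16 stores 16 bits per weight


def bits_per_weight(size_gib, ref_size_gib=FP16_SIZE_GIB, ref_bpw=FP16_BPW):
    """Scale the reference BPW by the file-size ratio (the parameter count cancels)."""
    return ref_bpw * size_gib / ref_size_gib


def compression_ratio(size_gib, ref_size_gib=FP16_SIZE_GIB):
    """How many times smaller the quantized file is than the FP16 reference."""
    return ref_size_gib / size_gib


# Sizes in GiB, copied from the table.
sizes = {"Q2_K": 10.40, "IQ2_XS": 8.89, "IQ2_M": 9.74, "Q2_K_S": 9.96}
for name, size in sizes.items():
    print(f"{name}: {bits_per_weight(size):.3f} bpw, "
          f"{compression_ratio(size):.1f}x smaller than FP16")
```

Because the sizes are rounded to two decimals, the last digit can differ slightly from the BPW row (most visibly for Q2_K_S).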

⚠️ Caveat. Sub-3.2-bpw quants of a 27B model. Strong for their size, but not a substitute for FP16 / Q4_K_M / Q5_K_M when you have the VRAM. Use them when memory is the binding constraint.

📋 Calibration scope — English & Python, agentic coding. The importance matrix (and the windowed packing that shaped it) was calibrated on real agentic-coding sessions that are overwhelmingly English-language and Python-centric, captured from Claude Code, opencode, and qwen code. At 2 bits the codebook's precision is spent where those logs put it: English prompts and Python-flavored tool use (read / edit / bash / grep / write, etc.). Expect weaker fidelity on other natural languages, non-Python ecosystems, and non-coding / general-chat workloads.

SWE-rebench Results

The agentic coding capabilities of each quant were evaluated on 10 real-world coding issues from nebius/SWE-rebench, using the OpenAI Agents SDK pointed at a local llama-server. For each issue, the agent gets the problem statement and a live bash tool that shells into a dedicated Docker container with the repo pre-checked out at the failing commit. It iterates (reading files, running tests, editing code) until it produces a git diff or hits the step limit. The patch is then graded by running the repo's FAIL_TO_PASS test suite inside the container, so pass/fail reflects real execution, not fuzzy matching. We also tried mini SWE-Agent, but it did not resolve issues adequately despite having a similar patch rate.

| Metric | Q2_K | IQ2_XS | IQ2_M | Q2_K_S | Q5_K_M |
|---|---|---|---|---|---|
| File | Q2_K.gguf | IQ2_XS.gguf | IQ2_M.gguf | Q2_K_S.gguf | Q5_K_M.gguf |
| Technique | none | imatrix | imatrix | imatrix | none |
| Size (GiB) | 10.40 | 8.89 | 9.74 | 9.96 | 19.50 |
| Repetitions | 3 | 3 | 3 | 3 | 3 |
| Issues | 10 | 10 | 10 | 10 | 10 |
| Patch Rate | 88±12% | 70±10% | 100% | 93±6% | 100% |
| Pass Rate | 30±10% | 27±6% | 63±6% | 57±6% | 57±6% |
| Max Turns | 27±15% | 57±25% | 13±15% | 10±17% | 0% |
| Mean Steps | 58.5±7.6 | 73.1±15.1 | 51.6±8.3 | 46.7±8.1 | 38.6±1.3 |
| Mean Tokens | 1,335K±253K | 1,779K±137K | 784K±260K | 922K±195K | 588K±57K |
| Tool Error Rate | 14.6±6.4% | 9.5±3.6% | 12.6±1.8% | 8.9±1.5% | 12.1±0.2% |
| Mean Wall | 415±98s | 558±182s | 381±66s | 425±259s | 307±34s |

Sampling parameters: temperature=0.25, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_tokens=32768, ctx=131072, thinking=true, mtp=true, mtp_draft_n_max=2. Tested on an RTX 4060 Ti (16 GB).

Definitions:

Patch Rate (patched) - share of the 10 issues for which the agent produced a patch, whether or not it resolved the issue.
Pass Rate (resolved) - share of the 10 issues whose patch passed all FAIL_TO_PASS tests.
Max Turns (max_turns) - share of the 10 issues that hit the 100-step cap without resolving.
Mean Steps (mean_steps) - average number of agentic steps taken; each shell call into Docker, file read, or code edit counts as a step.
Mean Tokens (mean_tokens) - average number of tokens generated across the entire agentic episode.
Tool Error Rate (tool_err_rate) - how often the agent issued an invalid shell command that could not be executed (syntax errors, wrong file paths, etc.).
Mean Wall (mean_wall) - average wall-clock time per episode (capped at 2 hours for episodes that hit the step limit).

Overall, the IQ2_M quant achieves a 63% pass rate on this agentic coding benchmark, which is strong for a 2-bit model. Patch rates are high for most quants, so even the weaker ones can usually generate a plausible patch, but their lower pass rates and higher max-turn rates show that many of those patches do not resolve the issue. IQ2_M performs as well as Q5_K_M, albeit with roughly a third more steps and tokens (51.6 vs 38.6 steps; 784K vs 588K tokens); those extra iterations appear to be productive ones that help it self-correct and resolve more issues, rather than looping. A high mean token count combined with a high max-turn rate usually indicates that the agent is stuck in a loop. Notably, Q5_K_M never hits the 100-step cap on these issues. We recommend running these quants with a repetition penalty greater than 1 to break them out of loops. Because sampling introduces variation, we run three repetitions of each quant and report the mean ± standard deviation across those runs.
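As a rough illustration of how a "mean ± standard deviation" cell can arise from three repetitions, the sketch below uses hypothetical per-run pass rates. The actual per-run results are not published here; these values were chosen only because they happen to reproduce a 63±6% cell, and the use of the sample (n-1) standard deviation is likewise an assumption.

```python
from statistics import mean, stdev


def summarize(per_run_percent):
    """Return (mean, sample standard deviation) across repetitions."""
    return mean(per_run_percent), stdev(per_run_percent)


# Hypothetical: 6, 7, and 6 of 10 issues resolved in three repetitions.
hypothetical_pass_rates = [60.0, 70.0, 60.0]

avg, spread = summarize(hypothetical_pass_rates)
print(f"{avg:.0f}±{spread:.0f}%")  # prints 63±6%
```

With only 10 issues per run, a single issue flipping between pass and fail moves a run by 10 percentage points, which is why the spreads in the table are wide.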

🔬 2. How they were made

🧮 2.1 Hybrid importance matrix

At 2-bit the quantizer must decide where to spend its limited precision. An importance matrix measures, per input channel, how much that channel drives each layer's output on a calibration corpus, and tells llama-quantize to preserve the high-impact channels. This release uses a hybrid imatrix blending activation energy E[a²] with weight-column energy ‖W[:, c]‖² · E[a²], collected at ctx=4096. Linear-attention / SSM tensors (this is a Qwen3.6 hybrid architecture) pass through with raw E[a²]. The output is a standard GGUF with no runtime overhead.
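To make the two energy terms concrete, here is a minimal pure-Python sketch that computes E[a²] and ‖W[:, c]‖² · E[a²] per input channel and blends them. The blending rule (normalize each term to mean 1, then mix with a weight `alpha`) and all names are assumptions for illustration; the text above does not specify how the release combines the two terms.

```python
def normalize_to_mean_one(values):
    """Rescale so the terms are comparable before blending (illustrative choice)."""
    avg = sum(values) / len(values)
    return [v / avg for v in values]


def hybrid_importance(activations, weight, alpha=0.5):
    """Per-input-channel importance.

    activations: list of token rows, each with one value per input channel.
    weight:      list of output rows, each with one value per input channel,
                 so column c of the matrix is W[:, c].
    alpha:       hypothetical blend weight (1.0 = activation energy only).
    """
    n_tokens = len(activations)
    n_channels = len(activations[0])

    # Activation energy E[a^2] for each input channel.
    act_energy = [
        sum(row[c] ** 2 for row in activations) / n_tokens
        for c in range(n_channels)
    ]
    # Weight-column energy ||W[:, c]||^2 * E[a^2].
    col_norm_sq = [sum(w_row[c] ** 2 for w_row in weight) for c in range(n_channels)]
    weighted_energy = [col_norm_sq[c] * act_energy[c] for c in range(n_channels)]

    a = normalize_to_mean_one(act_energy)
    w = normalize_to_mean_one(weighted_energy)
    return [alpha * a[c] + (1 - alpha) * w[c] for c in range(n_channels)]


# Toy data: 2 tokens, 2 input channels, 2 output units.
activations = [[1.0, 0.0], [1.0, 2.0]]
weight = [[2.0, 0.0], [0.0, 1.0]]

print(hybrid_importance(activations, weight, alpha=1.0))  # activation energy only
print(hybrid_importance(activations, weight, alpha=0.0))  # weight-column energy only
```

In the toy data, channel 1 has the larger activations but channel 0 has the larger weight column, so the two terms rank the channels in opposite order; a blend sits between them.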

⚡ 2.2 Bundled MTP (multi-token prediction)

Qwopus3.6 ships a trained MTP draft head (one nextn layer, blk.64) that predicts the next token from the trunk's hidden state. llama.cpp runs it as built-in speculative decoding (--spec-type draft-mtp): the head drafts, the trunk verifies in parallel, and accepted drafts skip a full decode step.

We keep the MTP head near-lossless at Q8_0 while the trunk goes 2-bit — the head is tiny relative to the model, and a 2-bit draft head would draft poorly. Measured on Metal (IQ2_M, n-max=1, holdout prompts):

| Config | Decode tok/s | Draft acceptance |
|---|---|---|
| MTP on (n-max=1) | 22.9 ± 0.7 | 79.9% |
| baseline (off) | 18.1 ± 1.7 | n/a |

→ 1.26× speedup on Metal. Qwen3.6 exposes one nextn layer, so --spec-draft-n-max 1 is optimal (higher values don't help). GPU bandwidth matters — the upstream Qwen3.6 figure is ~1.66× on an RTX 5090. See MTP/README.md for details.
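The draft-then-verify loop described in this section can be sketched with toy stand-ins for the trunk and the MTP head. This is not llama.cpp's implementation: `trunk_next` and `draft_next` are hypothetical deterministic functions, acceptance is a plain exact-match check, and the two `trunk_next` calls per iteration stand in for what a real runtime does in one batched trunk pass.

```python
def speculative_decode(trunk_next, draft_next, prompt, n_tokens):
    """Toy speculative decoding with one drafted token per step (n-max=1)."""
    tokens = list(prompt)
    trunk_passes = accepted = drafted = 0
    while len(tokens) - len(prompt) < n_tokens:
        draft = draft_next(tokens)       # cheap guess from the draft head
        drafted += 1
        trunk_passes += 1                # one batched trunk pass verifies the guess
        true_next = trunk_next(tokens)
        tokens.append(true_next)
        if draft == true_next:
            accepted += 1
            # The same trunk pass also scored the position after the draft,
            # so an accepted draft yields a second token without another pass.
            tokens.append(trunk_next(tokens))
    tokens = tokens[: len(prompt) + n_tokens]
    return tokens, trunk_passes, accepted / drafted


def trunk_next(tokens):
    """Hypothetical 'true' model: counts upward."""
    return tokens[-1] + 1


def draft_next(tokens):
    """Hypothetical draft head: agrees with the trunk except after 4, 9, 14, ..."""
    return -1 if tokens[-1] % 5 == 4 else tokens[-1] + 1


out, passes, acceptance = speculative_decode(trunk_next, draft_next, [0], 10)
print(out)                            # the same tokens the trunk alone would produce
print(passes, round(acceptance, 2))   # 10 tokens in 6 trunk passes
```

With one draft per step, the expected number of tokens per trunk pass is at most 1 + acceptance; the measured speedup is lower than that bound because drafting and verification are not free.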

📚 2.3 Calibration & evaluation data

Calibration and every eval corpus are disjoint by construction — the tool-call eval is the held-out 10% of sessions, windowed exactly like calibration but never seen by it — so §1 measures generalization, not fit. All shipped under calibration_data/.

| Corpus | Source | Used for |
|---|---|---|
| Calibration | ~500k tokens of usage-log text (windowed) + all of `wiki.test.raw` | hybrid imatrix collection |
| Eval: tools (in-distribution) | held-out logtrain session slice (10%), windowed like calibration but disjoint from it | §1 tools columns (PPL · KLD · top_p) |
| Eval: general | `combined_en_tiny` (broad English) from the same eaddario dataset | §1 gen columns (PPL · KLD · top_p) |
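One way to get a held-out slice that is disjoint by construction is to split at the session level first and only then cut windows. The sketch below illustrates that idea; the hash-based assignment, the function names, and the 4096-token window size are assumptions, not a description of the tooling actually used.

```python
import hashlib


def is_holdout(session_id, holdout_fraction=0.10):
    """Deterministically assign a session to the eval slice from a hash of its ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdout_fraction


def make_windows(tokens, size=4096):
    """Cut one session's token list into consecutive fixed-size windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def split_sessions(sessions, size=4096, holdout_fraction=0.10):
    """Split whole sessions first, then window each side the same way."""
    calibration, evaluation = [], []
    for session_id, tokens in sessions.items():
        target = evaluation if is_holdout(session_id, holdout_fraction) else calibration
        target.extend((session_id, window) for window in make_windows(tokens, size))
    return calibration, evaluation


# Toy data: 50 sessions of 10 tokens each, windowed at 4 tokens.
sessions = {f"session-{i}": list(range(10)) for i in range(50)}
calibration, evaluation = split_sessions(sessions, size=4)

calib_ids = {sid for sid, _ in calibration}
eval_ids = {sid for sid, _ in evaluation}
print(len(calib_ids), len(eval_ids), calib_ids.isdisjoint(eval_ids))
```

Splitting before windowing matters: if windows were shuffled and split afterwards, neighbouring windows from the same session could land on both sides and leak context into the eval set.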

🚀 3. Usage

Quick start with Ollama

Each quant is exposed as a tag (the filename's quant suffix):

ollama run hf.co/pearsonkyle/Qwopus3.6-27B-Coder-2bit-MTP-GGUF:IQ2_M
# also: :Q2_K_S  ·  :IQ2_XS  ·  :Q2_K

Building llama.cpp from source (GPU)

apt-get update && apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON   # -DGGML_CUDA=OFF for CPU/Metal
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/

MTP needs a recent llama.cpp — --spec-type draft-mtp support was merged in 2026-06. Build from current master.

Running the server with MTP speculative decoding

 ./llama-server \
    --model Qwopus3.6-27B-Coder-IQ2_M.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 999 \
    --spec-type draft-mtp \
    --spec-draft-n-max 1 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --host 0.0.0.0 --port 1234

Drop --spec-type draft-mtp --spec-draft-n-max 1 to run without MTP.

Querying via the OpenAI-compatible API

import json, urllib.request

def ask(content, max_tokens=256):
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        # Coder variant emits <think> reasoning. Set enable_thinking False
        # (or raise max_tokens) so the answer lands in "content".
        "chat_template_kwargs": {"enable_thinking": False},
    }
    req = urllib.request.Request("http://127.0.0.1:1234/v1/chat/completions",
                                 json.dumps(body).encode(),
                                 {"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"]

print(ask("Write a Python function that reverses a linked list."))
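For agentic use, the request body can carry the sampling parameters listed in the SWE-rebench section, plus a repetition penalty above 1 as recommended there. The sketch below only builds the JSON body. It assumes llama-server accepts `top_k`, `min_p`, and `repeat_penalty` as extra fields on the chat-completions endpoint, and the value 1.05 is an illustrative choice, not a tested recommendation.

```python
import json


def build_chat_body(content, max_tokens=32768, repeat_penalty=1.05, thinking=True):
    """Build a chat-completions body using the benchmark's sampling settings."""
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.25,
        "top_p": 0.95,
        "top_k": 20,                       # assumed llama-server extension field
        "min_p": 0.0,                      # assumed llama-server extension field
        "presence_penalty": 0.0,
        "repeat_penalty": repeat_penalty,  # >1 to discourage loops (assumed field name)
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


body = build_chat_body("Fix the failing test in utils.py.")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed to `json.dumps(...).encode()` in place of `body` in the `ask` helper above.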

🪪 4. License & attribution

Inherits its license from the base model Jackrong/Qwopus3.6-27B-Coder. Confirm the exact terms and update the frontmatter `license:` field before publishing.
Base weights: Jackrong/Qwopus3.6-27B-Coder (full finetune of Qwen3.6-27B, ships its own MTP head).
Calibration + quantization performed locally with Quant-Tuner; vendored llama.cpp at commit 32782998.
Calibration data (usage logs) scraped using LogMiner.
