🧰 1. Files & comparison
Three imatrix-calibrated quants, each with the MTP head bundled at Q8_0. Plain Q2_K (no imatrix) is the no-calibration anchor. FP16 reference: 50.90 GiB (not included; fetch from Jackrong/Qwopus3.6-27B-Coder).
| | FP16 (reference) | Q2_K (plain) | IQ2_XS (hybrid) | IQ2_M (hybrid) | Q2_K_S (hybrid) |
|---|---|---|---|---|---|
| File | n/a | Q2_K.gguf | IQ2_XS.gguf | IQ2_M.gguf | Q2_K_S.gguf |
| Quant | FP16 | Q2_K | IQ2_XS | IQ2_M | Q2_K_S |
| Quality | | ❌ | ❌ | ⭐⭐⭐ | ⭐⭐ |
| Technique | none (reference) | plain (no imatrix) | hybrid imatrix | hybrid imatrix | hybrid imatrix |
| Size (GiB) | 50.90 | 10.40 | 8.89 | 9.74 | 9.96 |
| BPW | 16.000 | 3.269 | 2.794 | 3.062 | 3.133 |
| PPL (general) | 6.4826 | 5.5835 | 9.8866 | 8.5961 | 8.0091 |
| KLD med (general) | 0.00000 | 0.1154 | 0.0950 | 0.0535 | 0.0566 |
| top_p (general) | 100.00% | 79.29% | 78.87% | 83.23% | 83.32% |
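As a quick sanity check on the table, the BPW row looks consistent with scaling the 16-bit reference by the file-size ratio. The sketch below assumes that relationship (the card does not state how BPW was computed) and uses only the sizes listed above; the name `approx_bpw` is illustrative.

```python
# Sizes in GiB, copied from the comparison table.
FP16_GIB = 50.90
SIZES_GIB = {"Q2_K": 10.40, "IQ2_XS": 8.89, "IQ2_M": 9.74, "Q2_K_S": 9.96}


def approx_bpw(size_gib, ref_gib=FP16_GIB, ref_bits=16.0):
    """Assumed relationship: bits per weight scales linearly with file size."""
    return ref_bits * size_gib / ref_gib


for name, size in SIZES_GIB.items():
    ratio = FP16_GIB / size
    print(f"{name}: ~{approx_bpw(size):.3f} BPW, {ratio:.1f}x smaller than FP16")
```

The estimates should land close to the BPW row; small differences are expected because the sizes are rounded to two decimals.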
SWE-rebench Results
The agentic coding capabilities of each quant were evaluated on 10 real-world coding issues from nebius/SWE-rebench, using the OpenAI Agents SDK pointed at a local llama-server. For each issue, the agent gets the problem statement and a live bash tool that shells into a dedicated Docker container with the repo pre-checked out at the failing commit. It iterates by reading files, running tests, and editing code until it produces a git diff or hits the step limit. The patch is then graded by actually running the repo's FAIL_TO_PASS test suite inside the container, so pass/fail reflects real execution, not fuzzy matching. We tried mini SWE-Agent, but it wasn't adequately resolving issues despite having a similar patch rate.
| Metric | Q2_K | IQ2_XS | IQ2_M | Q2_K_S | Q5_K_M |
|---|---|---|---|---|---|
| File | Q2_K.gguf | IQ2_XS.gguf | IQ2_M.gguf | Q2_K_S.gguf | Q5_K_M.gguf |
| Technique | none | imatrix | imatrix | imatrix | none |
| Size (GiB) | 10.40 | 8.89 | 9.74 | 9.96 | 19.50 |
| Repetitions | 3 | 3 | 3 | 3 | 3 |
| Issues | 10 | 10 | 10 | 10 | 10 |
| Patch Rate | 88±12% | 70±10% | 100% | 93±6% | 100% |
| Pass Rate | 30±10% | 27±6% | 63±6% | 57±6% | 57±6% |
| Max Turns | 27±15% | 57±25% | 13±15% | 10±17% | 0% |
| Mean Steps | 58.5±7.6 | 73.1±15.1 | 51.6±8.3 | 46.7±8.1 | 38.6±1.3 |
| Mean Tokens | 1,335K±253K | 1,779K±137K | 784K±260K | 922K±195K | 588K±57K |
| Tool Error Rate | 14.6±6.4% | 9.5±3.6% | 12.6±1.8% | 8.9±1.5% | 12.1±0.2% |
| Mean Wall | 415±98s | 558±182s | 381±66s | 425±259s | 307±34s |
Sampling Parameters:
temperature=0.25, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0, max_tokens=32768, ctx=131072, thinking=true, mtp=true, mtp_draft_n_max=2. Tested on an RTX 4060 Ti (16 GB).
Definitions:
- patched: how many of the 10 issues did the agent produce a patch for (even if it didn't resolve)?
- resolved: how many of the 10 issues had patches that passed all FAIL_TO_PASS tests?
- max_turns: how many of the 10 issues hit the 100-step cap without resolving?
- mean_steps: average number of agentic steps taken (shelling into Docker, reading files, editing code counts as steps)
- mean_tokens: average number of tokens generated across the entire agentic episode
- tool_err_rate: how often the agent produced an invalid shell command that couldn't be executed (syntax errors, wrong file paths, etc.)
- mean_wall: average wall-clock time per episode (capped at 2 hours for those that hit the step limit)
Overall, the IQ2_M quant achieves a 63% pass rate on this agentic coding benchmark, which is impressive for a 2-bit model. The high patch rate across all quants suggests that even the weaker ones can still generate plausible patches, but their lower pass rates and higher max-turn rates indicate that many of those patches don't actually resolve the issues. The IQ2_M quant performs as well as Q5_K_M, albeit with ~33% more steps and tokens (51.6 vs. 38.6 steps, 784K vs. 588K tokens); those additional steps look to be effective ones that help it self-correct and resolve more issues, rather than just looping. A high mean token count combined with a high max-turn rate usually indicates the agent is stuck in a loop. It's worth pointing out that Q5_K_M never hits the 100-step cap when solving these issues. We recommend running these quants with a repetition penalty greater than 1 to break them out of loops. Given the variation induced by sampling, we run three repetitions of each quant and report the mean ± standard deviation across those runs.
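The aggregation described above can be sketched as follows. This is a minimal illustration, not the evaluation harness itself: the episode record layout (`patched`, `resolved`, `steps`) and the function names are hypothetical, and the sample standard deviation is an assumption since the card does not say which estimator was used.

```python
from statistics import mean, stdev

STEP_CAP = 100  # per-episode step limit, as given in the definitions


def run_metrics(episodes):
    """Percent-based metrics for one repetition (a list of episode dicts)."""
    n = len(episodes)
    hit_cap = [e["steps"] >= STEP_CAP and not e["resolved"] for e in episodes]
    return {
        "patch_rate": 100 * sum(e["patched"] for e in episodes) / n,
        "pass_rate": 100 * sum(e["resolved"] for e in episodes) / n,
        "max_turns": 100 * sum(hit_cap) / n,
        "mean_steps": mean([e["steps"] for e in episodes]),
    }


def summarize(repetitions):
    """Mean and sample standard deviation of each metric across repetitions."""
    per_run = [run_metrics(r) for r in repetitions]
    return {
        key: (mean([m[key] for m in per_run]), stdev([m[key] for m in per_run]))
        for key in per_run[0]
    }


# Toy data only (two episodes per repetition); not real benchmark results.
toy_repetitions = [
    [{"patched": True, "resolved": True, "steps": 40},
     {"patched": True, "resolved": False, "steps": 100}],
    [{"patched": True, "resolved": True, "steps": 30},
     {"patched": True, "resolved": True, "steps": 50}],
    [{"patched": True, "resolved": False, "steps": 60},
     {"patched": False, "resolved": False, "steps": 100}],
]

for metric, (avg, sd) in summarize(toy_repetitions).items():
    print(f"{metric}: {avg:.1f} ± {sd:.1f}")
```

Computing each rate per repetition first, then averaging, is what makes the ± value a measure of run-to-run sampling variation.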
🔬 2. How they were made
🚀 3. Usage
Quick start with Ollama
Each quant is exposed as a tag (the filename's quant suffix):
ollama run hf.co/pearsonkyle/Qwopus3.6-27B-Coder-2bit-MTP-GGUF:IQ2_M
# also: :Q2_K_S · :IQ2_XS · :Q2_K
Building llama.cpp from source (GPU)
apt-get update && apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON # -DGGML_CUDA=OFF for CPU/Metal
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
MTP needs a recent llama.cpp: --spec-type draft-mtp support was merged in 2026-06. Build from current master.
Running the server with MTP speculative decoding
./llama-server \
--model Qwopus3.6-27B-Coder-IQ2_M.gguf \
--ctx-size 16384 \
--n-gpu-layers 999 \
--spec-type draft-mtp \
--spec-draft-n-max 1 \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--host 0.0.0.0 --port 1234
Drop --spec-type draft-mtp --spec-draft-n-max 1 to run without MTP.
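Loading a 27B model can take a while, so a client may want to wait until the server is ready before sending requests. The sketch below polls llama-server's /health endpoint on the host and port used above; the function name, timeout, and polling interval are illustrative choices, not something the card prescribes.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://127.0.0.1:1234", timeout_s=120.0, poll_s=1.0):
    """Poll GET /health until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not listening yet, or still loading (a 503 raises HTTPError,
            # which is a subclass of URLError).
            pass
        time.sleep(poll_s)
    return False
```

Calling `wait_for_server()` right after launching the server returns True once the model has finished loading, or False if the deadline passes first.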
Querying via the OpenAI-compatible API
import json, urllib.request

def ask(content, max_tokens=256):
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        # Coder variant emits <think> reasoning. Set enable_thinking False
        # (or raise max_tokens) so the answer lands in "content".
        "chat_template_kwargs": {"enable_thinking": False},
    }
    req = urllib.request.Request("http://127.0.0.1:1234/v1/chat/completions",
                                 json.dumps(body).encode(),
                                 {"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"]

print(ask("Write a Python function that reverses a linked list."))
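For agentic use, the request body can carry the sampling settings from the benchmark plus a repetition penalty above 1, as recommended earlier. This is a hedged sketch: `build_chat_body` is a hypothetical helper, 1.05 is an illustrative value rather than a tested one, and `top_k`, `min_p`, and `repeat_penalty` are llama.cpp sampler names that are assumed to be accepted by the chat endpoint (they are not part of the OpenAI schema).

```python
import json


def build_chat_body(content, max_tokens=32768, repeat_penalty=1.05):
    """Request body using the sampling settings listed for the benchmark."""
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.25,
        "top_p": 0.95,
        # llama.cpp-specific sampler fields (assumed to be honored by the server).
        "top_k": 20,
        "min_p": 0.0,
        "repeat_penalty": repeat_penalty,  # illustrative value > 1 to discourage loops
    }


# Encoded payload, ready to pass as the data argument of urllib.request.Request.
payload = json.dumps(build_chat_body("Fix the failing test in utils.py.")).encode()
print(payload.decode())
```

The payload can be sent the same way as in the `ask` helper above; thinking stays enabled here because the benchmark ran with thinking=true.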
🪪 4. License & attribution
- Inherits its license from the base model Jackrong/Qwopus3.6-27B-Coder. Confirm the exact terms and update the frontmatter license: field before publishing.
- Base weights: Jackrong/Qwopus3.6-27B-Coder (full finetune of Qwen3.6-27B, ships its own MTP head).
- Calibration + quantization performed locally with Quant-Tuner; vendored llama.cpp at commit 32782998.
- Calibration data (usage logs) scraped using LogMiner.