Image-Text-to-Text
Transformers
GGUF
llama.cpp
vision
multimodal
text-generation-inference
unsloth
conversational
qwen3_6
reasoning
chain-of-thought
lora
sft
agent
tool-use
function-calling
coder

🪐 Qwopus-3.6-27B-Coder

Coder SFT Release

Agentic Coding & Tool-Use Reasoning Model Fine-Tuned on Qwopus3.6-27B-v2

🧬 Trace Inversion & Negentropy 🧠 27B Dense Model ⚡ Agentic Coding 🛠️ Tool Calling & Agent 🏆 SWE-bench Verified: 67.0% (off-thinking)

💡 What is Qwopus-3.6-27B-Coder?

🪐 Qwopus-3.6-27B-Coder is a reasoning-enhanced agentic coding model built on top of Qwopus3.6-27B-v2. It inherits the powerful reasoning foundation of the v2 base — which achieved 87.43% MMLU-Pro (300ex) and 75.25% SWE-bench Verified — and further specializes it for agentic code generation, structured tool calling, debugging, and instruction-following in developer workflows. The model is designed to excel at repository-level coding tasks, multi-turn tool orchestration, and complex logical reasoning under realistic agent environments.

🧩 Agentic Coding Optimized for repository-level coding, debugging, patch generation, and structured multi-step development workflows.
🛠️ Tool Calling Learns from real agent trajectories with tool definitions, tool calls, and environment feedback for robust multi-turn execution.
🧬 Trace Inversion Inherits the full Qwopus training recipe with reconstructed step-by-step reasoning trajectories from Claude Opus.
🚀 27B Scale Dense 27B parameters with native long-context support, delivering deep reasoning with practical single-GPU deployability.

Community Release Notice: Qwopus-3.6-27B-Coder is an experimental community release intended for research, evaluation, and agent workflow exploration. It has not undergone full safety evaluation or broad general-domain benchmarking.

Benchmark Status: The first completed benchmark is SWE-bench Verified full 500 in thinking-off / no-thinking mode, where the Q5_K_M 27B GGUF run resolved 335/500 = 67.0%. Other benchmark suites remain pending and will be updated as testing completes.


💡 1. Base Model, Training Stack & Collaboration

🧠 1.1 Base Model: Qwopus3.6-27B-v2

Qwopus3.6-27B-v2 is a reasoning-enhanced dense language model built on Qwen3.6-27B. Through a multi-stage curriculum learning pipeline and Trace Inversion augmentation, it achieves strong performance across knowledge, coding, and reasoning benchmarks. This coder variant inherits that foundation and extends it with specialized coding and tool-use data.

Attribute Specifications & Details
🧠 Architecture Dense Transformer / 27 Billion Parameters
🏢 Base Developer Alibaba Cloud (DAMO Academy) — Qwen3.6-27B
🎯 Primary Focus Agentic coding, tool-use stability, code debugging, structured instruction following, repository-level tasks
🧬 Distillation Strategy Trace Inversion + high-quality agent trajectories + curriculum SFT
📄 Context Window Native support up to 32K tokens (fine-tuning target); compatible with longer contexts via RoPE/YaRN scaling
🧪 1.2 Hardware Cooperation & Joint Collaboration
This project is built in close collaboration and joint effort with engineer Kyle Hessling, whose hardware infrastructure and training support made stable 27B-scale fine-tuning and evaluation possible.
👉 You can follow him for hardware and model training updates on X / Twitter: @KyleHessling1
🦥 1.3 Fine-Tuning Framework (Unsloth)
The model training workflow is accelerated and memory-optimized with Unsloth. Special thanks to the Unsloth team for making efficient large-model fine-tuning accessible.
👉 Documentation and fine-tuning guidance: unsloth.ai/docs
1.4 MTP Variant: Faster Speculative Decoding
A Multi-Token Prediction (MTP) variant of this model is also available, featuring auxiliary prediction heads (draft=2) for speculative decoding. Based on the Qwopus3.6-27B-v2-MTP benchmark, the MTP variant achieved ~1.66x speedup over standard decoding with preserved accuracy. See the Qwopus3.6-27B-v2-MTP model card for detailed MTP performance analysis.
🌟The custom MTP heads processing pipeline is open-sourced in qwen-mtp-gguf. If you find this toolkit helpful, please consider leaving a star on GitHub!

📖 2. Background & Motivation

🎯 2.1 Why a 27B Coder Model?
The Qwopus coder line has demonstrated strong results at the 4B and 9B scales. The 27B coder variant represents a significant leap in reasoning depth, code generation quality, and tool-use robustness. At 27B parameters, the model has sufficient capacity to internalize complex repository structures, multi-file dependencies, and nuanced tool-calling patterns — while remaining deployable on a single GPU (e.g., RTX 5090). This scale bridges the gap between compact local models and expensive API-based solutions, making it suitable for production agentic coding workflows.
🧬 2.2 Trace Inversion & Agent Behavior
Commercial and frontier models often expose only compressed reasoning summaries. Qwopus-style training uses Trace Inversion to reconstruct these compressed "Reasoning Bubbles" into fuller learnable reasoning traces. For coding, this is paired with agent trajectories that include tool definitions, tool calls, and real feedback, teaching the model to reason through interactive work rather than only produce static answers.



This model integrates:

  • claude-opus-4.6-traceInversion-9000x: 9,000 high-value, fully reconstructed step-by-step reasoning trajectories.
  • claude-opus-4.7-traceInversion-5000x: 5,000 complex multi-turn logic and mathematics samples optimized for negative entropy reconstruction.
  • lambda/hermes-agent-reasoning-traces: ~10,000 high-quality multi-turn tool-calling trajectories from GLM-5.1 and kimi-4.6 models.

📦 2.3 Special Dataset: Trace Inversion & Agent Traces
Trace Inversion: Uses a specialized logical reconstructor, Trace-Inverter-4B, to reverse-engineer compressed reasoning bubbles into complete, step-by-step learnable CoT chains. This approach addresses the "Information Entropy Trap" — where direct imitation of compressed summaries leads to reasoning fractures — by ensuring the model learns continuous, rigorous logical derivations.



Agent Traces (lambda/hermes-agent-reasoning-traces): Each sample contains real multi-turn tool execution results (not fabricated outputs), with step-by-step reasoning inside <think> tags. Coverage includes:

  • Terminal & Coding: Script writing, debugging, environment configuration
  • Repository Tasks: Bug fixing, refactoring, code review
  • Browser Automation: Web navigation, scraping, form filling
  • Agent Tools: Memory persistence, task delegation, skill management


📊 3. Performance Benchmarks

📊 Evaluation & Performance Metrics

First completed result: SWE-bench Verified full 500, evaluated in no-thinking mode for fast local agentic coding.

No-Thinking SWE-bench Result This benchmark was intentionally run with thinking disabled. The goal is to show the model's practical coding ability when used as a fast local agent, without relying on long visible reasoning traces. On an RTX 5090 with MTP enabled, the model runs at approximately 100 tokens/sec, making this result especially relevant for interactive development workflows.
SWE-bench Verified 67.0% 335 / 500 resolved
Inference Mode Thinking Off no visible CoT required
Local Throughput ~100 t/s RTX 5090 + MTP
Evaluation Build Q5_K_M 27B GGUF quant
Evaluation setup: SWE-bench Verified full 500, Qwopus-3.6-27B-Coder Q5_K_M GGUF, thinking-off / no-thinking mode. Final score: 335/500 = 67.0%.
💻 3.1 SWE-bench Verified: Full 500 No-Thinking Result

SWE-bench Verified measures whether a model can solve real GitHub issues by editing repository code and passing the hidden tests. In this run, Qwopus-3.6-27B-Coder solved 335 out of 500 verified tasks while running in no-thinking mode, prioritizing direct action quality and local speed over long explicit reasoning.

Metric Result Notes
Final score 335/500 = 67.0% Full SWE-bench Verified 500-task split
Mode Thinking off No long visible chain-of-thought during evaluation
Quantization Q5_K_M GGUF Local 27B quantized deployment
Throughput ~100 tokens/sec Observed on RTX 5090 with MTP enabled
🧩 3.2 Repository-Level Breakdown

The result is strongest on practical library-maintenance tasks such as scikit-learn, xarray, requests, and Django, while also showing solid coverage on symbolic mathematics, test infrastructure, documentation tooling, and plotting libraries.

Repository
Resolved
Rate
scikit-learn
27/32
84%
pydata/xarray
18/22
82%
psf/requests
6/8
75%
django
166/231
72%
sympy
48/75
64%
pytest
12/19
63%
sphinx-doc
26/44
59%
matplotlib
20/34
59%
astropy
9/22
41%
pylint
2/10
20%
⚖️ 3.3 SWE-bench Verified Reference Comparison
Important comparison note: the reference scores below are from external model reports and are generally thinking-enabled or harness-specific where noted. Qwopus-3.6-27B-Coder is shown here as a no-thinking, quantized local run, so this table should be read as positioning context rather than a strict same-mode leaderboard.
Model Thinking Mode SWE-bench Verified Context
Qwopus-3.6-27B-Coder Off / No-thinking 67.0 Q5_K_M, RTX 5090 + MTP, ~100 t/s
OpenAI GPT-5On70.1Thinking-on reference
OpenAI GPT-5 miniOn59.8Thinking-on reference
OpenAI GPT-5 nanoOn34.8Thinking-on reference
GLM-4.7On70.6OpenHands reference
GLM-4.5-AirOn57.6OpenHands reference
Qwen3-Coder-30B-A3B-Instruct (2025-07)Off / No-thinking70.3No-thinking reference
Claude 4.0 OpusOn67.6Thinking-on reference
Claude 4.5 OpusOn80.9Thinking-on reference
Qwen3.6-27BOn77.2Thinking-on reference
Qwen3.5-397B-A17BOn76.2Thinking-on reference
Qwen3.5-27BOn75.0Thinking-on reference
Qwen3.6-35B-A3BOn73.4Thinking-on reference
Gemma4-31BOn52.0Thinking-on reference
Gemma4-26B-A4BOn17.4Thinking-on reference
🎮 3.4 Live Thinking-Disabled Demo: Boat Survival

Kyle Hessling also tested Qwopus-3.6-27B-Coder in a small interactive game environment with thinking disabled. The demo is a practical smoke test for fast decision-making, instruction adherence, and local responsiveness beyond static benchmark tables.

Boat Survival thinking-disabled Qwopus-3.6-27B-Coder demo screenshot
Takeaway: The headline is not that this no-thinking local run beats every thinking-enabled frontier reference. The important result is that a quantized 27B local coder can reach 67.0% on the full SWE-bench Verified split while staying fast enough for interactive agent loops. This makes Qwopus-3.6-27B-Coder a practical option for developers who want strong repository-level repair performance without paying the latency cost of long reasoning mode.

🗺️ 4. Training & Data Pipeline Overview

The training process fuses Trace Inversion data augmentation with a Three-Stage Curriculum Learning pipeline. The core engineering focuses on expanding context length gradually while training on reconstructed reasoning traces and real agent trajectories to keep the output format stable.

       [ 🗺️ Trace Inversion: Reconstructing Distillation Workflow ]

  A. Surrogate Model Training (Trace Inverter)
     Open-source Model (GLM-5.1 / DS-V4) ──► Complete Reasoning Chain ──► [ Qwen3-235B Compression ] ──► Reasoning Bubbles
                                              │                                   │
                                              └──────────► [ Training ] ◄─────────┘
                                                   (Base: Qwen3-4B-Instruct)
                                                   (Result: Trace-Inverter-4B)

  B. Inversion Phase: Reconstructing Claude-4.7-Max
     _______________________________________________________
    |                                                       |
    |  Claude-4.7-Max API ──► Compressed Bubbles + Answer   |
    |_______________________________________________________|
                      │
                      ▼
    [ 🧠 Trace-Inverter-4B (Logic Reconstructor) ] ──► Synthetic Deep Reasoning Trace (Learnable CoT)
                      │
                      ▼
    [ 🧩 Data Splicing ] ◄────────── (Original Prompt + Response)
    (Embed reconstructed CoT in <think> tags, splicing with original prompt/response)
                      │
                      ▼
             (Result: claude-opus-4.6/4.7 inverted sets)

  C. Final Coder SFT Curriculum Pipeline
     ___________________________________________
    |                                           |
    |       Base Model (Qwopus3.6-27B-v2)       |
    |___________________________________________|
                      │
                      ▼
    [ 📦 Phase 1: Format Inception ] ──► [ 🛠️ Phase 2: Agent/Coding Expansion ] ──► [ 🚀 Phase 3: Long-Context SFT ]
      ( < 4096 tokens )                     ( 4096 - 8192 tokens )                     ( 8192 - 32K tokens )
      (Stable <think> format)               (Tool traces + coding tasks)               (Long / multi-turn / replay)
                      │                                                                            │
                      └─────────────────────────────┬──────────────────────────────────────────────┘
                                                    ▼
                                   _______________________________________________
                                  |                                               |
                                  |   🌟 Final Model: Qwopus-3.6-27B-Coder        |
                                  |_______________________________________________|

Due to the complex and diverse format of agent trajectory datasets, rigorous cleaning and format standardization were applied to ensure data quality.


📚 5. Three-Stage Curriculum Learning

To steadily scale reasoning quality under long-context inference, Qwopus-3.6-27B-Coder uses a curriculum-style data mixture building on the approach proven in the Qwopus coder line. The model is first stabilized on short, clean reasoning samples, then exposed to complex coding and agent traces, and finally reinforced with longer contexts plus replay data.

Curriculum Stage Focus & Sample Characteristics Strategy Details
📦 Stage 1: Format Inception • Limit context within 4,096 tokens
• Emphasize stable reasoning templates
Focuses on short-to-medium length, cleanly formatted reasoning samples. The primary goal is to establish reliable structured reasoning output, including stable <think> boundaries, before exposing the model to longer chains.
🛠️ Stage 2: Complexity Expansion • Extend length to 4,096 - 8,192 tokens
• Introduce higher-difficulty coding and agent samples
Gradually increases the ratio of complex reasoning chains, code debugging tasks, and multi-turn tool traces. The model learns to connect reasoning, action selection, and environment feedback.
🚀 Stage 3: Long-Context SFT • Progressively scale samples up to 32K tokens
• Use short-sample replay
Pushes the model toward long-context and multi-turn reasoning while replaying high-quality short samples to reduce instruction-following drift. The 32K figure describes the fine-tuning sequence/data mixture target, not a hard architectural limit.

🎯 6. Recommended Use Cases & Known Limits

Good Fits
Agentic code generation and repository-level debugging, complex tool-call orchestration, structured multi-step reasoning, code review and patch generation, DevOps scripting and automation, and any workflow requiring deep logical reasoning combined with tool execution.
Known Limits
As a specialized coder model, it has not undergone comprehensive general-domain safety evaluation. Capability decay may occur in non-coding or non-agent tasks. Tool-call behavior depends strongly on prompt format and tool schema consistency. Long-context performance beyond 32K may require RoPE/YaRN scaling.

Deployment note: The model may emit reasoning inside <think> and </think> tags. Front-end applications and agent frameworks should parse or hide these sections where appropriate. For tool calling, ensure the prompt format and system prompt match the training data configuration to activate agent capabilities.


⚠️ 7. Training & Deployment Notes

Compatibility Notes

  • Tool Calling Format: To activate the model's agent capabilities, ensure the prompt format and system prompt include appropriate tool definitions and match the training data format.
  • Reasoning Output Extraction: The model's thinking process is wrapped in <think> and </think> tags. Front-end applications may need to parse and hide these tags.
  • Long-Context Usage: For contexts beyond 32K, consider enabling RoPE/YaRN scaling (e.g., --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 in llama.cpp).

📋 8. Benchmark Progress

The first completed evaluation is the no-thinking SWE-bench Verified run reported above. Additional local agentic benchmarks remain pending and will be added after testing.

Benchmark Status Result / Reference
SWE-bench Verified ✅ Completed 335/500 = 67.0% (thinking-off, Q5_K_M, RTX 5090 + MTP)
BugFind-15 📋 Pending 9B reference: 79
HermesAgent-20 📋 Pending 9B reference: 85
ToolCall-15 📋 Pending 9B reference: 100
InstructFollow-15 📋 Pending 9B reference: 93

📚 9. Resources & Guides

👉 GitHub Repository: Jackrong-llm-finetuning-guide Access the repository to dive into the codebase and reproduce our results.

👉 Qwen MTP GGUF Processing Workflow A custom splitting and merging methodology designed specifically for Qwen series Multi-Token Prediction (MTP) heads.

👉 benchlocal Evaluation Framework The evaluation framework used to run the local agentic and coding benchmarks.

👉 Qwopus3.6-27B-v2 Model Card Base model card with full MMLU-Pro, SWE-bench, and throughput benchmarks.


🙏 10. Acknowledgements

Special thanks to:

  • The Qwen team for providing the powerful Qwen3.6-27B base model.
  • Unsloth for providing the highly efficient fine-tuning framework.
  • Kyle Hessling for the close collaboration on hardware, training infrastructure, and evaluation support.
  • Open-source datasets and community contributors, particularly lambda/hermes-agent-reasoning-traces for the high-quality agent trajectory data.

📖 11. Citation

@misc{jackrong_qwopus36_27b_coder,
  title        = {Qwopus-3.6-27B-Coder},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwopus-3.6-27B-Coder}}
}
Downloads last month
11,291
GGUF
Model size
0.5B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF

Adapter
(14)
this model
Quantizations
1 model

Datasets used to train Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF