osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP8

MTP PRESERVED fp16 inside this model — native multi-token-prediction speculative decoding works with no external drafter. ✅ VISION tower preserved fp16. ✅ SSM-sensitive params (a_log, dt_bias, conv1d) kept fp16. Quantized with mlx-mtp — a pure–Apple-mlx stack (no third-party ML-inference frameworks at runtime).

MXFP8 (8-bit microscaling) MLX quantization of a ZeroFuse-abliterated Qwopus 3.6 27B Coder (Jackrong's agentic-coding SFT of Qwen 3.6 27B × Claude-Opus reasoning distill). Refusals dropped from 86/100 to 8/100 at a KL divergence of 0.007. The tensor set is identical to the base model's (1199 tensors, including 333 vision and 15 MTP). By the osmAPI research team and TERV.Pro student research team.


⚡ TL;DR

| Property | Value |
|---|---|
| Disk size | ~29.5 GB |
| Scheme | MXFP8 (8-bit microscaling) (OCP microscaling, group_size=32, E8M0 scale) |
| MTP speculative decoding | ✅ Native, embedded; no external drafter |
| Vision | ✅ Preserved (333 ViT weights, fp16) |
| Quantizer | mlx-mtp quantize(mode=mxfp8), pure Apple mlx |
| Refusal rate (ZeroFuse, n=100) | 8/100 (vs source 86/100) |
| KL divergence vs original | 0.007 |
| SWE-bench Verified (base Coder) | 67.0% (off-thinking, 335/500) |
| Recommended RAM | 36 GB+ Apple Silicon |
| Best for | Highest-fidelity local inference · vision · full MX precision |
| Released by | osmAPI · TERV.Pro |

🧬 Lineage

Qwen/Qwen3.6-27B                              (Qwen Team — base multimodal pretrain)
        │
        ▼
Jackrong/Qwopus3.6-27B-v2                     (Jackrong — Claude-Opus reasoning distill)
        │
        ▼
Jackrong/Qwopus3.6-27B-Coder                  (Jackrong — agentic-coding SFT, Trace Inversion)
   ├── Datasets: Claude-opus-4.6-TraceInversion-9000x
   │              Claude-opus-4.7-TraceInversion-5000x
   │              hermes-agent-reasoning-traces
   └── SWE-bench Verified: 67.0% (off-thinking)
        │
        ▼
ZeroFuse abliteration (TPE-100)   (osmAPI · TERV.Pro)
   ├── 100 startup trials
   ├── Best Pareto trial: T98  direction_index=52.43
   └── Refusals 86 → 8/100  KL=0.007
        │
        ▼
MTP restore (mtp.* heads grafted back from original)  (osmAPI · TERV.Pro)
        │
        ▼
MXFP8 (8-bit microscaling) quantization via mlx-mtp (pure Apple mlx)  (osmAPI · TERV.Pro)
   └── LM → mxfp8; vision + MTP head + SSM params → fp16
        │
        ▼
this repo — osmQwopus-3.6-27B-Coder-uncensored-MXFP8


📊 Abliteration Results

Measured with ZeroFuse on mlabonne/harmful_behaviors (100 hard red-team prompts) and KL divergence on mlabonne/harmless_alpaca.

| Stage | Refusals (n=100) ↓ | KL divergence ↓ |
|---|---|---|
| Jackrong/Qwopus3.6-27B-Coder (source) | 86 / 100 | (reference) |
| TPE best (T98), shipped here | 8 / 100 | 0.007 |

90.7% reduction in refusals with coding capabilities preserved. No SFT / LoRA healing required.
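
As a quick check on the figures above, the reduction can be recomputed from the two refusal counts, and the KL metric follows the usual discrete definition. This is an illustrative sketch only: the two token distributions are made-up toy values, and `refusal_reduction` and `kl_divergence` are hypothetical helpers, not ZeroFuse's own code or its exact measurement procedure.

```python
import math

def refusal_reduction(before, after):
    """Relative drop in refusals, as a percentage of the starting count."""
    return 100.0 * (before - after) / before

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Refusal counts from the table: 86/100 before, 8/100 after.
reduction = refusal_reduction(86, 8)
print(f"{reduction:.1f}% fewer refusals")

# Toy next-token distributions (assumed values, for illustration only).
original = [0.70, 0.20, 0.10]
abliterated = [0.68, 0.21, 0.11]
print(f"KL = {kl_divergence(original, abliterated):.4f}")
```

A KL value near zero means the abliterated model's token distribution on harmless prompts stays close to the original's, which is the property the 0.007 figure is meant to capture.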


🧪 Method

Step 1 — Abliteration (ZeroFuse TPE-100)

  1. Setup — ZeroFuse on MPS (M-series Apple Silicon), 128 GB unified memory, batch_size=32.
  2. TPE optimization — 100 Tree-structured Parzen Estimator trials over ZeroFuse's parameter space (direction_index, attn.o_proj.*, mlp.down_proj.*). Best trial T98 at direction_index=52.43. Only self_attn.o_proj and mlp.down_proj of the 64 decoder layers are orthogonalized — the vision tower (model.visual.*) is untouched.
  3. Auto-save — Pareto-best trial (lowest refusals, then lowest KL) merged into base weights via ZeroFuse's adapter-merge path; saved as BF16 safetensors.
  4. MTP restore: mtp.* heads grafted back verbatim from the original using restore_mtp_coder.py, giving a 1199-tensor set identical to the base model's.
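
The orthogonalization in step 2 can be pictured with the standard directional-ablation formula W' = W - r (rᵀ W), which removes the component of a layer's output that lies along a refusal direction r. The sketch below is a minimal pure-Python illustration under that assumption; `orthogonalize`, the toy matrix, and the toy direction are hypothetical, and ZeroFuse's actual implementation (including any per-layer weighting found by the TPE search) may differ.

```python
import math

def orthogonalize(weight, direction):
    """Return W' = W - r (r^T W), with r the unit-normalized direction.

    weight: list of rows, shape [d_out][d_in]; direction: length d_out.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    d_out, d_in = len(weight), len(weight[0])
    # Component of each weight column along r.
    along_r = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract that component so the layer can no longer write along r.
    return [[weight[i][j] - r[i] * along_r[j] for j in range(d_in)]
            for i in range(d_out)]

# Toy 3x2 "projection" matrix and a toy refusal direction (assumed values).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [1.0, 0.0, 1.0]
W_abl = orthogonalize(W, refusal_dir)

# After ablation, every column has (numerically) zero projection onto the direction.
residual = [sum(refusal_dir[i] * W_abl[i][j] for i in range(3)) for j in range(2)]
print(residual)
```

Rows where the direction has a zero entry are left unchanged, which is why the edit can stay confined to the output side of self_attn.o_proj and mlp.down_proj.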

Step 2 — MXFP8 (8-bit microscaling) quantization (mlx-mtp, pure Apple mlx)

from mlx_mtp.quantize import quantize
quantize(src="<MTP-restored bf16 dir>", out="<out dir>", mode="mxfp8")
  1. Tensor-level MX quantization — language-model linears → MXFP8 (8-bit microscaling): each group of 32 weights shares an E8M0 (uint8) exponent scale, giving true 8-bit storage with hardware-accelerated matmul on Apple Silicon. No third-party ML-inference frameworks at runtime — mlx.core.quantize only.
  2. Vision preserved — the entire ViT (model.visual.*, 333 weights) is kept fp16 by mlx-mtp's skip predicate.
  3. MTP preserved, embedded: the MTP head (mtp.*, 15 weights) stays fp16 inside the model. mlx-mtp's engine drives it as a self-drafter: draft one token from the embedded head, verify it in one target forward pass, accept greedily, and roll back both the KV cache and the Gated-DeltaNet SSM state on rejection. Because MXFP8 keeps more weight precision than MXFP4, MTP draft acceptance is expected to be higher than with the MXFP4 build.
  4. SSM params preserved — Qwen3.5 Gated-DeltaNet sensitivities (a_log, dt_bias, conv1d) kept fp16 for stability.
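
To make the group-of-32 scheme in item 1 concrete, here is a minimal pure-Python sketch of microscaling quantization for a single group. It assumes FP8 E4M3 elements and the power-of-two shared-scale rule from the OCP MX format; `round_to_e4m3`, `quantize_group`, and `dequantize_group` are hypothetical helpers written for illustration and are not the mlx-mtp or mlx.core.quantize implementation.

```python
import math

GROUP_SIZE = 32      # from the model card: 32 weights share one scale
E4M3_EMAX = 8        # assumption: E4M3 elements, largest normal is 1.75 * 2**8
E4M3_MAX = 448.0
E4M3_MIN_EXP = -6    # smallest normal exponent; below it the step is fixed

def round_to_e4m3(v):
    """Round v to the nearest magnitude representable with 3 mantissa bits."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), E4M3_MAX)
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)  # floor(log2(mag)), clamped
    step = 2.0 ** (exp - 3)
    return sign * min(round(mag / step) * step, E4M3_MAX)

def quantize_group(weights):
    """Return (shared_exp, elements): one E8M0-style exponent plus FP8-like values."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 0, [0.0] * len(weights)
    shared_exp = (math.frexp(max_abs)[1] - 1) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))     # range of an 8-bit exponent
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(w / scale) for w in weights]

def dequantize_group(shared_exp, elements):
    """Multiply the stored elements back by the shared power-of-two scale."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]

# Toy weights (assumed values) for one group.
weights = [0.05 * math.sin(i) for i in range(GROUP_SIZE)]
shared_exp, elements = quantize_group(weights)
restored = dequantize_group(shared_exp, elements)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(shared_exp, max_err)
```

Because each group stores 32 eight-bit elements plus one eight-bit scale, the effective cost is 8 + 8/32 = 8.25 bits per quantized weight.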

📦 Use it

mlx-mtp loads this checkpoint natively (vision + MTP), on Apple mlx only:

pip install git+https://github.com/junainfinity/mlx-mtp.git

Text / Code — vanilla and native-MTP speculative decoding

from mlx_mtp.loader import load
from mlx_mtp.engine import vanilla_generate, mtp_generate

model, processor, config = load("osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP8")

prompt = "Write a thread-safe LRU cache in Python with unit tests."

# vanilla autoregressive
print(vanilla_generate(model, processor, config, prompt, max_tokens=1024)["text"])

# native MTP speculative decode (embedded head — no external drafter)
r = mtp_generate(model, processor, config, prompt, max_tokens=1024)
print(r["text"])
print(f"{r['tps']:.1f} tok/s | accept {r['accept_rate']*100:.0f}%")

Vision (preserved ViT)

from mlx_mtp.run import _vision_generate
caption = _vision_generate(model, processor, config,
                           "Describe this screenshot and list any UI bugs.",
                           "screenshot.png", max_tokens=512)
print(caption)

MTP + DFlash hybrid (where a DFlash drafter is available)

from mlx_mtp.dflash import load_dflash_drafter
from mlx_mtp.hybrid import hybrid_generate

drafter, _ = load_dflash_drafter("z-lab/Qwen3.6-27B-DFlash")  # external block-diffusion drafter
print(hybrid_generate(model, processor, config, drafter, prompt, max_tokens=1024)["text"])

| Repo | Scheme | Bits | Size |
|---|---|---|---|
| osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP8 (you are here) | MXFP8 (8-bit microscaling) | 8-bit MX | ~29.5 GB |
| osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP4 | MXFP4 (4-bit microscaling) | 4-bit MX | ~16.1 GB |
| Jackrong/Qwopus3.6-27B-Coder, source (not abliterated) | bf16 | 16 | ~54 GB |

⚠️ Behaviour caveats

  • Uncensored. Refusal directions were surgically removed; this model will answer prompts the parent would refuse. Use responsibly and within applicable law. Intended for safety research, red-teaming, creative and educational use.
  • Identity preserved. The model still self-identifies as Qwen (Alibaba Tongyi Lab) — abliteration does not rewrite factual self-knowledge.
  • Heavy chain-of-thought. Qwopus inherits Claude-Opus's verbose reasoning. For terse code, add an instruction such as "Be brief. Output only the code, no explanation."
  • Coder SFT. Fine-tuned for agentic coding (tool-use, debugging, patch generation). General-knowledge tasks may regress vs the v2 base. Vision is preserved structurally but not the SFT focus.
  • MTP note. The MTP head was trained on the base model's pre-abliteration hidden states, so its draft acceptance may be marginally lower after abliteration. Output is unaffected: the MTP head only proposes tokens, and the (abliterated) main model verifies each one.
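
The MTP note can be illustrated with a toy draft-and-verify loop. In the sketch below, `target_next` and `draft_next` are hypothetical stand-ins for the main model and the embedded MTP head (simple deterministic functions, not real networks), and the loop is a simplified illustration rather than mlx-mtp's engine. It shows that an imperfect drafter changes only how many tokens each verify step yields, never which tokens are emitted.

```python
def target_next(ctx):
    """Toy deterministic 'target model': next token depends only on the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx):
    """Toy drafter: agrees with the target unless len(ctx) is a multiple of 3."""
    t = target_next(ctx)
    return t if len(ctx) % 3 else (t + 1) % 50

def vanilla_greedy(prompt, max_tokens):
    ctx = list(prompt)
    for _ in range(max_tokens):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]

def self_draft_greedy(prompt, max_tokens):
    ctx, out = list(prompt), []
    accepted = drafted = 0
    while len(out) < max_tokens:
        draft = draft_next(ctx)
        drafted += 1
        # A real engine gets both predictions from a single verify pass over
        # ctx + [draft]; this toy version calls the target twice instead.
        verified = target_next(ctx)
        if draft == verified:
            accepted += 1
            new = [draft, target_next(ctx + [draft])]  # draft plus bonus token
        else:
            # Rejection: keep the target's token and drop the draft. A real
            # engine would also roll back the KV cache and SSM state here.
            new = [verified]
        for tok in new:
            if len(out) < max_tokens:
                out.append(tok)
                ctx.append(tok)
    return out, accepted / drafted

tokens, accept_rate = self_draft_greedy([1, 2, 3], 20)
print(tokens == vanilla_greedy([1, 2, 3], 20), f"accept {accept_rate:.0%}")
```

A lower acceptance rate therefore costs throughput only; the emitted sequence matches plain greedy decoding token for token.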

🙏 Credits & Gratitude

We are deeply grateful to everyone whose work made this release possible.

Foundation Model: Qwen Team @ Alibaba Tongyi Lab, for Qwen3.6-27B: a world-class open-weight multimodal foundation with hybrid Gated-DeltaNet attention, 262K context, and an MTP speculative-decoding head. Remarkable work, openly shared.

Claude-Opus Reasoning Distill & Coder SFT: Jackrong, for Qwopus3.6-27B-v2 and the agentic-coding extension Jackrong/Qwopus3.6-27B-Coder. The Trace Inversion recipe and resulting quality are what make this abliteration worth doing.

Abliteration Toolkit: osmAPI, for ZeroFuse, an elegant Optuna-driven refusal-ablation framework (TPE search, KL guardrails, checkpointing, LoRA-merge). This release would not exist without it.

MLX: Apple ML Research, for the MLX framework and its first-class MX quantization modes (MXFP4 / MXFP8) that make 27B inference and quantization on Apple Silicon possible at this quality. mlx-mtp is built on mlx.core / mlx.nn alone.

mlx-mtp (junainfinity) — our own pure–Apple-mlx quantization + inference stack for the osmQwopus / Qwen3.5-family VLMs. It vendors and extends the Qwen3.5 architecture (hybrid Gated-DeltaNet + full attention), the vision tower, and a natively-embedded MTP head, with tensor-level MXFP4/MXFP8 quantization that preserves vision + MTP + SSM at fp16. mlx-mtp on GitHub.

osmAPI & TERV.Pro — abliteration, MTP restoration, quantization, and publication by the osmAPI research team and TERV.Pro student research team. osmAPI builds multi-provider LLM routing for the Indian developer ecosystem — the OpenRouter of India.


📜 License

Apache-2.0, inherited from the foundation (Qwen3.6-27B) and the coder fine-tune (Jackrong/Qwopus3.6-27B-Coder) upstream.


Need a hosted endpoint, custom quant, or enterprise inference? osmAPI — multi-provider LLM routing built for the Indian developer ecosystem.
