osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP8
- ✅ MTP preserved in fp16 inside this model: native multi-token-prediction speculative decoding works with no external drafter.
- ✅ Vision tower preserved in fp16.
- ✅ SSM-sensitive params (`a_log`, `dt_bias`, `conv1d`) kept in fp16.

Quantized with mlx-mtp, a pure Apple mlx stack (no third-party ML-inference frameworks at runtime).
MXFP8 (8-bit microscaling) MLX quantization of a ZeroFuse-abliterated Qwopus 3.6 27B Coder (Jackrong's agentic-coding SFT on top of the Claude-Opus reasoning distill of Qwen 3.6 27B). Refusals fall from 86/100 to 8/100 at a KL divergence of 0.007 from the source model. The tensor set is identical to the base model's (1199 tensors, including 333 vision and 15 MTP tensors). By the osmAPI research team and TERV.Pro student research team.
⚡ TL;DR
| Property | Value |
|---|---|
| Disk size | ~29.5 GB |
| Scheme | MXFP8 (OCP 8-bit microscaling, group_size=32, E8M0 scale) |
| MTP speculative decoding | ✅ Native, embedded — no external drafter |
| Vision | ✅ Preserved (333 ViT weights, fp16) |
| Quantizer | mlx-mtp quantize(mode=mxfp8) — pure Apple mlx |
| Refusal rate (ZeroFuse, n=100) | 8/100 (vs source 86/100) |
| KL divergence vs original | 0.007 |
| SWE-bench Verified (base Coder) | 67.0% (off-thinking, 335/500) |
| Recommended RAM | 36 GB+ Apple Silicon |
| Best for | Highest-fidelity local inference · vision · full MX precision |
| Released by | osmAPI · TERV.Pro |
🧬 Lineage
Qwen/Qwen3.6-27B (Qwen Team — base multimodal pretrain)
│
▼
Jackrong/Qwopus3.6-27B-v2 (Jackrong — Claude-Opus reasoning distill)
│
▼
Jackrong/Qwopus3.6-27B-Coder (Jackrong — agentic-coding SFT, Trace Inversion)
├── Datasets: Claude-opus-4.6-TraceInversion-9000x
│ Claude-opus-4.7-TraceInversion-5000x
│ hermes-agent-reasoning-traces
└── SWE-bench Verified: 67.0% (off-thinking)
│
▼
ZeroFuse abliteration (TPE-100) (osmAPI · TERV.Pro)
├── 100 startup trials
├── Best Pareto trial: T98 direction_index=52.43
└── Refusals 86 → 8/100 KL=0.007
│
▼
MTP restore (mtp.* heads grafted back from original) (osmAPI · TERV.Pro)
│
▼
MXFP8 (8-bit microscaling) quantization via mlx-mtp (pure Apple mlx) (osmAPI · TERV.Pro)
└── LM → mxfp8; vision + MTP head + SSM params → fp16
│
▼
this repo — osmQwopus-3.6-27B-Coder-uncensored-MXFP8
Direct upstream links:
- 🏛️ Foundation: Qwen/Qwen3.6-27B
- 🎓 Reasoning distill (v2): Jackrong/Qwopus3.6-27B-v2
- 🛠️ Coder SFT source: Jackrong/Qwopus3.6-27B-Coder
- 🔓 Abliteration tool: ZeroFuse by osmAPI
- 🧮 Quantizer + inference: mlx-mtp (built on Apple MLX)
📊 Abliteration Results
Measured with ZeroFuse on mlabonne/harmful_behaviors (100 hard red-team prompts) and KL divergence on mlabonne/harmless_alpaca.
| Stage | Refusals (n=100) ↓ | KL divergence ↓ |
|---|---|---|
| Jackrong/Qwopus3.6-27B-Coder (source) | 86 / 100 | n/a (reference) |
| TPE best (T98), shipped here | 8 / 100 | 0.007 |
→ 90.7% reduction in refusals with coding capabilities preserved. No SFT / LoRA healing required.
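The two headline numbers can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the 86 and 8 come from the table above, while the two toy distributions and the direction of the KL term (source model as reference) are assumptions, since the card does not spell out how ZeroFuse computes its KL figure.

```python
import math

def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Refusal counts from the table: 86/100 (source) vs 8/100 (shipped).
print(f"{relative_reduction(86, 8):.1f}% fewer refusals")  # 90.7% fewer refusals

# Hypothetical next-token distributions on a harmless prompt:
# p = source model, q = abliterated model. A small KL means q barely drifted.
p = [0.70, 0.20, 0.10]
q = [0.68, 0.21, 0.11]
print(f"KL(p || q) = {kl_divergence(p, q):.4f} nats")
```

In practice such a divergence would be averaged over many prompts and token positions; the toy vectors here only show the shape of the computation.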
🧪 Method
Step 1 — Abliteration (ZeroFuse TPE-100)
- Setup: ZeroFuse on MPS (M-series Apple Silicon), 128 GB unified memory, `batch_size=32`.
- TPE optimization: 100 Tree-structured Parzen Estimator trials over ZeroFuse's parameter space (`direction_index`, `attn.o_proj.*`, `mlp.down_proj.*`). Best trial T98 at `direction_index=52.43`. Only `self_attn.o_proj` and `mlp.down_proj` of the 64 decoder layers are orthogonalized; the vision tower (`model.visual.*`) is untouched.
- Auto-save: the Pareto-best trial (lowest refusals, then lowest KL) is merged into the base weights via ZeroFuse's adapter-merge path and saved as BF16 safetensors.
- MTP restore: `mtp.*` heads are grafted back verbatim from the original using `restore_mtp_coder.py`, giving a 1199-tensor set identical to the base model's.
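To make "orthogonalized" concrete, here is a minimal pure-Python sketch of the standard directional-ablation projection, W' = W - r (rᵀ W), applied to a matrix that writes into the residual stream. This is an assumption about the general technique, not ZeroFuse's actual code; the tiny 3×2 matrix and the direction vector are made up for illustration.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def orthogonalize(weight, direction):
    """Remove the `direction` component from every output of `weight`.

    `weight` has shape (d_out, d_in) as a list of rows, and `direction`
    lives in the d_out (residual-stream) space.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # proj[j] = r^T W[:, j]: how strongly input column j writes along r
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[weight[i][j] - r[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]

def output_along(weight, direction, x):
    """Component of W @ x along `direction`."""
    r = normalize(direction)
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]
    return sum(r_i * y_i for r_i, y_i in zip(r, y))

# Hypothetical toy projection matrix and refusal direction.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [1.0, 1.0, 0.0]
W_ablated = orthogonalize(W, refusal_dir)

x = [0.5, -1.0]
print(output_along(W, refusal_dir, x))          # non-zero before ablation
print(output_along(W_ablated, refusal_dir, x))  # ~0 after ablation
```

After the projection, no input can make the layer write along the removed direction, which is why only the output projections (`o_proj`, `down_proj`) need to be modified.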
Step 2 — MXFP8 (8-bit microscaling) quantization (mlx-mtp, pure Apple mlx)
from mlx_mtp.quantize import quantize
quantize(src="<MTP-restored bf16 dir>", out="<out dir>", mode="mxfp8")
- Tensor-level MX quantization: language-model linears are converted to MXFP8 (8-bit microscaling). Each group of 32 weights shares an E8M0 (`uint8`) exponent scale, giving true 8-bit storage with hardware-accelerated matmul on Apple Silicon. No third-party ML-inference frameworks at runtime, `mlx.core.quantize` only.
- Vision preserved: the entire ViT (`model.visual.*`, 333 weights) is kept in fp16 by mlx-mtp's skip predicate.
- MTP preserved, embedded: the MTP head (`mtp.*`, 15 weights) stays in fp16 inside the model. mlx-mtp's engine drives it as a self-drafter: draft one token from the embedded head, verify it in one target forward pass, accept greedily, and roll back both the KV cache and the Gated-DeltaNet SSM state on rejection. MXFP8 keeps more weight precision, so MTP draft acceptance is higher than with MXFP4.
- SSM params preserved: the Qwen3.5 Gated-DeltaNet parameters that are sensitive to quantization (`a_log`, `dt_bias`, `conv1d`) are kept in fp16 for stability.
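The group-of-32 scheme described above can be sketched in plain Python. This is a simplified reference model under stated assumptions: FP8 elements in E4M3 (max 448), a power-of-two shared scale chosen as floor(log2(amax)) minus the element format's largest exponent, and an E8M0 byte with bias 127. It is not MLX's kernel and is not guaranteed to match `mlx.core.quantize` bit for bit; all function names are hypothetical.

```python
import math

GROUP_SIZE = 32       # weights sharing one scale
E4M3_MAX = 448.0      # largest finite E4M3 magnitude
E4M3_EMAX = 8         # exponent of E4M3's top binade (1.75 * 2**8 = 448)
E4M3_MIN_EXP = -6     # smallest normal exponent; below this, subnormals
MANTISSA_BITS = 3
E8M0_BIAS = 127

def floor_log2(a):
    """Exact floor(log2(a)) for a > 0, via the float's binary exponent."""
    return math.frexp(a)[1] - 1

def round_to_e4m3(x):
    """Round x to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    exp = max(floor_log2(a), E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)   # spacing of the grid in this binade
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantize_group(weights):
    """Return (E8M0 scale byte, E4M3-rounded elements) for one group."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return E8M0_BIAS, [0.0] * len(weights)
    shared_exp = max(-127, min(127, floor_log2(amax) - E4M3_EMAX))
    scale = 2.0 ** shared_exp
    return shared_exp + E8M0_BIAS, [round_to_e4m3(w / scale) for w in weights]

def dequantize_group(scale_byte, elements):
    """Reconstruct approximate weights from the scale byte and elements."""
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [e * scale for e in elements]

# Hypothetical group of 32 small weights.
group = [0.02 * math.sin(i) for i in range(GROUP_SIZE)]
scale_byte, elems = quantize_group(group)
restored = dequantize_group(scale_byte, elems)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale byte={scale_byte}, max abs error={max_err:.2e}")
```

Because the scale is a pure power of two, dequantization is an exponent shift rather than a multiply by an arbitrary float, which is part of what makes the format cheap to decode.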
📦 Use it
mlx-mtp loads this checkpoint natively (vision + MTP), on Apple mlx only:
pip install git+https://github.com/junainfinity/mlx-mtp.git
Text / Code — vanilla and native-MTP speculative decoding
from mlx_mtp.loader import load
from mlx_mtp.engine import vanilla_generate, mtp_generate
model, processor, config = load("osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP8")
prompt = "Write a thread-safe LRU cache in Python with unit tests."
# vanilla autoregressive
print(vanilla_generate(model, processor, config, prompt, max_tokens=1024)["text"])
# native MTP speculative decode (embedded head — no external drafter)
r = mtp_generate(model, processor, config, prompt, max_tokens=1024)
print(r["text"])
print(f"{r['tps']:.1f} tok/s | accept {r['accept_rate']*100:.0f}%")
Vision (preserved ViT)
from mlx_mtp.run import _vision_generate
caption = _vision_generate(model, processor, config,
"Describe this screenshot and list any UI bugs.",
"screenshot.png", max_tokens=512)
print(caption)
MTP + DFlash hybrid (where a DFlash drafter is available)
from mlx_mtp.dflash import load_dflash_drafter
from mlx_mtp.hybrid import hybrid_generate
drafter, _ = load_dflash_drafter("z-lab/Qwen3.6-27B-DFlash") # external block-diffusion drafter
print(hybrid_generate(model, processor, config, drafter, prompt, max_tokens=1024)["text"])
| Repo | Scheme | Bits | Size | |
|---|---|---|---|---|
| osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP8 | MXFP8 (8-bit microscaling) | 8-bit MX | ~29.5 GB | ✅ you are here |
| osmapi/osmQwopus-3.6-27B-Coder-uncensored-MXFP4 | MXFP4 (4-bit microscaling) | 4-bit MX | ~16.1 GB | ↗ |
| Jackrong/Qwopus3.6-27B-Coder, source (not abliterated) | bf16 | 16 | ~54 GB | ↗ |
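The sizes in the table follow roughly from the storage cost of a microscaled weight: the element bits plus one 8-bit E8M0 scale amortized over a group of 32. The sketch below shows that arithmetic; the count of quantized weights is a hypothetical round number, not a figure from this card, and the fp16 vision, MTP, and SSM tensors add to the on-disk total.

```python
def bits_per_weight(element_bits, scale_bits=8, group_size=32):
    """Effective storage per weight: element bits plus the amortized scale."""
    return element_bits + scale_bits / group_size

def quantized_gb(num_weights, element_bits):
    """Approximate size in GB (10**9 bytes) of the quantized linears alone."""
    return num_weights * bits_per_weight(element_bits) / 8 / 1e9

print(bits_per_weight(8))  # 8.25 (MXFP8)
print(bits_per_weight(4))  # 4.25 (MXFP4)

# Hypothetical: assume about 26e9 weights end up quantized.
N_QUANTIZED = 26e9
print(f"MXFP8 linears: ~{quantized_gb(N_QUANTIZED, 8):.1f} GB")
print(f"MXFP4 linears: ~{quantized_gb(N_QUANTIZED, 4):.1f} GB")
```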
⚠️ Behaviour caveats
- Uncensored. Refusal directions were surgically removed; this model will answer prompts the parent would refuse. Use responsibly and within applicable law. Intended for safety research, red-teaming, creative and educational use.
- Identity preserved. The model still self-identifies as Qwen (Alibaba Tongyi Lab) — abliteration does not rewrite factual self-knowledge.
- Heavy chain-of-thought. Qwopus inherits Claude-Opus's verbose reasoning. For terse code, add an instruction such as: "Be brief. Output only the code, no explanation."
- Coder SFT. Fine-tuned for agentic coding (tool-use, debugging, patch generation). General-knowledge tasks may regress vs the v2 base. Vision is preserved structurally but is not the SFT focus.
- MTP note. The MTP head was trained on the base model's pre-abliteration hidden states, so its draft acceptance may be marginally lower after abliteration. Output quality is unaffected: MTP only proposes tokens, which the (abliterated) main model verifies, so a rejected draft costs speed, not correctness.
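Why a weaker drafter cannot change the output is easiest to see in a toy greedy draft-and-verify loop. The sketch below uses two made-up deterministic functions in place of the main model and the MTP head; it is not mlx-mtp's engine, and every name in it is hypothetical. The point it illustrates is that every emitted token is one the target model would have produced anyway.

```python
VOCAB = 50

def target_next(tokens):
    """Toy stand-in for the main model's greedy next token."""
    return (31 * sum(tokens) + len(tokens)) % VOCAB

def draft_next(tokens):
    """Toy stand-in for the MTP head: deliberately wrong at some positions."""
    guess = target_next(tokens)
    return (guess + 1) % VOCAB if len(tokens) % 3 == 0 else guess

def toy_vanilla(prompt, max_new):
    """Plain autoregressive greedy decoding with the target only."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target_next(tokens))
    return tokens

def toy_speculative(prompt, max_new):
    """Draft one token, verify it with the target, accept or roll back."""
    tokens = list(prompt)
    limit = len(prompt) + max_new
    drafted = accepted = 0
    while len(tokens) < limit:
        draft = draft_next(tokens)
        verified = target_next(tokens)  # a real engine gets this from one forward pass
        drafted += 1
        if draft == verified:
            accepted += 1
            tokens.append(draft)
            # The same target forward also yields the token after the draft.
            tokens.append(target_next(tokens))
        else:
            # Rollback: discard the draft and keep the target's own token.
            tokens.append(verified)
    return tokens[:limit], accepted / drafted

prompt = [1, 2, 3]
spec_tokens, accept_rate = toy_speculative(prompt, 40)
print(spec_tokens == toy_vanilla(prompt, 40))  # True
print(f"accept rate: {accept_rate:.0%}")
```

A lower accept rate only means more iterations that emit one token instead of two, so throughput drops while the token sequence stays identical.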
🙏 Credits & Gratitude
We are deeply grateful to everyone whose work made this release possible.
Foundation Model — Qwen Team @ Alibaba Tongyi Lab, for Qwen3.6-27B: a world-class open-weight multimodal foundation with hybrid Gated-DeltaNet attention, 262K context, and an MTP speculative-decoding head. Remarkable work, openly shared.
Claude-Opus Reasoning Distill & Coder SFT — Jackrong, for Qwopus3.6-27B-v2 and the agentic-coding extension Jackrong/Qwopus3.6-27B-Coder. The Trace Inversion recipe and resulting quality are what make this abliteration worth doing.
Abliteration Toolkit — osmAPI, for ZeroFuse, an elegant Optuna-driven refusal-ablation framework (TPE search, KL guardrails, checkpointing, LoRA-merge). This release would not exist without it.
MLX — Apple ML Research, for the MLX framework and its first-class MX quantization modes (MXFP4 / MXFP8) that make 27B inference and quantization on Apple Silicon possible at this quality. mlx-mtp is built on mlx.core / mlx.nn alone.
mlx-mtp (junainfinity) — our own pure–Apple-mlx quantization + inference stack for the osmQwopus / Qwen3.5-family VLMs. It vendors and extends the Qwen3.5 architecture (hybrid Gated-DeltaNet + full attention), the vision tower, and a natively-embedded MTP head, with tensor-level MXFP4/MXFP8 quantization that preserves vision + MTP + SSM at fp16. mlx-mtp on GitHub.
osmAPI & TERV.Pro — abliteration, MTP restoration, quantization, and publication by the osmAPI research team and TERV.Pro student research team. osmAPI builds multi-provider LLM routing for the Indian developer ecosystem — the OpenRouter of India.
📜 License
Apache-2.0, inherited from the foundation (Qwen3.6-27B) and the coder fine-tune (Jackrong/Qwopus3.6-27B-Coder) upstream.
Need a hosted endpoint, custom quant, or enterprise inference? osmAPI — multi-provider LLM routing built for the Indian developer ecosystem.