Rogue Quants · NVFP4

🦅 Ornith-1.0-9B abliterated · NVFP4

🔓 Abliterated

9B agentic coder · refusal direction removed (Heretic) · thinking · GPTQ NVFP4 W4A4

⚙️ NVFP4 · W4A4 💾 ~7.5 GB 📉 PPL 8.02 📐 256K context 🚀 vLLM · Blackwell 🔓 Abliterated 🚫 Refusals 6/100 🛠️ Agentic
Refusals: 100 → 6 of 100 (94% removed)
KL divergence: 0.0416 (<0.5 = capability kept)
Size on disk: 7.5 GB vs 18.8 GB bf16 (~40%)
wikitext-2 PPL: 8.02 (NVFP4 W4A4, GPTQ)

TL;DR: Ornith-1.0-9B, quantized to NVFP4 (W4A4) for vLLM on NVIDIA Blackwell. 7.5 GB, wikitext-2 PPL 8.02, agentic coder, refusals removed.

Ornith-1.0-9B abliterated NVFP4

deepreinforce-ai/Ornith-1.0-9B, abliterated (refusal direction removed) with Heretic, then quantized to NVFP4 (W4A4) in the compressed-tensors nvfp4-pack-quantized format with llm-compressor (GPTQ + MSE, shared fused-layer scales).

Near-lossless and decensored. Abliteration cut refusals from 100/100 to 6/100 of held-out harmful prompts while keeping a KL divergence of 0.0416 to the original model (well under the 0.5 line that signals capability damage). NVFP4 then compresses to ~7.5 GB with a wikitext-2 perplexity of 8.02.

| Metric | Value |
|---|---|
| Refusals (baseline → abliterated) | 100/100 → 6/100 (94% removed) |
| KL divergence (capability preservation) | 0.0416 (lower is better; >0.5 = damage) |
| Heretic search | 200 trials, Pareto-optimal trial 185, per-layer direction |
| Size on disk | 7.5 GB vs ~18.8 GB bf16 (~40%) |
| wikitext-2 PPL | 8.02 |

  • Built for vLLM on NVIDIA Blackwell (4-bit weight + 4-bit activation). Pre-Blackwell GPUs run it weight-only.
  • Loading and generation verified in vLLM v0.23.0 on an NVIDIA GB10 (Blackwell, sm_121).

Uncensored / abliterated model. It follows instructions without refusal guardrails. The abliteration only removes refusals; all other behaviour comes from the base model. You are responsible for how you use it.

Fidelity

Near-lossless versus the bf16 source: wikitext-2 perplexity for this build is 8.02.

| Metric | Value |
|---|---|
| wikitext-2 PPL | 8.02 |
| Weights | NVFP4 W4A4, group size 16 |
| Size | 7.5 GB vs 18.8 GB bf16 (~40%) |
| KL divergence | 0.0416 (capability preservation, lower is better) |

This build pairs the NVFP4 format with GPTQ error compensation, an MSE observer, and shared fused-layer scales, so the drop from bf16 is minimal.
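As a rough check on the size figures, the arithmetic below estimates the effective bits per weight for NVFP4 with group size 16 (a 4-bit value plus one 8-bit FP8 scale shared by 16 weights) and what the checkpoint would weigh if every parameter were quantized. It is a back-of-the-envelope sketch under stated assumptions: 1 GB is taken as 10^9 bytes, the parameter count is derived from the 18.8 GB bf16 figure, and the per-tensor global scale is ignored.

```python
# Back-of-the-envelope size estimate; assumes 1 GB = 1e9 bytes.
BF16_SIZE_GB = 18.8   # bf16 size reported in this card
NVFP4_SIZE_GB = 7.5   # NVFP4 size reported in this card
GROUP_SIZE = 16       # weights sharing one FP8 scale


def nvfp4_bits_per_weight(group_size: int = GROUP_SIZE) -> float:
    """4-bit FP4 value plus an 8-bit FP8 scale amortized over the group."""
    return 4 + 8 / group_size


def approx_params_from_bf16(size_gb: float) -> float:
    """bf16 stores 2 bytes per parameter."""
    return size_gb * 1e9 / 2


params = approx_params_from_bf16(BF16_SIZE_GB)
all_quantized_gb = params * nvfp4_bits_per_weight() / 8 / 1e9

print(f"bits per weight:     {nvfp4_bits_per_weight():.2f}")
print(f"approx. parameters:  {params / 1e9:.1f}B")
print(f"if fully quantized:  {all_quantized_gb:.2f} GB")
print(f"actual / bf16 ratio: {NVFP4_SIZE_GB / BF16_SIZE_GB:.1%}")
```

The fully quantized estimate comes out below the actual 7.5 GB. That gap is consistent with the vision tower, lm_head, and non-Linear tensors staying in higher precision, though the exact split is not stated in this card.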

Quickstart

NVFP4 is auto-detected from config.json (compressed-tensors); no quantization flag needed.

```shell
vllm serve maci0/Ornith-1.0-9B-abliterated-NVFP4 \
  --served-model-name ornith-9b-abliterated-nvfp4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

  • The model supports up to 262144 tokens (--max-model-len); keep at least 128K (131072) to preserve thinking quality.
  • Add --language-model-only to skip the vision tower and free KV cache for text use.
  • The parser flags are not auto-detected; pass them explicitly.
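
Because the server is started with --enable-auto-tool-choice and the qwen3_coder tool parser, requests can carry OpenAI-style tool definitions. The sketch below builds such a request with only the standard library. The `run_tests` tool and its parameters are hypothetical, made up for illustration, and the network call sits in a helper that is not invoked here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # same endpoint as the quickstart
MODEL = "ornith-9b-abliterated-nvfp4"   # value of --served-model-name

# Hypothetical tool, defined only for this example.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test directory"},
            },
            "required": ["path"],
        },
    },
}


def build_tool_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request that offers one tool."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_TESTS_TOOL],
        "tool_choice": "auto",
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request to a running vLLM server and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_tool_request("Run the tests under ./tests and summarize failures.")
print(request.full_url)
# With the server running:
#   reply = send(request)
#   print(reply["choices"][0]["message"].get("tool_calls"))
```

If the model decides to call the tool, the reply's message should carry a `tool_calls` list instead of plain content; executing the tool and returning its result is left to the caller.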

Python (OpenAI client)

```python
from openai import OpenAI

# Any placeholder key works for a local server started without --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

# "model" must match the --served-model-name passed to vllm serve.
r = client.chat.completions.create(
    model="ornith-9b-abliterated-nvfp4",
    messages=[{"role": "user", "content": "Refactor this Python function to run in O(n) and explain the change."}],
)
print(r.choices[0].message.content)
```

curl

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ornith-9b-abliterated-nvfp4",
  "messages": [{"role": "user", "content": "Refactor this Python function to run in O(n) and explain the change."}]
}'
```

About the base model

Ornith-1.0 is a self-improving family of open agentic-coding models from Deep Reinforce. The 9B-Dense member is a Qwen3.5-family vision-language model with thinking-mode reasoning and a 256K context.

  • 32 decoder layers: hybrid gated delta-net linear attention plus full attention, dense MLP, plus a vision tower for image and video input.
  • 256K context (max_position_embeddings 262144).
  • Thinking mode by default, with an instruct toggle (preserved here; abliteration and quantization keep the original chat template).

Abliteration

Heretic runs a TPE-optimized search (200 trials) over the refusal-ablation strength per model component, jointly minimizing refusal rate and KL divergence from the original model, then merges the best trial. Because Ornith is a thinking model, evaluation was run in non-thinking mode so each judged response is a real answer rather than an unfinished <think> block; the refusal direction itself is computed from the prompt's last-token residual and is unaffected by that choice.
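
The idea of a refusal direction can be illustrated with a minimal NumPy sketch. It assumes the common difference-of-means formulation (mean last-token residual on harmful prompts minus the mean on harmless ones) and a simple projection of that direction out of a weight matrix that writes to the residual stream, scaled by a strength `alpha`. This is a simplified illustration on random stand-in data, not Heretic's actual implementation; all function and variable names are hypothetical.

```python
import numpy as np


def refusal_direction(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the harmless mean to the harmful mean.

    Both inputs have shape (num_prompts, hidden_size) and hold the
    last-token residual for each prompt.
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(weight: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove the component along `direction` from a matrix's output.

    `weight` has shape (hidden_size, in_features) and writes into the
    residual stream; alpha=1.0 removes the direction completely.
    """
    projection = np.outer(direction, direction @ weight)
    return weight - alpha * projection


# Random stand-in activations; the harmful set is shifted along axis 0.
rng = np.random.default_rng(0)
hidden, in_features = 64, 32
harmless = rng.normal(size=(100, hidden))
harmful = rng.normal(size=(100, hidden)) + 2.0 * np.eye(hidden)[0]

r = refusal_direction(harmful, harmless)
w = rng.normal(size=(hidden, in_features))
w_ablated = ablate(w, r)

# After full ablation the layer can no longer write along r.
print(np.abs(r @ w_ablated).max())
```

A search like the one described above would tune a strength such as `alpha` per component rather than always removing the direction in full.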

  • Datasets: mlabonne/harmless_alpaca (good) vs mlabonne/harmful_behaviors (bad).
  • Selected trial 185: refusals 6/100, KL divergence 0.0416, per-layer direction scope.
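
The two objectives in the search can be sketched as follows: a KL divergence between the original and modified model's next-token distributions, and a Pareto filter that keeps only trials not beaten on both refusals and KL. This is an illustrative sketch, not Heretic's code; the direction of the KL term, the toy distributions, and the trial values are all assumptions invented for the example.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pareto_front(trials: list[dict]) -> list[dict]:
    """Keep trials that no other trial dominates on (refusals, kl)."""

    def dominates(a: dict, b: dict) -> bool:
        no_worse = a["refusals"] <= b["refusals"] and a["kl"] <= b["kl"]
        better = a["refusals"] < b["refusals"] or a["kl"] < b["kl"]
        return no_worse and better

    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Toy next-token distributions (original vs. modified model), invented here.
original = [0.70, 0.20, 0.10]
modified = [0.65, 0.25, 0.10]
print(f"KL = {kl_divergence(original, modified):.4f}")

# Hypothetical trials; these are not the real search results.
trials = [
    {"id": "a", "refusals": 40, "kl": 0.01},
    {"id": "b", "refusals": 10, "kl": 0.05},
    {"id": "c", "refusals": 12, "kl": 0.30},  # worse than "b" on both axes
    {"id": "d", "refusals": 3, "kl": 0.90},
]
print([t["id"] for t in pareto_front(trials)])
```

Choosing one trial from the front is then a judgment call between fewer refusals and a smaller shift from the original model.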

Quantization

| Setting | Value |
|---|---|
| Scheme | NVFP4, W4A4 |
| Weight rounding | GPTQ (Hessian-based error compensation), MSE observer |
| Weights | FP4 (E2M1), group_size=16, tensor_group, FP8 (E4M3) group scales, global scale shared across fused layers |
| Activations | FP4, dynamic per-group, FP8 (E4M3) scales |
| Quantized | all language-model Linear layers |
| Kept in bf16 | vision tower (model.visual.*), lm_head |
| Untouched | gated delta-net Conv1d and SSM params (A_log, dt_bias); these are not Linear layers |

GPTQ is a quantization-time cost only; inference speed and format are identical to plain round-to-nearest NVFP4, but it chooses better 4-bit values.
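
To make the weight format concrete, here is a minimal round-to-nearest sketch of FP4 (E2M1) group quantization with group size 16. It is a simplification under stated assumptions: the per-group scale is kept as a plain float rather than rounded to FP8 (E4M3), the per-tensor global scale is omitted, and there is no GPTQ error compensation. It is not llm-compressor's implementation, and the function names are hypothetical.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (the sign is stored separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GROUP_SIZE = 16


def quantize_group_rtn(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Round-to-nearest FP4 quantization with one scale per group of 16.

    Returns (codes, scales): `codes` holds signed grid values and `scales`
    one float per group. The weight count must be a multiple of 16.
    """
    groups = weights.reshape(-1, GROUP_SIZE)
    amax = np.abs(groups).max(axis=1, keepdims=True)
    # Map each group's largest magnitude onto the top of the grid (6.0).
    scales = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    scaled = groups / scales
    # Pick the nearest grid magnitude for every element.
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * E2M1_GRID[nearest]
    return codes, scales


def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights from grid codes and group scales."""
    return (codes * scales).reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=64)
codes, scales = quantize_group_rtn(w)
w_hat = dequantize(codes, scales)
print("max abs error:", np.abs(w - w_hat).max())
```

GPTQ replaces the independent nearest-value choice here with one that compensates each rounding error in the remaining weights, while the stored format stays the same.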

Calibration: 512 domain-matched samples (long reasoning + general chat + code), max_seq_len=2048, text-only path through the VL model.

Recommended sampling

Thinking mode is the default.

  • Thinking, precise coding: temperature=0.6, top_p=0.95, top_k=20
  • Thinking, general: temperature=1.0, top_p=0.95, top_k=20
  • Instruct / non-thinking: temperature=0.7, top_p=0.80, top_k=20
  • To run non-thinking, set {%- set enable_thinking = false %} in the chat template, or, with the OpenAI Python client, pass extra_body={"chat_template_kwargs": {"enable_thinking": False}}.
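
The presets above can be collected into a small helper that builds a request body for vLLM's OpenAI-compatible endpoint. This is a sketch: the preset names and the `build_payload` helper are hypothetical, and it assumes the server accepts `top_k` and `chat_template_kwargs` as top-level JSON fields (with the OpenAI Python client, the same fields would go in `extra_body`).

```python
import json

MODEL = "ornith-9b-abliterated-nvfp4"  # value of --served-model-name

# Preset names are made up for this example; values come from the list above.
PRESETS = {
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "thinking": True},
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "thinking": True},
    "instruct": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "thinking": False},
}


def build_payload(prompt: str, preset: str = "thinking_coding") -> dict:
    """Return a chat completion request body using one of the presets."""
    params = dict(PRESETS[preset])
    thinking = params.pop("thinking")
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        **params,
    }
    if not thinking:
        # Thinking is on by default, so only the opt-out needs to be sent.
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return payload


print(json.dumps(build_payload("Summarize this diff.", "instruct"), indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST to /v1/chat/completions, in the same way as the curl example in the quickstart.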

Reproduction

Abliteration: heretic --model deepreinforce-ai/Ornith-1.0-9B (200 trials, export merge), with the chat template's thinking default flipped off during the run for clean non-thinking evaluation, then restored.

Quantization: llmcompressor==0.12.0, compressed-tensors==0.17.1, transformers==5.12.1, torch==2.11.0+cu130, on an NVIDIA GB10 (Blackwell, sm_121). llm-compressor 0.12 shares the NVFP4 global scale across fused layers (q/k/v, gate/up) automatically.

Notes

  • Needs an NVIDIA Blackwell GPU (e.g. GB10, sm_121) for accelerated W4A4; pre-Blackwell GPUs run it weight-only.
  • --reasoning-parser and --tool-call-parser are not auto-detected; pass them explicitly.
  • Thinking mode is on by default; toggle it via the chat template or chat_template_kwargs.
  • No refusal guardrails; you are responsible for how you use it.

License

Apache-2.0, following the base model. Intended use and all responsibility for use follow the base model.

Credits

Part of 🎲 Rogue Quants, a set of NVFP4 (W4A4) quants for vLLM on Blackwell. See the full NVFP4 Quants collection.
Built on NVIDIA GB10 (Blackwell, sm_121) with llm-compressor · GPTQ + MSE · shared fused-layer scales.