TL;DR: Ornith-1.0-9B, quantized to NVFP4 (W4A4) for vLLM on NVIDIA Blackwell. 7.5 GB, wikitext-2 PPL 8.02, agentic coder, refusals removed.
Ornith-1.0-9B abliterated NVFP4
deepreinforce-ai/Ornith-1.0-9B,
abliterated (refusal direction removed) with Heretic,
then quantized to NVFP4 (W4A4) in the compressed-tensors nvfp4-pack-quantized
format with llm-compressor (GPTQ + MSE,
shared fused-layer scales).
Near-lossless and decensored. Abliteration cut refusals from 100/100 to 6/100 of held-out harmful prompts while keeping a KL divergence of 0.0416 to the original model (well under the 0.5 line that signals capability damage). NVFP4 then compresses to ~7.5 GB with a wikitext-2 perplexity of 8.02.
| Metric | Value |
|---|---|
| Refusals (baseline → abliterated) | 100/100 → 6/100 (94% removed) |
| KL divergence (capability preservation) | 0.0416 (lower is better; >0.5 = damage) |
| Heretic search | 200 trials, Pareto-optimal trial 185, per-layer direction |
| Size on disk | ~7.5 GB (18.8 GB in bf16) |
| wikitext-2 PPL | 8.02 |
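
The KL divergence row measures how far the modified model's next-token distribution drifts from the original's. The sketch below computes that quantity for a single position. It is an illustration only: the logits are invented, the four-token vocabulary is a toy, and the direction of the divergence (original relative to modified) is an assumption rather than something this card states about Heretic.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; here p is the original model, q the modified one."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits for one prompt position.
original_logits = [2.0, 1.0, 0.5, -1.0]
modified_logits = [1.9, 1.1, 0.4, -0.8]

p = softmax(original_logits)
q = softmax(modified_logits)
print(f"KL(original || modified) = {kl_divergence(p, q):.4f} nats")
```

Identical distributions give a divergence of zero, and the value grows as the two models disagree more, which is why a small figure is read as preserved capability.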
- Built for vLLM on NVIDIA Blackwell (4-bit weight + 4-bit activation). Pre-Blackwell GPUs run it weight-only.
- Loading and generation verified in vLLM v0.23.0 on an NVIDIA GB10 (Blackwell, sm_121).
Uncensored / abliterated model. It follows instructions without refusal guardrails. The abliteration only removes refusals; all other behaviour comes from the base model. You are responsible for how you use it.
Fidelity
Near-lossless versus the bf16 source: wikitext-2 perplexity for this build is 8.02.
| Metric | Value |
|---|---|
| wikitext-2 PPL | 8.02 |
| Weights | NVFP4 W4A4, group 16 |
| Size | 7.5 GB vs 18.8 GB bf16 (~40%) |
| KL divergence | 0.0416 (capability preservation, lower is better) |
This build applies GPTQ error compensation, an MSE observer, and shared fused-layer scales when quantizing to NVFP4, so the drop from bf16 is minimal.
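
The size figures in the table can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this card (7.5 GB, 18.8 GB, 4-bit values, group size 16, 8-bit group scales); it ignores the per-tensor global scale, which adds a negligible amount.

```python
# Figures quoted in the table above.
nvfp4_gb = 7.5
bf16_gb = 18.8

ratio = nvfp4_gb / bf16_gb
print(f"NVFP4 checkpoint is {ratio:.0%} of the bf16 size")

# Storage cost of one quantized weight: a 4-bit FP4 value plus its share of
# the 8-bit FP8 scale that each group of 16 weights carries.
fp4_bits = 4
scale_bits = 8
group_size = 16
bits_per_weight = fp4_bits + scale_bits / group_size
print(f"Effective bits per quantized weight: {bits_per_weight}")
```

The whole checkpoint shrinks less than the per-weight figure alone would suggest, most likely because the vision tower and lm_head stay in bf16.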
Quickstart
NVFP4 is auto-detected from config.json (compressed-tensors); no quantization flag
needed.
vllm serve maci0/Ornith-1.0-9B-abliterated-NVFP4 \
--served-model-name ornith-9b-abliterated-nvfp4 \
--max-model-len 131072 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder
- Supports up to 262144 tokens; keep at least 128K to preserve thinking quality.
- Add `--language-model-only` to skip the vision tower and free KV cache for text use.
- The parser flags are not auto-detected; pass them explicitly.
Python (OpenAI client)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
r = client.chat.completions.create(
model="ornith-9b-abliterated-nvfp4",
messages=[{"role": "user", "content": "Refactor this Python function to run in O(n) and explain the change."}],
)
print(r.choices[0].message.content)
curl
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ornith-9b-abliterated-nvfp4",
"messages": [{"role": "user", "content": "Refactor this Python function to run in O(n) and explain the change."}]
}'
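
Because the server above is started with `--enable-auto-tool-choice` and a tool-call parser, requests can also offer the model tools. The following is a minimal sketch of such a request using the standard OpenAI chat-completions tool schema. The `run_tests` tool and its `path` parameter are hypothetical and exist only for illustration.

```python
import json

def build_tool_request(model: str, prompt: str) -> dict:
    """Build a chat-completions payload that offers the model one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    # Hypothetical tool, not part of this model card.
                    "name": "run_tests",
                    "description": "Run the project's test suite and return the results.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "path": {"type": "string", "description": "Test file or directory."}
                        },
                        "required": ["path"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }

payload = build_tool_request(
    "ornith-9b-abliterated-nvfp4",
    "Run the tests under tests/unit and summarize any failures.",
)
print(json.dumps(payload, indent=2))

# With the Quickstart server running, the payload can be sent as:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
#   r = client.chat.completions.create(**payload)
#   for call in r.choices[0].message.tool_calls or []:
#       print(call.function.name, json.loads(call.function.arguments))
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the request before sending it.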
About the base model
Ornith-1.0 is a self-improving family of open agentic-coding models from Deep Reinforce. The 9B-Dense member is a Qwen3.5-family vision-language model with thinking-mode reasoning and a 256K context.
- 32 decoder layers: hybrid gated delta-net linear attention plus full attention, dense MLP, plus a vision tower for image and video input.
- 256K context (`max_position_embeddings` = 262144).
- Thinking mode by default, with an instruct toggle (preserved here; abliteration and quantization keep the original chat template).
Abliteration
Heretic runs a TPE-optimized search (200 trials) over
the refusal-ablation strength per model component, jointly minimizing refusal rate and
KL divergence from the original model, then merges the best trial. Because Ornith is a
thinking model, evaluation was run in non-thinking mode so each judged response is a real
answer rather than an unfinished <think> block; the refusal direction itself is computed
from the prompt's last-token residual and is unaffected by that choice.
- Datasets: `mlabonne/harmless_alpaca` (good) vs `mlabonne/harmful_behaviors` (bad).
- Selected trial 185: refusals 6/100, KL divergence 0.0416, per-layer direction scope.
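
To make the idea of removing a refusal direction concrete, here is a toy NumPy sketch. It assumes the common difference-of-means recipe: average the last-token residuals for each prompt set, subtract, normalize, then project that direction out of a weight matrix. The activations are random stand-ins, the hidden size is tiny, and Heretic's actual per-component, per-layer parametrization is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8  # toy hidden size; the real model is far larger

# Stand-ins for last-token residual activations collected on each prompt set.
harmless = rng.normal(size=(32, hidden))
harmful = rng.normal(size=(32, hidden)) + 2.0 * np.eye(hidden)[0]

# Difference-of-means direction, normalized to unit length.
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(weight: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Remove `strength` times the component of the layer output along `direction`.

    `weight` maps inputs to the residual stream (shape: hidden x in_features).
    """
    projector = np.outer(direction, direction)
    return weight - strength * projector @ weight

weight = rng.normal(size=(hidden, hidden))
ablated = ablate(weight, direction, strength=1.0)

x = rng.normal(size=hidden)
print("component along direction before:", float(direction @ (weight @ x)))
print("component along direction after: ", float(direction @ (ablated @ x)))
```

With `strength=1.0` the output component along the direction vanishes up to floating-point error. A search like the one described above would tune a strength of this kind rather than fixing it at 1.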
Quantization
| Setting | Value |
|---|---|
| Scheme | NVFP4, W4A4 |
| Weight rounding | GPTQ (Hessian-based error compensation), MSE observer |
| Weights | FP4 (E2M1), group_size=16, tensor_group, FP8 (E4M3) group scales, shared across fused layers |
| Activations | FP4, dynamic per-group, FP8 (E4M3) scales |
| Quantized | all language-model Linear layers |
| Kept in bf16 | vision tower (model.visual.*), lm_head |
| Untouched | gated delta-net Conv1d and SSM params (A_log, dt_bias); these are not Linear modules |
GPTQ adds cost only at quantization time. Inference speed and the on-disk format are identical to plain round-to-nearest NVFP4; GPTQ simply chooses 4-bit values that better preserve each layer's output.
Calibration: 512 domain-matched samples (long reasoning + general chat + code),
max_seq_len=2048, text-only path through the VL model.
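
For readers unfamiliar with the weight format, the sketch below shows plain round-to-nearest FP4 (E2M1) quantization of one 16-weight group, the baseline that GPTQ improves on. It is a simplification under stated assumptions: the group scale stays a Python float instead of being rounded to FP8 (E4M3), there is no per-tensor global scale, and the weights are toy values. It is not the llm-compressor implementation.

```python
# Magnitudes representable in FP4 E2M1 (a sign bit covers the negatives).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16

def quantize_group(values):
    """Round-to-nearest FP4 quantization of one group with a shared scale.

    Returns (scale, codes) where each code is a signed E2M1 grid value.
    Simplification: the scale is kept as a Python float rather than being
    rounded to FP8 (E4M3) and combined with a per-tensor global scale.
    """
    amax = max(abs(v) for v in values)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(magnitude if v >= 0 else -magnitude)
    return scale, codes

def dequantize_group(scale, codes):
    """Map FP4 codes back to real values."""
    return [scale * c for c in codes]

weights = [0.01 * (i - 7.5) for i in range(GROUP_SIZE)]  # toy weight group
scale, codes = quantize_group(weights)
restored = dequantize_group(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("scale:", scale)
print("max abs error:", max_error)
```

The grid is coarse near its top end (4 jumps straight to 6), so the largest weights in a group carry the largest rounding error. Error-compensating methods such as GPTQ adjust the remaining weights to offset it.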
Recommended sampling
Thinking mode is the default.
- Thinking, precise coding: `temperature=0.6`, `top_p=0.95`, `top_k=20`
- Thinking, general: `temperature=1.0`, `top_p=0.95`, `top_k=20`
- Instruct / non-thinking: `temperature=0.7`, `top_p=0.80`, `top_k=20`
- To run non-thinking, set `{%- set enable_thinking = false %}` in the chat template, or pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` from the OpenAI Python client.
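
The presets above can be wrapped in a small helper for the OpenAI Python client. This is a sketch: the preset names and the `request_kwargs` function are made up for illustration. `top_k` and `chat_template_kwargs` are not standard OpenAI parameters, so they are passed through `extra_body`, which vLLM reads.

```python
# Sampling presets from the list above (preset names are hypothetical).
PRESETS = {
    "thinking-coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "thinking": True},
    "thinking-general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "thinking": True},
    "instruct": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "thinking": False},
}

def request_kwargs(preset: str) -> dict:
    """Translate a preset into keyword arguments for chat.completions.create."""
    p = PRESETS[preset]
    return {
        "temperature": p["temperature"],
        "top_p": p["top_p"],
        # top_k and chat_template_kwargs are not standard OpenAI parameters,
        # so they travel in extra_body.
        "extra_body": {
            "top_k": p["top_k"],
            "chat_template_kwargs": {"enable_thinking": p["thinking"]},
        },
    }

print(request_kwargs("instruct"))

# Usage with the Quickstart server (requires the openai package):
#   client.chat.completions.create(
#       model="ornith-9b-abliterated-nvfp4",
#       messages=[{"role": "user", "content": "Hello"}],
#       **request_kwargs("instruct"),
#   )
```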
Reproduction
Abliteration: heretic --model deepreinforce-ai/Ornith-1.0-9B (200 trials, export merge),
with the chat template's thinking default flipped off during the run for clean non-thinking
evaluation, then restored. Quantization: llmcompressor==0.12.0,
compressed-tensors==0.17.1, transformers==5.12.1, torch==2.11.0+cu130, on an NVIDIA
GB10 (Blackwell, sm_121); llm-compressor 0.12 shares the NVFP4 global scale across fused
layers automatically (q/k/v, gate/up).
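
Before attempting to reproduce the quantization, it can help to confirm that the local environment matches the versions listed above. The check below is a minimal sketch using only the standard library. The `find_mismatches` helper is hypothetical, and it assumes each package reports its version string exactly as pinned (including the `+cu130` local tag for torch).

```python
from importlib import metadata

# Versions quoted in the Reproduction paragraph above.
PINS = {
    "llmcompressor": "0.12.0",
    "compressed-tensors": "0.17.1",
    "transformers": "5.12.1",
    "torch": "2.11.0+cu130",
}

def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (wanted, installed_or_None)} for every pin not met."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches

for package, (wanted, installed) in find_mismatches(PINS).items():
    print(f"{package}: want {wanted}, found {installed or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version.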
Related
- Base model: deepreinforce-ai/Ornith-1.0-9B
- Space: Rogue Quants
- Collection: NVFP4 Quants
- Sibling NVFP4 quants:
Notes
- Needs NVIDIA Blackwell (sm_121, e.g. GB10) for accelerated W4A4; pre-Blackwell GPUs run it weight-only.
- `--reasoning-parser` and `--tool-call-parser` are not auto-detected; pass them explicitly.
- Thinking mode is on by default; toggle it via the chat template or `chat_template_kwargs`.
- No refusal guardrails; you are responsible for how you use it.
License
Apache-2.0, following the base model. Intended use and all responsibility for use follow the base model.
Credits
- Base model: Deep Reinforce (Ornith-1.0)
- Abliteration: Heretic by Philipp Emanuel Weidmann
- Quantization tooling: llm-compressor / compressed-tensors
Built on NVIDIA GB10 (Blackwell, sm_121) with llm-compressor · GPTQ + MSE · shared fused-layer scales.