Ornith-1.0-9B — NVFP4-AWQ

NVFP4 (4-bit) AWQ quantization of deepreinforce-ai/Ornith-1.0-9B, produced with NVIDIA Model Optimizer for deployment on Blackwell GPUs (RTX 50-series, B100/B200, GB200) via TensorRT-LLM, vLLM, or SGLang.

This is a dense 9B coding model. The quantized weights are ~4-bit; expect a footprint around 5–6 GB plus KV cache.
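The 5–6 GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes the NVFP4 layout of 4-bit values with one 8-bit scale per 16-weight block (about 4.5 bits per weight). The `bf16_fraction` parameter is a hypothetical knob for layers left unquantized; the real split for this checkpoint is not stated here.

```python
def estimate_weight_gb(total_params: float, bf16_fraction: float = 0.0) -> float:
    """Rough weight footprint in GB (10^9 bytes) for an NVFP4 checkpoint.

    Assumptions (not measured from this repository):
      - quantized weights cost 4 bits plus one 8-bit scale per 16-weight block
      - `bf16_fraction` of the parameters stay in BF16 (16 bits each)
    """
    nvfp4_bits = 4 + 8 / 16  # 4.5 bits per quantized weight
    bf16_bits = 16
    quantized = total_params * (1 - bf16_fraction) * nvfp4_bits
    unquantized = total_params * bf16_fraction * bf16_bits
    return (quantized + unquantized) / 8 / 1e9


# Everything quantized: a lower bound on the weight footprint.
print(round(estimate_weight_gb(9e9), 2))
# Hypothetical 5% of parameters kept in BF16 (for example lm_head).
print(round(estimate_weight_gb(9e9, bf16_fraction=0.05), 2))
```

Both estimates land between 5 and 6 GB, consistent with the range quoted above. KV cache and activation memory come on top and grow with context length.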

⚠️ Required config patch — read before loading

The upstream Ornith-1.0-9B config.json ships with the multimodal identity:

"architectures": ["Qwen3_5ForConditionalGeneration"],
"model_type": "qwen3_5",

with the text parameters nested inside a text_config block. Loaded as-is through the text/CausalLM path, the model either fails to construct ('Qwen3_5Config' object has no attribute 'vocab_size') or produces garbage output, and ModelOpt's exporter rejects the ForConditionalGeneration model_type.

This checkpoint already has the corrected, flattened config baked in. The text sub-config has been promoted to top level and the identity set to the dense text class:

"architectures": ["Qwen3_5ForCausalLM"],
"model_type": "qwen3_5_text",

If you re-derive or re-export from the original weights, you must apply the same flattening yourself or you will hit the same failures. No model code or behavior is changed; only the config is restructured so that loading routes through the text decoder.
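For anyone repeating the patch, the flattening described above might look roughly like this. It is a minimal sketch, not the script used for this checkpoint: the function name and the `drop_keys` default are hypothetical, and the real upstream config.json should be inspected to decide which multimodal-only keys to remove.

```python
import copy
import json


def flatten_text_config(config: dict, drop_keys=("vision_config",)) -> dict:
    """Promote `text_config` to the top level and set the text-only identity.

    `drop_keys` lists multimodal-only entries to discard; the default is an
    assumption and may not match the real upstream config.
    """
    cfg = copy.deepcopy(config)  # leave the caller's dict untouched
    text_cfg = cfg.pop("text_config", None)
    if text_cfg is None:
        raise ValueError("config has no text_config block; nothing to flatten")

    for key in drop_keys:
        cfg.pop(key, None)

    cfg.update(text_cfg)  # text parameters win over top-level duplicates
    cfg["architectures"] = ["Qwen3_5ForCausalLM"]
    cfg["model_type"] = "qwen3_5_text"
    return cfg


# Toy stand-in for the upstream config; the numeric values are made up.
upstream = {
    "architectures": ["Qwen3_5ForConditionalGeneration"],
    "model_type": "qwen3_5",
    "text_config": {"vocab_size": 1000, "hidden_size": 64},
}
flat = flatten_text_config(upstream)
print(json.dumps(flat, indent=2))
```

After flattening, `vocab_size` sits at the top level, which is what the CausalLM path looks for when it constructs the model.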

Requirements

  • transformers >= 5.8.1 (the qwen3_5 architecture and its text class only exist in recent releases; older versions either don't recognize it or load the wrong class)
  • A Blackwell GPU for inference
  • TensorRT-LLM, vLLM, or SGLang with NVFP4 support
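The transformers version requirement can be checked before loading. The helper below is a small standard-library sketch; the function names are illustrative, and it compares only the leading numeric components of the version, ignoring pre-release suffixes.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '5.8.1.dev0' -> (5, 8, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if `installed` is at least `minimum` (suffixes ignored)."""
    return version_tuple(installed) >= version_tuple(minimum)


def transformers_ok(minimum: str = "5.8.1") -> bool:
    """Check the installed transformers package; False if it is missing."""
    try:
        return meets_minimum(metadata.version("transformers"), minimum)
    except metadata.PackageNotFoundError:
        return False


print("transformers OK:", transformers_ok())
```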

Usage

vLLM

vllm serve <your-username>/Ornith-1.0-9B-NVFP4-AWQ \
  --served-model-name Ornith-1.0-9B \
  --max-model-len 262144 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

SGLang

python -m sglang.launch_server \
  --model-path <your-username>/Ornith-1.0-9B-NVFP4-AWQ \
  --served-model-name Ornith-1.0-9B \
  --context-length 262144 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
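Both commands above start an OpenAI-compatible HTTP server, so a client only needs to POST to /v1/chat/completions. The sketch below uses the standard library only. The function names are illustrative, and the base URL is an assumption: 8000 is vLLM's default port and 30000 is SGLang's, so adjust it if the server was started with a different port.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "Ornith-1.0-9B",
                       max_tokens: int = 512) -> dict:
    """Body for POST /v1/chat/completions; `model` matches --served-model-name."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(body: dict, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and return the parsed JSON response.

    The default base URL assumes a local vLLM server on its default port.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


body = build_chat_request("Write a Python function is_prime(n). Keep it short.")
print(json.dumps(body, indent=2))
# With a server running:
# reply = send_chat_request(body)
# print(reply["choices"][0]["message"]["content"])
```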

Transformers (quick test)

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "<your-username>/Ornith-1.0-9B-NVFP4-AWQ"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, dtype="auto", device_map="auto")

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Write a Python function is_prime(n). Keep it short."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))

This is a reasoning model — replies open with a <think> block before the final answer.
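When calling the model without a server-side reasoning parser, the reasoning can be separated from the final answer with plain string handling. This is a minimal sketch with a hypothetical helper name. It assumes the tags are literally `<think>` and `</think>`, and it also covers the case where the chat template has already emitted the opening tag, so only the closing tag appears in the generated text.

```python
def split_reasoning(reply: str) -> tuple:
    """Split a reply into (reasoning, answer).

    Handles '<think>...</think>answer' as well as replies that contain only
    the closing tag because the opening tag was part of the prompt.
    """
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = reply.partition(close_tag)
    if not sep:
        if open_tag in reply:
            # Reasoning was cut off before the closing tag (e.g. token limit).
            return reply.split(open_tag, 1)[1].strip(), ""
        # No tags at all: treat everything as the answer.
        return "", reply.strip()
    reasoning = head.split(open_tag, 1)[-1]
    return reasoning.strip(), tail.strip()


reasoning, answer = split_reasoning(
    "<think>\ncheck divisors up to sqrt(n)\n</think>\n\ndef is_prime(n): ..."
)
print("reasoning:", reasoning)
print("answer:", answer)
```

A reply that comes back with an empty answer usually means generation stopped inside the reasoning block, in which case raising `max_new_tokens` is the first thing to try.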

Quantization details

  • Method: NVFP4 weight quant + AWQ (awq_lite)
  • Tool: NVIDIA Model Optimizer (hf_ptq.py, unified HF export)
  • Calibration: 512 samples, batch size 8, seq len 512
  • Calibration data: cnn_dailymail + Nemotron-Post-Training-Dataset-v2 (ModelOpt default mix)
  • KV cache: quantized
  • Excluded from quant: lm_head, router/gate layers, conv1d / linear-attention paths
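To give a feel for what NVFP4 weight quantization does numerically, the sketch below quantizes and dequantizes a single block in pure Python. It is a simplified illustration, not ModelOpt's implementation: the block scale is kept as an ordinary float (the real format stores it in FP8 E4M3 and adds a per-tensor scale), and the AWQ channel rescaling step is omitted. The names are illustrative.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive weights


def fake_quantize_block(block):
    """Quantize-dequantize one block of up to BLOCK_SIZE floats.

    Simplified: float scale instead of FP8, no per-tensor scale, no AWQ.
    """
    if len(block) > BLOCK_SIZE:
        raise ValueError("block is larger than BLOCK_SIZE")
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable value.
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(nearest * scale if x >= 0 else -nearest * scale)
    return out


print(fake_quantize_block([6.0, 3.0, -1.5, 0.2]))
```

In this example the block maximum is 6.0, so the scale is 1.0 and the small value 0.2 snaps to zero. That loss of small magnitudes next to a large outlier is the kind of error that activation-aware scaling such as AWQ aims to reduce.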

In spot checks, sample generations before and after quantization were both coherent and on-topic, which suggests limited quality degradation from the 4-bit conversion. This is a qualitative check, not a benchmark comparison.

License

MIT, inherited from the base model. Full credit to DeepReinforce for Ornith-1.0; this repository only redistributes a quantized derivative.

Disclaimer

Community quantization, not affiliated with or endorsed by DeepReinforce. Provided as-is. Verify outputs for your use case.
