Ornith-1.0-9B — NVFP4-AWQ
NVFP4 (4-bit) AWQ quantization of deepreinforce-ai/Ornith-1.0-9B, produced with NVIDIA Model Optimizer for deployment on Blackwell GPUs (RTX 50-series, B100/B200, GB200) via TensorRT-LLM, vLLM, or SGLang.
This is a dense 9B coding model. The quantized weights are ~4-bit; expect a footprint around 5–6 GB plus KV cache.
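The footprint figure can be sanity-checked with a little arithmetic. The sketch below assumes the standard NVFP4 layout (4-bit values in blocks of 16, each block sharing one 8-bit scale) and a nominal parameter count of 9e9; the function name is illustrative. Layers left unquantized (see the quantization details) are stored at higher precision, so the real checkpoint lands somewhat above this lower bound.

```python
def nvfp4_weight_gb(n_params: float, block_size: int = 16, scale_bits: int = 8) -> float:
    """Approximate NVFP4 weight storage in GB (1 GB = 1e9 bytes)."""
    # Each weight costs 4 bits plus its share of the per-block scale.
    bits_per_weight = 4 + scale_bits / block_size
    return n_params * bits_per_weight / 8 / 1e9


# Nominal 9B parameters, assuming every weight is quantized (a lower bound).
print(f"{nvfp4_weight_gb(9e9):.2f} GB")
```

The per-tensor scale that NVFP4 also carries is ignored here because it adds only a few bytes per tensor.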
⚠️ Required config patch — read before loading
The upstream Ornith-1.0-9B config.json ships with the multimodal identity:
"architectures": ["Qwen3_5ForConditionalGeneration"],
"model_type": "qwen3_5",
with the text parameters nested inside a text_config block. Loaded as-is through the text/CausalLM path, the model either fails to construct ('Qwen3_5Config' object has no attribute 'vocab_size') or produces garbage output, and ModelOpt's exporter rejects the ForConditionalGeneration model_type.
This checkpoint already has the corrected, flattened config baked in. The text sub-config has been promoted to top level and the identity set to the dense text class:
"architectures": ["Qwen3_5ForCausalLM"],
"model_type": "qwen3_5_text",
If you re-derive or re-export from the original weights, you must apply the same flatten yourself or you will hit the same wall. No code or behavior of the model is changed — only the config is restructured to route through the text decoder.
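For anyone re-exporting from the original weights, the flatten can be sketched as a small dictionary transform. This is a minimal illustration, not the exact script used for this checkpoint: the function name is hypothetical, and dropping a top-level `vision_config` key is an assumption about what the upstream config contains.

```python
import copy


def flatten_text_config(cfg: dict, drop=("text_config", "vision_config")) -> dict:
    """Return a new config dict with text_config promoted to the top level."""
    # Keep top-level keys except the nested sub-configs (assumed key names).
    flat = {k: copy.deepcopy(v) for k, v in cfg.items() if k not in drop}
    # Promote the text parameters; nested values win over top-level ones.
    flat.update(copy.deepcopy(cfg["text_config"]))
    # Route loading through the dense text decoder class.
    flat["architectures"] = ["Qwen3_5ForCausalLM"]
    flat["model_type"] = "qwen3_5_text"
    return flat


# Toy input with made-up sizes, only to show the shape of the change.
upstream = {
    "architectures": ["Qwen3_5ForConditionalGeneration"],
    "model_type": "qwen3_5",
    "text_config": {"vocab_size": 1000, "hidden_size": 64},
}
print(flatten_text_config(upstream))
```

In practice the input would be read from the upstream config.json with `json.load` and the result written back with `json.dump` before running the export.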
Requirements
- transformers >= 5.8.1 (the qwen3_5 architecture and its text class only exist in recent releases; older versions either don't recognize it or load the wrong class)
- A Blackwell GPU for inference
- TensorRT-LLM, vLLM, or SGLang with NVFP4 support
Usage
vLLM
vllm serve <your-username>/Ornith-1.0-9B-NVFP4-AWQ \
--served-model-name Ornith-1.0-9B \
--max-model-len 262144 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--reasoning-parser qwen3
SGLang
python -m sglang.launch_server \
--model-path <your-username>/Ornith-1.0-9B-NVFP4-AWQ \
--served-model-name Ornith-1.0-9B \
--context-length 262144 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
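Once either server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the default ports (8000 for vLLM, 30000 for SGLang) on localhost; the helper names are illustrative. The model name matches the --served-model-name value used above.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="Ornith-1.0-9B", max_tokens=512):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant message dict."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]


# Example (needs a running server):
# print(chat("Write a Python function is_prime(n). Keep it short.")["content"])
# For SGLang, pass base_url="http://localhost:30000/v1".
```

With a reasoning parser enabled, the server separates the thinking text from the final answer, so `content` should hold only the answer.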
Transformers (quick test)
from transformers import AutoModelForCausalLM, AutoTokenizer
name = "<your-username>/Ornith-1.0-9B-NVFP4-AWQ"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Write a Python function is_prime(n). Keep it short."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
This is a reasoning model — replies open with a <think> block before the final answer.
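When using the raw Transformers path, the reasoning and the final answer arrive in one string. A minimal way to separate them is sketched below; it assumes the literal `<think>` and `</think>` tags survive decoding, which depends on how the tokenizer marks them, and the function name is illustrative.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split decoded output into (reasoning, answer)."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    # Keep only what follows the opening tag, if one is present.
    reasoning = head.split("<think>", 1)[-1].strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning(
    "<think>\ncheck divisors up to sqrt(n)\n</think>\n\ndef is_prime(n): ..."
)
print(answer)
```

If the decoded string still contains the prompt, slice the generated tokens first (for example `out[0][inputs["input_ids"].shape[1]:]`) so that only the model's reply is parsed.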
Quantization details
| Setting | Value |
|---|---|
| Method | NVFP4 weight quant + AWQ (awq_lite) |
| Tool | NVIDIA Model Optimizer (hf_ptq.py, unified HF export) |
| Calibration | 512 samples, batch size 8, seq len 512 |
| Calibration data | cnn_dailymail + Nemotron-Post-Training-Dataset-v2 (ModelOpt default mix) |
| KV cache | quantized |
| Excluded from quant | lm_head, router/gate layers, conv1d / linear-attention paths |
Sample generations before and after quantization were both coherent and on-topic, which suggests little quality degradation from the 4-bit conversion. This is a qualitative check, not a benchmark.
License
MIT, inherited from the base model. Full credit to DeepReinforce for Ornith-1.0; this repository only redistributes a quantized derivative.
Disclaimer
Community quantization, not affiliated with or endorsed by DeepReinforce. Provided as-is. Verify outputs for your use case.