Ornith-1.0-35B-NVFP4
NVFP4 post-training quantization of deepreinforce-ai/Ornith-1.0-35B (Qwen3.5-MoE, 34.7B params) produced with the NVIDIA Model Optimizer.
Quantized, validated and published by Robert Ressl on a single NVIDIA RTX PRO 6000
Blackwell Workstation Edition, using nvidia-modelopt 0.44.0.
Status
This checkpoint is in the Qwen3_5MoeForConditionalGeneration / qwen3_5_moe form (the same form as
nvidia/Qwen3.6-35B-A3B-NVFP4) and is validated to load and generate on vLLM (nightly, --quantization modelopt,
--attention-backend flashinfer, --moe-backend marlin) on a Blackwell GPU.
Quantization
- Format: NVFP4 (4-bit floating point, block size 16) — experts-only quantization
(MoE expert weights quantized to NVFP4 via ModelOpt's
NVFP4_EXPERTS_ONLY_CFG; attention QKV projections, shared experts and the vision encoder kept in higher precision for accuracy). - Tool:
nvidia-modelopt0.44.0 —mtq.quantize+export_hf_checkpoint(Unified HF checkpoint). - Calibration: 512 samples from
cnn_dailymail, seq len 512, max algorithm. The fullQwen3_5MoeForConditionalGenerationmodel is quantized (language-model experts -> NVFP4, vision encoder in BF16), matching the modelopt VLM flow. - Size: ~23 GB (vs ~69 GB BF16, ~3x smaller).
- Caveat: with 512 calibration samples a few rarely-activated experts fall back to a weight-derived amax;
use more calibration data (
--calib_size 2048+) to activate all experts if you re-quantize.
Hardware requirement
NVFP4 inference requires an NVIDIA Blackwell GPU (compute capability sm_120, e.g. RTX PRO 6000 Blackwell, RTX 5090, B200/H200). It will NOT run on Ada/Hopper or older.
Deployment (vLLM)
pip install vllm # use a recent nightly that supports qwen3_5_moe
vllm serve ressl/Ornith-1.0-35B-NVFP4 \
--quantization modelopt \
--moe-backend marlin --attention-backend flashinfer \
--max-model-len 262144 --gpu-memory-utilization 0.90 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--reasoning-parser qwen3 --trust-remote-code
A matching runtime environment needs the CUDA toolkit (nvcc + NVRTC + curand headers) available at
$CUDA_HOME so flashinfer can JIT its kernels, or use a prebuilt vllm/vllm-openai:nightly container.
SGLang note
SGLang 0.5.9 currently loads config.text_config for qwen3_5_moe as a plain dict, which trips its
get_hf_text_config assertion — so SGLang 0.5.9 cannot serve this architecture yet (this affects
nvidia/Qwen3.6-35B-A3B-NVFP4 the same way). Use vLLM until a newer SGLang fixes it.
Reasoning model
Ornith-1.0-35B is a reasoning model: the assistant turn opens with a thinking block before the final
answer. Use --reasoning-parser qwen3 to surface the chain-of-thought in a separate reasoning_content field.
License
MIT — same as the original Ornith-1.0-35B. Original model: deepreinforce-ai/Ornith-1.0-35B.
Citation
@misc{ornith-35b,
title = {{Ornith-1.0-35B}: Agentic Coding, Open to All},
url = {https://deep-reinforce.com/ornith_1_0.html},
author = {{DeepReinforce Team}},
year = {2026}
}
- Downloads last month
- 364
Model tree for ressl/Ornith-1.0-35B-NVFP4
Base model
deepreinforce-ai/Ornith-1.0-35B