Ornith-1.0-35B-NVFP4

NVFP4 post-training quantization of deepreinforce-ai/Ornith-1.0-35B (Qwen3.5-MoE, 34.7B params) produced with the NVIDIA Model Optimizer.

Quantized, validated and published by Robert Ressl on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition, using nvidia-modelopt 0.44.0.

Status

This checkpoint is in the Qwen3_5MoeForConditionalGeneration / qwen3_5_moe form (the same form as nvidia/Qwen3.6-35B-A3B-NVFP4) and is validated to load and generate on vLLM (nightly, --quantization modelopt, --attention-backend flashinfer, --moe-backend marlin) on a Blackwell GPU.

Quantization

  • Format: NVFP4 (4-bit floating point, block size 16) — experts-only quantization (MoE expert weights quantized to NVFP4 via ModelOpt's NVFP4_EXPERTS_ONLY_CFG; attention QKV projections, shared experts and the vision encoder kept in higher precision for accuracy).
  • Tool: nvidia-modelopt 0.44.0 — mtq.quantize + export_hf_checkpoint (Unified HF checkpoint).
  • Calibration: 512 samples from cnn_dailymail, seq len 512, max algorithm. The full Qwen3_5MoeForConditionalGeneration model is quantized (language-model experts -> NVFP4, vision encoder in BF16), matching the modelopt VLM flow.
  • Size: ~23 GB (vs ~69 GB BF16, ~3x smaller).
  • Caveat: with 512 calibration samples a few rarely-activated experts fall back to a weight-derived amax; use more calibration data (--calib_size 2048+) to activate all experts if you re-quantize.

Hardware requirement

NVFP4 inference requires an NVIDIA Blackwell GPU (compute capability sm_120, e.g. RTX PRO 6000 Blackwell, RTX 5090, B200/H200). It will NOT run on Ada/Hopper or older.

Deployment (vLLM)

pip install vllm  # use a recent nightly that supports qwen3_5_moe
vllm serve ressl/Ornith-1.0-35B-NVFP4 \
    --quantization modelopt \
    --moe-backend marlin --attention-backend flashinfer \
    --max-model-len 262144 --gpu-memory-utilization 0.90 \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 --trust-remote-code

A matching runtime environment needs the CUDA toolkit (nvcc + NVRTC + curand headers) available at $CUDA_HOME so flashinfer can JIT its kernels, or use a prebuilt vllm/vllm-openai:nightly container.

SGLang note

SGLang 0.5.9 currently loads config.text_config for qwen3_5_moe as a plain dict, which trips its get_hf_text_config assertion — so SGLang 0.5.9 cannot serve this architecture yet (this affects nvidia/Qwen3.6-35B-A3B-NVFP4 the same way). Use vLLM until a newer SGLang fixes it.

Reasoning model

Ornith-1.0-35B is a reasoning model: the assistant turn opens with a thinking block before the final answer. Use --reasoning-parser qwen3 to surface the chain-of-thought in a separate reasoning_content field.

License

MIT — same as the original Ornith-1.0-35B. Original model: deepreinforce-ai/Ornith-1.0-35B.

Citation

@misc{ornith-35b,
  title  = {{Ornith-1.0-35B}: Agentic Coding, Open to All},
  url    = {https://deep-reinforce.com/ornith_1_0.html},
  author = {{DeepReinforce Team}},
  year   = {2026}
}
Downloads last month
364
Safetensors
Model size
19B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ressl/Ornith-1.0-35B-NVFP4

Quantized
(110)
this model