Qwen3.5-9B NVFP4 (MLP-only, MSE calibration)
NVFP4-quantized variant of Qwen/Qwen3.5-9B, produced with NVIDIA Model Optimizer using the nvfp4_mlp_only_mse-fp8_cast_kv recipe from PR #1391.
Quantization details
| Component | Precision | Notes |
|---|---|---|
| MLP weights (32 layers) | NVFP4 (W4A4, block-16, e2m1 / e4m3 scale) | quantized |
| self_attn QKVO (8 layers) | BF16 | preserved |
| linear_attn blocks (24 layers) | BF16 | preserved (Mamba-style hybrid layers) |
| embed, lm_head, norm, mtp, visual | BF16 | preserved |
| KV cache | FP8 | use_constant_amax: true |
| Calibration | MSE + fp8_scale_sweep: true | static MLP weight scales |
Checkpoint size: 12.38 GB (vs 19.3 GB BF16, −36%).
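To make the "block-16, e2m1 / e4m3 scale" and "MSE" entries more concrete, here is a toy sketch of block-wise FP4 fake quantization with an MSE-driven scale search. It is not NVIDIA Model Optimizer code and does not reproduce the recipe: the function names, the candidate grid of shrink factors, and the choice to keep the block scale in full precision (rather than rounding it to e4m3) are all assumptions made for illustration.

```python
# Illustrative only: toy block-wise FP4 (e2m1) fake quantization with an
# MSE-based scale search. All names here are hypothetical helpers.

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # non-negative e2m1 magnitudes
BLOCK_SIZE = 16  # one shared scale per 16 weights ("block-16")


def round_to_e2m1(x: float) -> float:
    """Round x to the nearest signed e2m1 value, saturating at +/-6."""
    magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return magnitude if x >= 0 else -magnitude


def fake_quantize(block, scale):
    """Quantize and immediately dequantize one block with a shared scale."""
    if scale == 0.0:
        return [0.0] * len(block)
    return [round_to_e2m1(v / scale) * scale for v in block]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def amax_scale(block):
    """Max-based scale: maps the largest magnitude in the block onto 6.0."""
    return max(abs(v) for v in block) / E2M1_GRID[-1]


def mse_scale(block, num_candidates=64):
    """Try progressively smaller scales and keep the one with the lowest MSE.

    The candidate grid (100% down to roughly 50% of the amax scale) is an
    assumption for this sketch, not the recipe's actual sweep.
    """
    base = amax_scale(block)
    best_scale, best_err = base, mse(block, fake_quantize(block, base))
    for i in range(1, num_candidates):
        candidate = base * (1.0 - 0.5 * i / num_candidates)
        err = mse(block, fake_quantize(block, candidate))
        if err < best_err:
            best_scale, best_err = candidate, err
    return best_scale


def bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per weight: 4-bit element plus an 8-bit scale shared by the block."""
    return element_bits + scale_bits / block_size
```

Calling `mse_scale(block)` on a 16-element block that contains one outlier typically returns a scale smaller than `amax_scale(block)`: clipping the outlier costs less error than coarsening every other value. Because the search starts from the amax scale, the MSE result is never worse than the max-based one in this sketch.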
Evaluation
Values in parentheses are absolute percentage-point differences vs Qwen/Qwen3.5-9B BF16.
| Metric | BF16 | This model |
|---|---|---|
| MMLU-Pro pass@1 | 82.89 | 82.40 (−0.49) |
| AIME 2025 avg-of-64 | 67.34 | 65.36 (−1.98) |
| AIME 2025 majority@64 | 90.00 | 87.78 (−2.22) |
| LCB pass@3 | 66.08 | 68.72 (+2.64) |
| GPQA avg-of-8 | 81.06 | 80.68 (−0.38) |
| GPQA majority@8 | 83.84 | 83.59 (−0.25) |
| AA-LCR pass@1 [avg-of-3] | 56.33 | 50.67 (−5.66) |
| AA-LCR pass@3 | 71.00 | 66.00 (−5.00) |
| τ²-bench-telecom pass@1 | 15.79 | 12.28 (−3.51) |
Evaluations were run with nemo-evaluator-launcher using the nemo-skills:26.03 container. The AA-LCR judge was Qwen3-235B-A22B-Instruct-2507 (Non-Reasoning).
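The parenthesized deltas in the table can be recomputed from the two score columns. The short sketch below does that arithmetic; the scores are copied from the table, while the `RESULTS` list and the `deltas` helper are hypothetical names introduced only for this example.

```python
# (metric, BF16 score, quantized score), copied from the evaluation table.
RESULTS = [
    ("MMLU-Pro pass@1", 82.89, 82.40),
    ("AIME 2025 avg-of-64", 67.34, 65.36),
    ("AIME 2025 majority@64", 90.00, 87.78),
    ("LCB pass@3", 66.08, 68.72),
    ("GPQA avg-of-8", 81.06, 80.68),
    ("GPQA majority@8", 83.84, 83.59),
    ("AA-LCR pass@1 [avg-of-3]", 56.33, 50.67),
    ("AA-LCR pass@3", 71.00, 66.00),
    ("tau2-bench-telecom pass@1", 15.79, 12.28),
]


def deltas(results):
    """Return {metric: quantized - BF16}, in percentage points, rounded to 2 places."""
    return {name: round(quant - bf16, 2) for name, bf16, quant in results}


def largest_regression(results):
    """Return the (metric, delta) pair with the most negative delta."""
    return min(deltas(results).items(), key=lambda item: item[1])


for metric, delta in deltas(RESULTS).items():
    print(f"{metric:28s} {delta:+.2f}")
```

Sorting or filtering the resulting dictionary is a quick way to see which benchmarks move most under quantization, for example with `largest_regression(RESULTS)`.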
Usage
vLLM
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-MSE \
--tensor-parallel-size 2 \
--data-parallel-size 4 \
--reasoning-parser qwen3 \
--max-model-len 131072 \
--trust-remote-code \
--disable-custom-all-reduce \
--no-enable-prefix-caching
With --tensor-parallel-size 2 and --data-parallel-size 4, this command uses 8 GPUs (4 replicas of 2 GPUs each); lower either value to fit smaller machines.
For tool-calling workloads (e.g. τ²-bench), also pass:
--enable-auto-tool-choice --tool-call-parser hermes
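Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; `build_chat_payload` and `chat` are hypothetical helper names, and the optional `tools` argument is only meaningful when the server was started with the tool-calling flags above.

```python
import json
import urllib.request

MODEL_ID = "davidyu-nv/Qwen3.5-9B-NVFP4-MSE"
BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM port on the local host


def build_chat_payload(prompt, tools=None, max_tokens=512):
    """Build the JSON body for a single-turn chat completion request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if tools:
        # OpenAI-style tool definitions; requires the server-side tool-calling flags.
        payload["tools"] = tools
    return payload


def chat(prompt, tools=None, base_url=BASE_URL, timeout=600):
    """POST the request to the running server and return the parsed JSON response."""
    body = json.dumps(build_chat_payload(prompt, tools)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server up, `chat("What is the capital of France?")["choices"][0]["message"]["content"]` returns the model's answer text. The generous default timeout is a deliberate choice here, since reasoning models can take a while before producing the final answer.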
Container
Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.
License
Apache 2.0, inherited from Qwen/Qwen3.5-9B.
Acknowledgments