NVFP4-GB10 production feedback + reproducible-bench setup: recipe?
Hi Michael,
Quick context: I'm running your Qwen3-Coder-Next-NVFP4-GB10 in production on a DGX Spark via LocalAI + vLLM 0.23.0 with FlashInfer-Cutlass kernels. The real workload is a Hermes-CLI coding agent doing tool-call sessions, hitting ~62 tok/s steady-state. Solid build, thanks for the work.
I'm currently building a small reproducible benchmark suite (TTFT, throughput, HumanEval-pass-rate, tool-call compliance) that compares NVFP4 builds against each other under realistic streaming-with-tools workloads. The plan is to feed the results back to anyone whose build I include. As context for the benchmark itself: I just landed a vLLM streaming PR for progressive emission with active tool parsers (mudler/LocalAI#10351, with E2E numbers against your build).
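To make the TTFT and throughput metrics concrete, here is a minimal sketch of how they could be derived from a streamed response. It is not the actual harness: `StreamMetrics` and `stream_metrics` are hypothetical names, and it assumes one arrival timestamp per streamed token, which is a simplification because a server may pack several tokens into one chunk.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float            # time from request start to first token
    decode_tok_per_s: float  # tokens per second after the first token
    total_s: float           # time from request start to last token


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute TTFT and decode throughput from per-token arrival times.

    Assumption: token_times holds one monotonic timestamp (in seconds)
    per streamed token, in arrival order.
    """
    if not token_times:
        raise ValueError("no tokens were streamed")

    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start

    # Exclude the first token so prefill time does not dilute the decode rate.
    decode_window = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    rate = decoded / decode_window if decode_window > 0 else 0.0

    return StreamMetrics(ttft_s=ttft, decode_tok_per_s=rate, total_s=total)


if __name__ == "__main__":
    # Synthetic timestamps, not measurements from any real build.
    print(stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0, 2.5]))
```

Keeping TTFT separate from the decode rate matters for tool-call sessions, where long prompts inflate prefill time without saying anything about steady-state generation speed.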
Would you be open to sharing your llm-compressor recipe (oneshot/Quant config + recipe yaml/py)? I'd like to make sure I'm reproducing your build exactly as the baseline before testing variant calibrations (code-focused datasets, larger sample counts, ignore-list tweaks). Happy to keep it private if you'd prefer not to publish it publicly yet.
In return I can share the benchmark harness once it's stable and feed quality numbers back to you per build, which is a useful signal if you want to publish better quality cards on the models.
If you'd rather not share the recipe, that's totally fine; I'll reconstruct it from the model-card hints. I mostly wanted to ask first since you've got the ground-truth setup.
Cheers,
Philipp
Closed, switching to a private DM instead. Apologies for the noise.
Re-opening. Apologies for the noise.