NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.
🌟 Qwen3.6-35B-A3B NVFP4 Quantization by NeuralNet 🧠🤖
This is an NVFP4-quantized version of Qwen/Qwen3.6-35B-A3B, optimized for deployment on NVIDIA Blackwell architecture GPUs using vLLM.
NVFP4 quantization requires an NVIDIA Blackwell architecture GPU (GB200, GeForce RTX 50 series, etc.). This format is not compatible with Ampere, Ada Lovelace, or Hopper GPUs. If you are running on an older GPU, please use a different quantization format.
Original model: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
Quantization Details
This model was quantized to NVFP4 (4-bit NVIDIA Floating Point) using vLLM's built-in quantization pipeline. NVFP4 leverages native FP4 Tensor Core support introduced in Blackwell GPUs, delivering significant memory savings and throughput improvements with minimal quality degradation compared to BF16.
vllm quantize \
--model Qwen/Qwen3.6-35B-A3B \
--quantization nvfp4 \
--output-dir NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4
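To put the memory savings in rough numbers, the sketch below estimates weight storage only. It assumes the nominal 35B parameter count from the model name, that every weight is quantized (real checkpoints usually keep some tensors in higher precision), and the NVFP4 layout of 4-bit values with one 8-bit scale per block of 16 values. The function and constant names are illustrative.

```python
# Back-of-the-envelope weight-memory estimate (weights only; no KV cache or activations).
# Assumptions: 35e9 parameters (nominal, from the model name) and all weights quantized.

NOMINAL_PARAMS = 35e9


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


BF16_BITS = 16.0
# NVFP4 stores 4-bit values plus one FP8 scale per block of 16 values: 4 + 8/16 bits.
NVFP4_BITS = 4.0 + 8.0 / 16.0

bf16_gb = weight_gb(NOMINAL_PARAMS, BF16_BITS)
nvfp4_gb = weight_gb(NOMINAL_PARAMS, NVFP4_BITS)

print(f"BF16 : {bf16_gb:.1f} GB")
print(f"NVFP4: {nvfp4_gb:.1f} GB ({bf16_gb / nvfp4_gb:.2f}x smaller)")
```

Under these assumptions the estimate comes to about 70 GB for BF16 and just under 20 GB for NVFP4. The actual checkpoint size on disk may differ.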
⚡ Deployment with vLLM
This quantized model is intended to be served using vLLM (vllm>=0.9.0 recommended).
Quick Start
vllm serve NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4 \
--quantization nvfp4 \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
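Since the command above enables automatic tool choice, a client can attach OpenAI-style tool definitions to a request. The following is a minimal sketch, not a definitive integration: `get_weather`, `TOOLS`, `REGISTRY`, and the helper names are hypothetical, and the base URL assumes vLLM's default port 8000 because the command does not set `--port`.

```python
import json


# Hypothetical tool used only for illustration; replace with your own function.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


# OpenAI-style tool schema describing the hypothetical tool to the model.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call returned by the model (arguments arrive as a JSON string)."""
    return REGISTRY[name](**json.loads(arguments_json))


def ask_with_tools(prompt: str, base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Send a request with tools attached and run whatever tool calls come back."""
    from openai import OpenAI  # imported here so the helpers above work without the package

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model="NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
    )
    calls = response.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

With the server running, `ask_with_tools("What is the weather in Paris?")` would return the results of any tool calls the model chose to make, or an empty list if it answered directly.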
Using a Config File
# Deploy with: vllm serve --config config.yaml
# Optimized for NVIDIA RTX PRO 6000 (Blackwell)
# Benchmarked: ~85-90 parallel requests, up to 1000 tok/sec at higher context lengths
model: NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4
dtype: bfloat16
kv-cache-dtype: fp8
gpu-memory-utilization: 0.95
max-model-len: 262144
max-num-batched-tokens: 4096
max-num-seqs: 200
max-cudagraph-capture-size: 209
enable-prefix-caching: true
trust-remote-code: true
reasoning-parser: qwen3
enable-auto-tool-choice: true
tool-call-parser: qwen3_coder
default-chat-template-kwargs: '{"enable_thinking": false}'
download-dir: /workspace/models
host: 0.0.0.0
port: 18000
vllm serve --config config.yaml
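Loading the weights can take a while, so it may help to wait until the server responds before sending traffic. The sketch below polls the OpenAI-compatible `/v1/models` endpoint on the port from the config file (18000). The helper names, timeout, and polling interval are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.error
import urllib.request


def models_url(host: str = "localhost", port: int = 18000) -> str:
    """Build the URL of the OpenAI-compatible model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def wait_for_server(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> list[str]:
    """Poll until the server answers, then return the served model IDs."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                payload = json.load(resp)
                return [m["id"] for m in payload.get("data", [])]
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            time.sleep(interval_s)  # server not reachable yet; try again shortly
    raise TimeoutError(f"server at {url} did not become ready within {timeout_s}s")
```

Calling `wait_for_server(models_url())` after launching the server should return a list containing the model name once startup has finished.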
💬 Chat API Usage
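Qwen3.6 uses a standard chat template compatible with OpenAI-format APIs. Thinking mode is enabled by default in the chat template. Note that the example config file above sets default-chat-template-kwargs to {"enable_thinking": false}, which turns thinking off server-wide; when serving with that config, request thinking per call by passing "chat_template_kwargs": {"enable_thinking": True} in extra_body.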
Thinking Mode (Default)
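from openai import OpenAI

# Point the OpenAI client at the local vLLM server (port 18000 matches the config file above).
# The API key is ignored unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:18000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "Your message here"}]

response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4",
    messages=messages,
    max_tokens=32768,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},  # top_k is not a standard OpenAI field, so it goes in extra_body
)
print(response.choices[0].message.content)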
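Because the server commands above pass a reasoning parser, the thinking text is usually returned separately from the final answer. The helper below is a hedged sketch: the attribute name (`reasoning_content` or `reasoning`) is an assumption that depends on the vLLM version, and `split_reasoning` is an illustrative name. A stand-in object is used so the helper can be tried without a running server.

```python
from types import SimpleNamespace


def split_reasoning(message) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the reasoning text is exposed as `reasoning_content` or `reasoning`,
    depending on the vLLM version; neither is part of the standard OpenAI schema.
    """
    reasoning = getattr(message, "reasoning_content", None) or getattr(message, "reasoning", None)
    return (reasoning or "", message.content or "")


# Stand-in object so the helper can be tried without a running server.
fake = SimpleNamespace(reasoning_content="2 + 2 is 4.", content="4")
print(split_reasoning(fake))
```

With a live server, `split_reasoning(response.choices[0].message)` would be the equivalent call.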
Non-Thinking (Instruct) Mode
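# Reuses `client` and `messages` from the thinking-mode example.
response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4",
    messages=messages,
    max_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},  # disable thinking for this request
    },
)
print(response.choices[0].message.content)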
Image Input
# Reuses `client` from the thinking-mode example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4",
    messages=messages,
    max_tokens=32768,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
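When the image lives on disk rather than at a public URL, one common approach is to send it as a base64 data URL in the same `image_url` field. The helpers below are a minimal sketch with illustrative names; the file path in the usage note is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_bytes_to_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_content_part(path: str) -> dict:
    """Build an OpenAI-style image_url content part from a local file."""
    mime_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    return {
        "type": "image_url",
        "image_url": {"url": image_bytes_to_data_url(Path(path).read_bytes(), mime_type)},
    }
```

For example, `image_file_to_content_part("photo.jpg")` could replace the `image_url` entry in the message content shown above.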
⚙️ Recommended Sampling Parameters
| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking — general tasks | 1.0 | 0.95 | 20 | 0.0 |
| Thinking — precise coding | 0.6 | 0.95 | 20 | 0.0 |
| Instruct (non-thinking) | 0.7 | 0.80 | 20 | 1.5 |
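The table above can be kept in one place and turned into request arguments. This is a sketch with illustrative preset keys and function name; it follows the earlier examples in sending `top_k` through `extra_body` and in disabling thinking for instruct mode.

```python
# Presets copied from the table above; the mode keys are illustrative names.
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "instruct": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def sampling_kwargs(mode: str) -> dict:
    """Translate a preset into keyword arguments for client.chat.completions.create."""
    preset = dict(SAMPLING_PRESETS[mode])  # copy so the preset itself is not modified
    extra_body = {"top_k": preset.pop("top_k")}  # top_k is not a standard OpenAI field
    if mode == "instruct":
        extra_body["chat_template_kwargs"] = {"enable_thinking": False}
    return {**preset, "extra_body": extra_body}


print(sampling_kwargs("instruct"))
```

A request can then be written as `client.chat.completions.create(model=..., messages=messages, **sampling_kwargs("thinking_coding"))`.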
📥 Download with huggingface-cli
Install the CLI
pip install -U "huggingface_hub[cli]"
Download the Full Repository
huggingface-cli download NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4 --local-dir ./Qwen3.6-35B-A3B-NVFP4
Download Specific Files
huggingface-cli download NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4 \
--include "*.safetensors" \
--local-dir ./Qwen3.6-35B-A3B-NVFP4
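The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`. The sketch below mirrors the two CLI commands above; `download_kwargs` and `download` are illustrative helper names, and the package is imported lazily so the argument builder can be inspected without it.

```python
REPO_ID = "NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4"


def download_kwargs(local_dir: str, weights_only: bool = False) -> dict:
    """Build arguments for huggingface_hub.snapshot_download."""
    kwargs = {"repo_id": REPO_ID, "local_dir": local_dir}
    if weights_only:
        kwargs["allow_patterns"] = ["*.safetensors"]  # same filter as --include above
    return kwargs


def download(local_dir: str = "./Qwen3.6-35B-A3B-NVFP4", weights_only: bool = False) -> str:
    """Download the repository and return the local path."""
    from huggingface_hub import snapshot_download  # requires `pip install huggingface_hub`

    return snapshot_download(**download_kwargs(local_dir, weights_only))
```

Calling `download()` fetches the full repository, while `download(weights_only=True)` restricts the download to the `.safetensors` files.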
🔧 Hardware Requirements
| Component | Requirement |
|---|---|
| GPU Architecture | NVIDIA Blackwell (sm_100+) |
| VRAM | 24 GB+ recommended |
| CUDA | 12.8+ |
| vLLM | 0.9.0+ |
NVFP4 is exclusively supported on NVIDIA Blackwell GPUs. Attempting to run this model on Ampere (A100), Ada Lovelace (RTX 40 series), or Hopper (H100) will fail. For those architectures, use the original BF16 model or an AWQ/GPTQ quantized variant.
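A quick way to check a machine against the architecture requirement is to read the GPU's CUDA compute capability. The sketch below uses the "sm_100+" threshold from the table as a simple heuristic; it is not the check vLLM itself performs, and the function names are illustrative. PyTorch is imported lazily and is assumed to be a CUDA-enabled build.

```python
def supports_nvfp4(capability: tuple[int, int]) -> bool:
    """Return True if a CUDA compute capability is Blackwell-class (sm_100 or newer)."""
    major, minor = capability
    return major * 10 + minor >= 100


def check_local_gpu(device_index: int = 0) -> bool:
    """Query the local GPU through PyTorch and report whether it meets the requirement."""
    import torch  # imported lazily; assumes a CUDA-enabled PyTorch install

    if not torch.cuda.is_available():
        return False
    return supports_nvfp4(torch.cuda.get_device_capability(device_index))


# Reference points: A100 is (8, 0), RTX 4090 is (8, 9), H100 is (9, 0), B200 is (10, 0).
for name, cap in {"A100": (8, 0), "RTX 4090": (8, 9), "H100": (9, 0), "B200": (10, 0)}.items():
    print(name, supports_nvfp4(cap))
```

On a target machine, `check_local_gpu()` returns `True` only when the first visible GPU reports compute capability 10.0 or higher.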
🌐 Contact Us
NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.
Website: https://neuralnet.solutions
Email: info[at]neuralnet.solutions