NVIDIA GeForce RTX 5090
I wanted to share my local setup and benchmark results for nvidia/Qwen3.6-35B-A3B-NVFP4, since it shows what seems possible on a single consumer Blackwell GPU when the runtime is tuned for maximum context rather than throughput.
192K profile is roughly 100-170+ tok/s, with recent logs averaging about 145 tok/s.
Hardware
- GPU: NVIDIA GeForce RTX 5090, 32 GB VRAM
- Runtime: Docker Desktop with WSL2 backend
- NVIDIA driver: 610.47
- Docker engine: 29.4.3
Runtime setup
- vLLM image:
vllm/vllm-openai:cu130-nightly - vLLM version:
0.19.2rc1.dev134+gfe9c3d6c5.cu130 - Model:
nvidia/Qwen3.6-35B-A3B-NVFP4 - Served model name:
qwen36 - Quantization:
modelopt - KV cache:
fp8 - Tensor parallel size:
1 - CPU offload: not used
For the max-context profiles I used:
--max-num-seqs 1
--kv-cache-dtype fp8
--async-scheduling
--enable-chunked-prefill
--language-model-only
--skip-mm-profiling
--no-enable-prefix-caching
--no-calculate-kv-scales
The highest stable no-offload profile I measured was:
--max-model-len 196608
--max-num-batched-tokens 384
--gpu-memory-utilization 0.94
Results
| Profile | Max model len | Long recall prompt tokens | Long recall latency, prefill + decode | Peak VRAM MiB | Result |
|---|---|---|---|---|---|
boot32k |
32,768 | 28,447 | 2.268 s | 28,763 | Pass |
speed64k_agent |
65,536 | 56,867 | 4.801 s | 30,745 | Pass |
maxctx128k_text |
131,072 | 113,663 | 17.602 s | 31,109 | Pass |
maxctx144k_text_probe |
147,456 | 127,894 | 26.041 s | 31,785 | Pass |
maxctx160k_text_probe |
163,840 | 142,083 | 25.721 s | 32,018 | Pass |
maxctx192k_text_probe |
196,608 | 170,518 | 40.194 s | 31,958 | Pass |
Each passing profile validated:
- vLLM startup
/v1/models- chat completion
- long-context marker recall
- VRAM sampling
- basic instruction-following and JSON output quality
Practical takeaway
On this machine, maxctx192k_text_probe was the highest stable no-offload context profile I measured. It is close to the 32 GB VRAM ceiling, so I would treat it as a dedicated-GPU max-context profile.
For daily coding or agent use, speed64k_agent is more practical because it keeps prefix caching and tool-call parsing enabled.
I did not enable native 262K context. The 192K run reported about 245K GPU KV cache capacity, below the modelβs native 262,144-token target, so 262K looked too speculative and OOM-prone for this setup.
Main caveat: these numbers are specific to this exact hardware, driver, Docker/vLLM image, and profile configuration. Future vLLM nightlies may change memory behavior.
Hi, could you paste your command, I use the following in "docker compose.yml", error information "[out of memory](RuntimeError: CUDA driver error: out of memory)" οΌmy GPU is 5090D 32G
services:
vllm:
image: vllm/vllm-openai:cu130-nightly
container_name: qwen3.6-35b-nvfp4
runtime: nvidia
ports:
- "8000:8000"
ipc: host
shm_size: 2g
environment:
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
volumes:
- "E:\vllm\model:/models"
command: >
--model /models/Qwen3.6-35B-A3B-NVFP4
--served-model-name qwen36
--trust-remote-code
--dtype auto
--kv-cache-dtype fp8
--max-model-len 16384
--gpu-memory-utilization 0.75
--max-num-seqs 4
--max-num-batched-tokens 4096
--enable-chunked-prefill
--enforce-eager
--attention-backend FLASHINFER
--disable-custom-all-reduce
--no-enable-prefix-caching
--language-model-only
--reasoning-parser qwen3
Also got the 5090 running well this evening β real work I threw at it averaged ~650 tokens/sec. Tuned it for batching as follows:
Great writeup @Mosai-Sys β thanks for posting this. This setup is tuned the same GPU for concurrency/throughput, running it as a native venv rather than the Docker nightly. Adding my numbers + config, plus a couple of things that should help @qujian78 's OOM.
Concurrency-tuned profile (single RTX 5090, 32 GB)
Real workloads averaged 650 tok/s. Under a synthetic load test, vLLM's own server-side counters reported:
- Generation: peak 788.1 tok/s (other 10s windows: 767.2 / 762.8 / 716.0 / 700.4 tok/s), sustained across 32β90 concurrent requests
- Prefill running simultaneously: 1,000β1,600 tok/s (peak window 1,590 tok/s); peak combined prefill+gen β 2,165 tok/s
- KV pressure drove to 99.2% at 90 running + 6 queued before backpressure kicked in
- KV cache: 681,574 tokens (7.29 GiB, fp8) β 10.40Γ concurrency at 65,536-token context
- Single-stream decode: 256 tokens in 19.9 s = 12.9 tok/s (eager-mode floor β see note)
- Weights resident: 20.2 GiB; engine init (profile + KV + warmup) 17.4 s
@qujian78 β on the OOM
A few things, one counter-intuitive:
- Your
--gpu-memory-utilization 0.75is probably too low, not too high. With ~20 GB of weights, the non-KV peak (weights + Marlin/FP8 workspace + graph capture) is roughly fixed at ~29 GB. A smaller util budget leaves negative room for KV during profiling, so it OOMs. Counter-intuitively, raise util to ~0.90β0.92 and shrink the fixed peak instead (next point). On my box the same KV-profiling allocation (1.02 GiB) failed at util 0.80β0.90 and only fit once I dropped the fixed peak. - Drop the fixed peak with
--enforce-eager(skips the CUDA-graph capture pools β the single biggest chunk that won't fit on 32 GB) and a small--max-num-batched-tokens(Mosai-Sys uses384; I use2048). That shrinks the profiling-forward activation. - Borrow Mosai-Sys's memory-savers:
--skip-mm-profiling --no-calculate-kv-scales --async-scheduling. --disable-custom-all-reduceis a no-op on a single GPU β harmless but unnecessary.
Native-venv config (for anyone not using the Docker nightly)
vLLM 0.22.0 (stable), torch 2.11.0+cu130, flashinfer 0.6.11.post2, dedicated venv. Serves a 64K-context, batching-tuned endpoint:
# CU13 = <venv>/lib/python3.12/site-packages/nvidia/cu13 (the bundled CUDA 13.3 toolkit)
export CUDA_HOME=$CU13
export FLASHINFER_NVCC="$CUDA_HOME/bin/nvcc"
export FLASHINFER_CUDA_ARCH_LIST="12.0f" # sm_120 family ('0a' is 12.x minor!=0 / Spark)
export NVCC_PREPEND_FLAGS="-DCCCL_DISABLE_CTK_COMPATIBILITY_CHECK"
export LIBRARY_PATH="$CUDA_HOME/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib:$LD_LIBRARY_PATH"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_USE_FLASHINFER_MOE_FP4=0 # MoE -> Marlin (correct for W4A16)
export VLLM_USE_FLASHINFER_SAMPLER=0
# one-time: the cu13 wheel ships libcudart.so.13 with no dev symlink
cd "$CUDA_HOME/lib" && for f in *.so.*; do b=$(echo "$f" | sed -E 's/\.so\..*/.so/'); [ -e "$b" ] || ln -s "$f" "$b"; done
vllm serve <model> \
--served-model-name Qwen3.6-35B-A3B-NVFP4 --port 8000 \
--language-model-only --quantization modelopt_mixed \
--attention-backend TRITON_ATTN \
--kv-cache-dtype fp8 --enforce-eager \
--gpu-memory-utilization 0.92 \
--max-model-len 65536 --max-num-batched-tokens 2048 \
--reasoning-parser qwen3 --trust-remote-code
Two native-install gotchas (the Docker cu130-nightly image probably avoids both)
If you go the native venv (non-Docker) route on stable vLLM, these bit me β and since @Mosai-Sys mounts the raw model in the cu130-nightly container without hitting either, the nightly likely already handles them:
- flashinfer JIT on sm_120. Stock, it either can't find a Blackwell arch (
No supported CUDA architectures found for major versions [12]) or shells out to a system nvcc < 12.9 (nvcc fatal: Unsupported gpu architecture 'compute_120f'). Fix = theFLASHINFER_CUDA_ARCH_LIST="12.0f"+CUDA_HOMEβcu13 +libcudart.sosymlink steps above. (Or just use the cu130-nightly Docker image, which bundles a consistent toolchain.) - Quantized
lm_head. On vLLM 0.22.0 stable, this checkpoint fails to load withValueError: no parameter named 'lm_head.input_scale'β vLLM'sQwen3_5buildslm_headas an unquantizedParallelLMHeadand rejects the NVFP4 head's scale tensors. My workaround was to dequantize justlm_headβ bf16 offline (vLLM's ownbreak_fp4_bytes+ the on-disk group-16 block scales) and drop it fromquantized_layers; all MoE/attn layers stay native NVFP4/FP8. @Mosai-Sys β did you hit this on the cu130-nightly, or does it load the head fine? If nightly handles it, the fix has already landed upstream and the stable build just needs to catch up.
Update: this is tracked upstream as vLLM #40999 β Qwen3.5 builds lm_head as a ParallelLMHead without passing quant_config, so the quantized head is never wired up. So it's a known vLLM gap, not checkpoint-specific
Thanks, this is very useful data. Your profile is clearly optimized for concurrency/throughput, while mine was optimized for maximum single-sequence context length, so the numbers are not directly comparable.
For the lm_head issue: yes, I hit the same class of problem. In my reproducible Docker path I did not dequantize lm_head; instead I apply a narrow runtime patch inside the container before vllm serve starts. The patch does three things:
- normalizes the ModelOpt NVFP4 algorithm names,
- passes
quant_configinto QwenParallelLMHead, - loads the packed NVFP4
lm_headtensors with a linear-style loader rather than the embedding loader.
So if the question is whether stock vLLM handles the quantized head cleanly out of the box: in my current path, no. I still patch it. The model weights stay unchanged; the patch is applied only to the ephemeral container filesystem.
Also agree on the OOM point: just lowering --gpu-memory-utilization is not always safer. If the fixed peak is too high, lowering the KV budget can still fail during profiling. For people trying to boot first, I would reduce fixed pressure before chasing context:
--max-num-seqs 1
--max-num-batched-tokens 2048 or lower
--kv-cache-dtype fp8
--quantization modelopt
--language-model-only
--skip-mm-profiling
--no-calculate-kv-scales
For my high-context profile I do not tune for throughput. Current 192K profile is more conservative after feedback, leaving some VRAM headroom:
--max-model-len 196608
--max-num-seqs 1
--max-num-batched-tokens 384
--gpu-memory-utilization 0.93
--kv-cache-dtype fp8
--quantization modelopt
--async-scheduling
--enable-chunked-prefill
--language-model-only
--skip-mm-profiling
--no-enable-prefix-caching
--no-calculate-kv-scales
--max-cudagraph-capture-size 64
--attention-backend flashinfer
--moe-backend marlin
For @qujian78 I would first validate a single-sequence 32K boot profile, then step up. I would not start with 4 sequences and 4096 batched tokens on a 32 GB card until the basic boot path is stable.
With llama-server I get 130 t/s on simple GGUF, would this model give me more speed with vllm?
It depends what you mean by speed.
For single-user, single-stream decode, I would not expect a huge automatic win over a well-tuned GGUF/llama.cpp setup. On my setup, interactive decode is usually in the same practical range, roughly around 100-170 tok/s depending on prompt, context length, settings, and measurement method. If you already get 130 tok/s in llama-server, vLLM may not feel dramatically faster for one simple stream.
Where vLLM helps more is:
- batching / concurrent users
- OpenAI-compatible serving
- long-context scheduling
- FP8 KV cache
- higher aggregate throughput
- production-style serving controls
So for one local chat session, llama.cpp/GGUF can be very competitive. For serving multiple requests or doing long-context workloads, vLLM is the more interesting path.
Also note that this is not the same model format: this is the NVIDIA NVFP4 / ModelOpt checkpoint running through vLLM, not a GGUF conversion. My setup is optimized mainly for long context on a single RTX 5090, not maximum single-stream tok/s.