NVIDIA GeForce RTX 5090

by Mosai-Sys - opened 4 days ago

I wanted to share my local setup and benchmark results for nvidia/Qwen3.6-35B-A3B-NVFP4, since it shows what seems possible on a single consumer Blackwell GPU when the runtime is tuned for maximum context rather than throughput.

192K profile is roughly 100-170+ tok/s, with recent logs averaging about 145 tok/s.

Hardware

GPU: NVIDIA GeForce RTX 5090, 32 GB VRAM
Runtime: Docker Desktop with WSL2 backend
NVIDIA driver: 610.47
Docker engine: 29.4.3

Runtime setup

vLLM image: vllm/vllm-openai:cu130-nightly
vLLM version: 0.19.2rc1.dev134+gfe9c3d6c5.cu130
Model: nvidia/Qwen3.6-35B-A3B-NVFP4
Served model name: qwen36
Quantization: modelopt
KV cache: fp8
Tensor parallel size: 1
CPU offload: not used

For the max-context profiles I used:

--max-num-seqs 1
--kv-cache-dtype fp8
--async-scheduling
--enable-chunked-prefill
--language-model-only
--skip-mm-profiling
--no-enable-prefix-caching
--no-calculate-kv-scales

The highest stable no-offload profile I measured was:

--max-model-len 196608
--max-num-batched-tokens 384
--gpu-memory-utilization 0.94

Results

Profile	Max model len	Long recall prompt tokens	Long recall latency, prefill + decode	Peak VRAM MiB	Result
`boot32k`	32,768	28,447	2.268 s	28,763	Pass
`speed64k_agent`	65,536	56,867	4.801 s	30,745	Pass
`maxctx128k_text`	131,072	113,663	17.602 s	31,109	Pass
`maxctx144k_text_probe`	147,456	127,894	26.041 s	31,785	Pass
`maxctx160k_text_probe`	163,840	142,083	25.721 s	32,018	Pass
`maxctx192k_text_probe`	196,608	170,518	40.194 s	31,958	Pass

Each passing profile validated:

vLLM startup
/v1/models
chat completion
long-context marker recall
VRAM sampling
basic instruction-following and JSON output quality

Practical takeaway

On this machine, maxctx192k_text_probe was the highest stable no-offload context profile I measured. It is close to the 32 GB VRAM ceiling, so I would treat it as a dedicated-GPU max-context profile.

For daily coding or agent use, speed64k_agent is more practical because it keeps prefix caching and tool-call parsing enabled.

I did not enable native 262K context. The 192K run reported about 245K GPU KV cache capacity, below the model’s native 262,144-token target, so 262K looked too speculative and OOM-prone for this setup.

Main caveat: these numbers are specific to this exact hardware, driver, Docker/vLLM image, and profile configuration. Future vLLM nightlies may change memory behavior.

qujian78

3 days ago

Hi, could you paste your command, I use the following in "docker compose.yml", error information "[out of memory](RuntimeError: CUDA driver error: out of memory)" ，my GPU is 5090D 32G
services:
vllm:
image: vllm/vllm-openai:cu130-nightly
container_name: qwen3.6-35b-nvfp4
runtime: nvidia
ports:
- "8000:8000"
ipc: host
shm_size: 2g
environment:
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
volumes:
- "E:\vllm\model:/models"
command: >
--model /models/Qwen3.6-35B-A3B-NVFP4
--served-model-name qwen36
--trust-remote-code
--dtype auto
--kv-cache-dtype fp8
--max-model-len 16384
--gpu-memory-utilization 0.75
--max-num-seqs 4
--max-num-batched-tokens 4096
--enable-chunked-prefill
--enforce-eager
--attention-backend FLASHINFER
--disable-custom-all-reduce
--no-enable-prefix-caching
--language-model-only
--reasoning-parser qwen3

whistlercapital

3 days ago

•

edited 3 days ago

Also got the 5090 running well this evening — real work I threw at it averaged ~650 tokens/sec. Tuned it for batching as follows:

Great writeup @Mosai-Sys — thanks for posting this. This setup is tuned the same GPU for concurrency/throughput, running it as a native venv rather than the Docker nightly. Adding my numbers + config, plus a couple of things that should help @qujian78 's OOM.

Concurrency-tuned profile (single RTX 5090, 32 GB)

Real workloads averaged 650 tok/s. Under a synthetic load test, vLLM's own server-side counters reported:

Generation: peak 788.1 tok/s (other 10s windows: 767.2 / 762.8 / 716.0 / 700.4 tok/s), sustained across 32–90 concurrent requests
Prefill running simultaneously: 1,000–1,600 tok/s (peak window 1,590 tok/s); peak combined prefill+gen ≈ 2,165 tok/s
KV pressure drove to 99.2% at 90 running + 6 queued before backpressure kicked in
KV cache: 681,574 tokens (7.29 GiB, fp8) → 10.40× concurrency at 65,536-token context
Single-stream decode: 256 tokens in 19.9 s = 12.9 tok/s (eager-mode floor — see note)
Weights resident: 20.2 GiB; engine init (profile + KV + warmup) 17.4 s

@qujian78 — on the OOM

A few things, one counter-intuitive:

Your --gpu-memory-utilization 0.75 is probably too low, not too high. With ~20 GB of weights, the non-KV peak (weights + Marlin/FP8 workspace + graph capture) is roughly fixed at ~29 GB. A smaller util budget leaves negative room for KV during profiling, so it OOMs. Counter-intuitively, raise util to ~0.90–0.92 and shrink the fixed peak instead (next point). On my box the same KV-profiling allocation (1.02 GiB) failed at util 0.80–0.90 and only fit once I dropped the fixed peak.
Drop the fixed peak with --enforce-eager (skips the CUDA-graph capture pools — the single biggest chunk that won't fit on 32 GB) and a small --max-num-batched-tokens (Mosai-Sys uses 384; I use 2048). That shrinks the profiling-forward activation.
Borrow Mosai-Sys's memory-savers: --skip-mm-profiling --no-calculate-kv-scales --async-scheduling.
--disable-custom-all-reduce is a no-op on a single GPU — harmless but unnecessary.

Native-venv config (for anyone not using the Docker nightly)

vLLM 0.22.0 (stable), torch 2.11.0+cu130, flashinfer 0.6.11.post2, dedicated venv. Serves a 64K-context, batching-tuned endpoint:

# CU13 = <venv>/lib/python3.12/site-packages/nvidia/cu13   (the bundled CUDA 13.3 toolkit)
export CUDA_HOME=$CU13
export FLASHINFER_NVCC="$CUDA_HOME/bin/nvcc"
export FLASHINFER_CUDA_ARCH_LIST="12.0f"        # sm_120 family ('0a' is 12.x minor!=0 / Spark)
export NVCC_PREPEND_FLAGS="-DCCCL_DISABLE_CTK_COMPATIBILITY_CHECK"
export LIBRARY_PATH="$CUDA_HOME/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib:$LD_LIBRARY_PATH"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_USE_FLASHINFER_MOE_FP4=0            # MoE -> Marlin (correct for W4A16)
export VLLM_USE_FLASHINFER_SAMPLER=0

# one-time: the cu13 wheel ships libcudart.so.13 with no dev symlink
cd "$CUDA_HOME/lib" && for f in *.so.*; do b=$(echo "$f" | sed -E 's/\.so\..*/.so/'); [ -e "$b" ] || ln -s "$f" "$b"; done

vllm serve <model> \
  --served-model-name Qwen3.6-35B-A3B-NVFP4 --port 8000 \
  --language-model-only --quantization modelopt_mixed \
  --attention-backend TRITON_ATTN \
  --kv-cache-dtype fp8 --enforce-eager \
  --gpu-memory-utilization 0.92 \
  --max-model-len 65536 --max-num-batched-tokens 2048 \
  --reasoning-parser qwen3 --trust-remote-code

Two native-install gotchas (the Docker cu130-nightly image probably avoids both)

If you go the native venv (non-Docker) route on stable vLLM, these bit me — and since @Mosai-Sys mounts the raw model in the cu130-nightly container without hitting either, the nightly likely already handles them:

flashinfer JIT on sm_120. Stock, it either can't find a Blackwell arch (No supported CUDA architectures found for major versions [12]) or shells out to a system nvcc < 12.9 (nvcc fatal: Unsupported gpu architecture 'compute_120f'). Fix = the FLASHINFER_CUDA_ARCH_LIST="12.0f" + CUDA_HOME→cu13 + libcudart.so symlink steps above. (Or just use the cu130-nightly Docker image, which bundles a consistent toolchain.)
Quantized lm_head. On vLLM 0.22.0 stable, this checkpoint fails to load with ValueError: no parameter named 'lm_head.input_scale' — vLLM's Qwen3_5 builds lm_head as an unquantized ParallelLMHead and rejects the NVFP4 head's scale tensors. My workaround was to dequantize just lm_head → bf16 offline (vLLM's own break_fp4_bytes + the on-disk group-16 block scales) and drop it from quantized_layers; all MoE/attn layers stay native NVFP4/FP8. @Mosai-Sys — did you hit this on the cu130-nightly, or does it load the head fine? If nightly handles it, the fix has already landed upstream and the stable build just needs to catch up.
Update: this is tracked upstream as vLLM #40999 — Qwen3.5 builds lm_head as a ParallelLMHead without passing quant_config, so the quantized head is never wired up. So it's a known vLLM gap, not checkpoint-specific

Mosai-Sys

3 days ago

Thanks, this is very useful data. Your profile is clearly optimized for concurrency/throughput, while mine was optimized for maximum single-sequence context length, so the numbers are not directly comparable.

For the lm_head issue: yes, I hit the same class of problem. In my reproducible Docker path I did not dequantize lm_head; instead I apply a narrow runtime patch inside the container before vllm serve starts. The patch does three things:

normalizes the ModelOpt NVFP4 algorithm names,
passes quant_config into Qwen ParallelLMHead,
loads the packed NVFP4 lm_head tensors with a linear-style loader rather than the embedding loader.

So if the question is whether stock vLLM handles the quantized head cleanly out of the box: in my current path, no. I still patch it. The model weights stay unchanged; the patch is applied only to the ephemeral container filesystem.

Also agree on the OOM point: just lowering --gpu-memory-utilization is not always safer. If the fixed peak is too high, lowering the KV budget can still fail during profiling. For people trying to boot first, I would reduce fixed pressure before chasing context:

--max-num-seqs 1
--max-num-batched-tokens 2048 or lower
--kv-cache-dtype fp8
--quantization modelopt
--language-model-only
--skip-mm-profiling
--no-calculate-kv-scales

For my high-context profile I do not tune for throughput. Current 192K profile is more conservative after feedback, leaving some VRAM headroom:

--max-model-len 196608
--max-num-seqs 1
--max-num-batched-tokens 384
--gpu-memory-utilization 0.93
--kv-cache-dtype fp8
--quantization modelopt
--async-scheduling
--enable-chunked-prefill
--language-model-only
--skip-mm-profiling
--no-enable-prefix-caching
--no-calculate-kv-scales
--max-cudagraph-capture-size 64
--attention-backend flashinfer
--moe-backend marlin

For @qujian78 I would first validate a single-sequence 32K boot profile, then step up. I would not start with 4 sequences and 4096 batched tokens on a 32 GB card until the basic boot path is stable.

JLouisBiz

3 days ago

With llama-server I get 130 t/s on simple GGUF, would this model give me more speed with vllm?

Mosai-Sys

3 days ago

It depends what you mean by speed.

For single-user, single-stream decode, I would not expect a huge automatic win over a well-tuned GGUF/llama.cpp setup. On my setup, interactive decode is usually in the same practical range, roughly around 100-170 tok/s depending on prompt, context length, settings, and measurement method. If you already get 130 tok/s in llama-server, vLLM may not feel dramatically faster for one simple stream.

Where vLLM helps more is:

batching / concurrent users
OpenAI-compatible serving
long-context scheduling
FP8 KV cache
higher aggregate throughput
production-style serving controls

So for one local chat session, llama.cpp/GGUF can be very competitive. For serving multiple requests or doing long-context workloads, vLLM is the more interesting path.

Also note that this is not the same model format: this is the NVIDIA NVFP4 / ModelOpt checkpoint running through vLLM, not a GGUF conversion. My setup is optimized mainly for long context on a single RTX 5090, not maximum single-stream tok/s.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment