NVIDIA GeForce RTX 5090

#2
by Mosai-Sys - opened

I wanted to share my local setup and benchmark results for nvidia/Qwen3.6-35B-A3B-NVFP4, since it shows what seems possible on a single consumer Blackwell GPU when the runtime is tuned for maximum context rather than throughput.

192K profile is roughly 100-170+ tok/s, with recent logs averaging about 145 tok/s.

Hardware

  • GPU: NVIDIA GeForce RTX 5090, 32 GB VRAM
  • Runtime: Docker Desktop with WSL2 backend
  • NVIDIA driver: 610.47
  • Docker engine: 29.4.3

Runtime setup

  • vLLM image: vllm/vllm-openai:cu130-nightly
  • vLLM version: 0.19.2rc1.dev134+gfe9c3d6c5.cu130
  • Model: nvidia/Qwen3.6-35B-A3B-NVFP4
  • Served model name: qwen36
  • Quantization: modelopt
  • KV cache: fp8
  • Tensor parallel size: 1
  • CPU offload: not used

For the max-context profiles I used:

--max-num-seqs 1
--kv-cache-dtype fp8
--async-scheduling
--enable-chunked-prefill
--language-model-only
--skip-mm-profiling
--no-enable-prefix-caching
--no-calculate-kv-scales

The highest stable no-offload profile I measured was:

--max-model-len 196608
--max-num-batched-tokens 384
--gpu-memory-utilization 0.94

Results

Profile Max model len Long recall prompt tokens Long recall latency, prefill + decode Peak VRAM MiB Result
boot32k 32,768 28,447 2.268 s 28,763 Pass
speed64k_agent 65,536 56,867 4.801 s 30,745 Pass
maxctx128k_text 131,072 113,663 17.602 s 31,109 Pass
maxctx144k_text_probe 147,456 127,894 26.041 s 31,785 Pass
maxctx160k_text_probe 163,840 142,083 25.721 s 32,018 Pass
maxctx192k_text_probe 196,608 170,518 40.194 s 31,958 Pass

Each passing profile validated:

  • vLLM startup
  • /v1/models
  • chat completion
  • long-context marker recall
  • VRAM sampling
  • basic instruction-following and JSON output quality

Practical takeaway

On this machine, maxctx192k_text_probe was the highest stable no-offload context profile I measured. It is close to the 32 GB VRAM ceiling, so I would treat it as a dedicated-GPU max-context profile.

For daily coding or agent use, speed64k_agent is more practical because it keeps prefix caching and tool-call parsing enabled.

I did not enable native 262K context. The 192K run reported about 245K GPU KV cache capacity, below the model’s native 262,144-token target, so 262K looked too speculative and OOM-prone for this setup.

Main caveat: these numbers are specific to this exact hardware, driver, Docker/vLLM image, and profile configuration. Future vLLM nightlies may change memory behavior.

Hi, could you paste your command, I use the following in "docker compose.yml", error information "[out of memory](RuntimeError: CUDA driver error: out of memory)" ,my GPU is 5090D 32G
services:
vllm:
image: vllm/vllm-openai:cu130-nightly
container_name: qwen3.6-35b-nvfp4
runtime: nvidia
ports:
- "8000:8000"
ipc: host
shm_size: 2g
environment:
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
volumes:
- "E:\vllm\model:/models"
command: >
--model /models/Qwen3.6-35B-A3B-NVFP4
--served-model-name qwen36
--trust-remote-code
--dtype auto
--kv-cache-dtype fp8
--max-model-len 16384
--gpu-memory-utilization 0.75
--max-num-seqs 4
--max-num-batched-tokens 4096
--enable-chunked-prefill
--enforce-eager
--attention-backend FLASHINFER
--disable-custom-all-reduce
--no-enable-prefix-caching
--language-model-only
--reasoning-parser qwen3

Also got the 5090 running well this evening β€” real work I threw at it averaged ~650 tokens/sec. Tuned it for batching as follows:

Great writeup @Mosai-Sys β€” thanks for posting this. This setup is tuned the same GPU for concurrency/throughput, running it as a native venv rather than the Docker nightly. Adding my numbers + config, plus a couple of things that should help @qujian78 's OOM.

Concurrency-tuned profile (single RTX 5090, 32 GB)

Real workloads averaged 650 tok/s. Under a synthetic load test, vLLM's own server-side counters reported:

  • Generation: peak 788.1 tok/s (other 10s windows: 767.2 / 762.8 / 716.0 / 700.4 tok/s), sustained across 32–90 concurrent requests
  • Prefill running simultaneously: 1,000–1,600 tok/s (peak window 1,590 tok/s); peak combined prefill+gen β‰ˆ 2,165 tok/s
  • KV pressure drove to 99.2% at 90 running + 6 queued before backpressure kicked in
  • KV cache: 681,574 tokens (7.29 GiB, fp8) β†’ 10.40Γ— concurrency at 65,536-token context
  • Single-stream decode: 256 tokens in 19.9 s = 12.9 tok/s (eager-mode floor β€” see note)
  • Weights resident: 20.2 GiB; engine init (profile + KV + warmup) 17.4 s

@qujian78 β€” on the OOM

A few things, one counter-intuitive:

  1. Your --gpu-memory-utilization 0.75 is probably too low, not too high. With ~20 GB of weights, the non-KV peak (weights + Marlin/FP8 workspace + graph capture) is roughly fixed at ~29 GB. A smaller util budget leaves negative room for KV during profiling, so it OOMs. Counter-intuitively, raise util to ~0.90–0.92 and shrink the fixed peak instead (next point). On my box the same KV-profiling allocation (1.02 GiB) failed at util 0.80–0.90 and only fit once I dropped the fixed peak.
  2. Drop the fixed peak with --enforce-eager (skips the CUDA-graph capture pools β€” the single biggest chunk that won't fit on 32 GB) and a small --max-num-batched-tokens (Mosai-Sys uses 384; I use 2048). That shrinks the profiling-forward activation.
  3. Borrow Mosai-Sys's memory-savers: --skip-mm-profiling --no-calculate-kv-scales --async-scheduling.
  4. --disable-custom-all-reduce is a no-op on a single GPU β€” harmless but unnecessary.

Native-venv config (for anyone not using the Docker nightly)

vLLM 0.22.0 (stable), torch 2.11.0+cu130, flashinfer 0.6.11.post2, dedicated venv. Serves a 64K-context, batching-tuned endpoint:

# CU13 = <venv>/lib/python3.12/site-packages/nvidia/cu13   (the bundled CUDA 13.3 toolkit)
export CUDA_HOME=$CU13
export FLASHINFER_NVCC="$CUDA_HOME/bin/nvcc"
export FLASHINFER_CUDA_ARCH_LIST="12.0f"        # sm_120 family ('0a' is 12.x minor!=0 / Spark)
export NVCC_PREPEND_FLAGS="-DCCCL_DISABLE_CTK_COMPATIBILITY_CHECK"
export LIBRARY_PATH="$CUDA_HOME/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib:$LD_LIBRARY_PATH"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_USE_FLASHINFER_MOE_FP4=0            # MoE -> Marlin (correct for W4A16)
export VLLM_USE_FLASHINFER_SAMPLER=0

# one-time: the cu13 wheel ships libcudart.so.13 with no dev symlink
cd "$CUDA_HOME/lib" && for f in *.so.*; do b=$(echo "$f" | sed -E 's/\.so\..*/.so/'); [ -e "$b" ] || ln -s "$f" "$b"; done

vllm serve <model> \
  --served-model-name Qwen3.6-35B-A3B-NVFP4 --port 8000 \
  --language-model-only --quantization modelopt_mixed \
  --attention-backend TRITON_ATTN \
  --kv-cache-dtype fp8 --enforce-eager \
  --gpu-memory-utilization 0.92 \
  --max-model-len 65536 --max-num-batched-tokens 2048 \
  --reasoning-parser qwen3 --trust-remote-code

Two native-install gotchas (the Docker cu130-nightly image probably avoids both)

If you go the native venv (non-Docker) route on stable vLLM, these bit me β€” and since @Mosai-Sys mounts the raw model in the cu130-nightly container without hitting either, the nightly likely already handles them:

  1. flashinfer JIT on sm_120. Stock, it either can't find a Blackwell arch (No supported CUDA architectures found for major versions [12]) or shells out to a system nvcc < 12.9 (nvcc fatal: Unsupported gpu architecture 'compute_120f'). Fix = the FLASHINFER_CUDA_ARCH_LIST="12.0f" + CUDA_HOME→cu13 + libcudart.so symlink steps above. (Or just use the cu130-nightly Docker image, which bundles a consistent toolchain.)
  2. Quantized lm_head. On vLLM 0.22.0 stable, this checkpoint fails to load with ValueError: no parameter named 'lm_head.input_scale' β€” vLLM's Qwen3_5 builds lm_head as an unquantized ParallelLMHead and rejects the NVFP4 head's scale tensors. My workaround was to dequantize just lm_head β†’ bf16 offline (vLLM's own break_fp4_bytes + the on-disk group-16 block scales) and drop it from quantized_layers; all MoE/attn layers stay native NVFP4/FP8. @Mosai-Sys β€” did you hit this on the cu130-nightly, or does it load the head fine? If nightly handles it, the fix has already landed upstream and the stable build just needs to catch up.
    Update: this is tracked upstream as vLLM #40999 β€” Qwen3.5 builds lm_head as a ParallelLMHead without passing quant_config, so the quantized head is never wired up. So it's a known vLLM gap, not checkpoint-specific

Thanks, this is very useful data. Your profile is clearly optimized for concurrency/throughput, while mine was optimized for maximum single-sequence context length, so the numbers are not directly comparable.

For the lm_head issue: yes, I hit the same class of problem. In my reproducible Docker path I did not dequantize lm_head; instead I apply a narrow runtime patch inside the container before vllm serve starts. The patch does three things:

  • normalizes the ModelOpt NVFP4 algorithm names,
  • passes quant_config into Qwen ParallelLMHead,
  • loads the packed NVFP4 lm_head tensors with a linear-style loader rather than the embedding loader.

So if the question is whether stock vLLM handles the quantized head cleanly out of the box: in my current path, no. I still patch it. The model weights stay unchanged; the patch is applied only to the ephemeral container filesystem.

Also agree on the OOM point: just lowering --gpu-memory-utilization is not always safer. If the fixed peak is too high, lowering the KV budget can still fail during profiling. For people trying to boot first, I would reduce fixed pressure before chasing context:

--max-num-seqs 1
--max-num-batched-tokens 2048 or lower
--kv-cache-dtype fp8
--quantization modelopt
--language-model-only
--skip-mm-profiling
--no-calculate-kv-scales

For my high-context profile I do not tune for throughput. Current 192K profile is more conservative after feedback, leaving some VRAM headroom:

--max-model-len 196608
--max-num-seqs 1
--max-num-batched-tokens 384
--gpu-memory-utilization 0.93
--kv-cache-dtype fp8
--quantization modelopt
--async-scheduling
--enable-chunked-prefill
--language-model-only
--skip-mm-profiling
--no-enable-prefix-caching
--no-calculate-kv-scales
--max-cudagraph-capture-size 64
--attention-backend flashinfer
--moe-backend marlin

For @qujian78 I would first validate a single-sequence 32K boot profile, then step up. I would not start with 4 sequences and 4096 batched tokens on a 32 GB card until the basic boot path is stable.

With llama-server I get 130 t/s on simple GGUF, would this model give me more speed with vllm?

It depends what you mean by speed.

For single-user, single-stream decode, I would not expect a huge automatic win over a well-tuned GGUF/llama.cpp setup. On my setup, interactive decode is usually in the same practical range, roughly around 100-170 tok/s depending on prompt, context length, settings, and measurement method. If you already get 130 tok/s in llama-server, vLLM may not feel dramatically faster for one simple stream.

Where vLLM helps more is:

  • batching / concurrent users
  • OpenAI-compatible serving
  • long-context scheduling
  • FP8 KV cache
  • higher aggregate throughput
  • production-style serving controls

So for one local chat session, llama.cpp/GGUF can be very competitive. For serving multiple requests or doing long-context workloads, vLLM is the more interesting path.

Also note that this is not the same model format: this is the NVIDIA NVFP4 / ModelOpt checkpoint running through vLLM, not a GGUF conversion. My setup is optimized mainly for long context on a single RTX 5090, not maximum single-stream tok/s.

Sign up or log in to comment