Far different performance between Qwen3-4B and Qwen3-Embedding-4B

#26

by juliasi - opened Apr 27

Context

We've been benchmarking Qwen3-Embedding-4B on vLLM v0.19.0 with genai-bench, and the results are perplexing and disappointing. We also benchmarked a similarly sized text generation model (Qwen3-4B), and with all other settings equal, the text generation model performs much better than the embedding model. No matter which parameters we adjust, the embedding model runs show the following:

latency seems higher than it should be
throughput is relatively low
requests per second actually drop as concurrency rises
we never get anywhere close to saturating the GPU

We greatly appreciate any guidance you all might be able to provide!

Questions

Why is the GPU not saturated for the embedding model benchmark?
Why does RPS drop as concurrency rises, and only for the embedding model run?
Is vLLM well suited in the first place for embedding models like Qwen3-Embedding-4B?
What might be the most effective adjustments we can make to get better throughput, especially as concurrency rises?

Benchmark Metadata

Below are the metadata for both the text and the embedding model benchmark runs.

Qwen3-4B on g7e.2xlarge with D(8,50) (vLLM v0.19.0)

vLLM Params

"Qwen/Qwen3-4B",
    --served-model-name, "Qwen/Qwen3-4B", "Qwen-Qwen3-4B",
    --uvicorn-log-level, "warning",
    --gpu-memory-utilization, "0.95",
    --quantization, "fp8",
    --max-num-batched-tokens, "4096",
    --max-num-seqs, "128",
    --performance-mode, "interactivity"

genai-bench Command

genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-4B --task text-to-text --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-4B --traffic-scenario "D(8,50)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2

E2E Latencies

P50 – 348.3 milliseconds
P90 – 363.8 milliseconds
P99 – 404.7 milliseconds

Token Throughput per Sec

Input - 2793.7
Output - 8692.0

RPS - 173.8
GPU Utilization - 100%

Qwen3-Embedding-4B on g7e.2xlarge with E(8) (vLLM v0.19.0)

vLLM Params

"Qwen/Qwen3-Embedding-4B",
    --served-model-name, "Qwen/Qwen3-Embedding-4B", "Qwen-Qwen3-Embedding-4B",
    --uvicorn-log-level, "warning",
    --gpu-memory-utilization, "0.95",
    --quantization, "fp8",
    --max-num-batched-tokens, "4096",
    --max-num-seqs, "128",
    --performance-mode, "interactivity"

genai-bench Command

genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-Embedding-4B --task text-to-embeddings --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-Embedding-4B --traffic-scenario "E(8)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2

E2E Latencies

P50 – 98.1 milliseconds
P90 – 107.6 milliseconds
P99 – 126.9 milliseconds

Token Throughput per Sec

Input - 1858.9
Output - N/A

RPS - 8.5
GPU Utilization - 19%
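
Sanity check on the numbers

As a rough cross-check of the figures above, the sketch below applies Little's Law (requests in flight ≈ RPS × latency) and divides input token throughput by RPS to estimate input tokens per request. Every input is copied from the two runs in this post. The assumptions are ours: that each run was in steady state, that P50 latency is a fair stand-in for mean latency, and that all figures for a run come from the same concurrency level.

```python
# Rough sanity check on the reported figures (inputs copied from the runs above).
# Assumption: steady state, so Little's Law holds: in_flight = RPS * mean latency.
# Assumption: P50 latency approximates the mean latency.

runs = {
    "Qwen3-4B D(8,50)": {"rps": 173.8, "p50_ms": 348.3, "input_tok_s": 2793.7},
    "Qwen3-Embedding-4B E(8)": {"rps": 8.5, "p50_ms": 98.1, "input_tok_s": 1858.9},
}


def summarize(run):
    """Return (estimated requests in flight, estimated input tokens per request)."""
    in_flight = run["rps"] * run["p50_ms"] / 1000.0
    tokens_per_request = run["input_tok_s"] / run["rps"]
    return in_flight, tokens_per_request


results = {name: summarize(run) for name, run in runs.items()}

for name, (in_flight, tokens) in results.items():
    print(f"{name}: ~{in_flight:.1f} requests in flight, ~{tokens:.0f} input tokens/request")
```

This gives roughly 60.5 requests in flight and 16 input tokens per request for Qwen3-4B, versus roughly 0.8 requests in flight and 219 input tokens per request for Qwen3-Embedding-4B. If the assumptions hold, the embedding run averaged less than one request in flight, and each embedding request carried far more than 8 input tokens. Both are inferences from the reported numbers, not something we have verified on the server side.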
