Far different performance between Qwen3-4B and Qwen3-Embedding-4B
Context
We’ve been benchmarking Qwen3-Embedding-4B on vLLM v0.19.0 using genai-bench, and the results are perplexing and disappointing. We also benchmarked a similarly sized text generation model (Qwen3-4B), and all other things being equal, the text generation model performs much better than the embedding model. No matter what parameters we adjust, on the embedding model runs:
- latency seems higher than it should be
- throughput is relatively low
- requests per second actually drop as concurrency rises
- we never get anywhere close to saturating the GPU
We greatly appreciate any guidance you all might be able to provide!
Questions
- Why is the GPU not saturated for the embedding model benchmark?
- Why does RPS drop as concurrency rises, and only for the embedding model run?
- Is vLLM well suited in the first place for embedding models like Qwen3-Embedding-4B?
- What might be the most effective adjustments we can make to get better throughput, especially as concurrency rises?
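One way to approach the first two questions is to take the benchmark tool out of the loop and drive the server with a bare-bones client, then see whether RPS still falls as concurrency rises. The following is a minimal sketch under stated assumptions, not a replacement for genai-bench: `BASE_URL` is a hypothetical address, `build_payload`, `post_embeddings`, and `measure_rps` are illustrative names, and the request shape assumes the OpenAI-compatible `/v1/embeddings` route that vLLM serves. No `Authorization` header is sent, which assumes the server was started without an API key.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"   # hypothetical; point this at your vLLM server
MODEL = "Qwen-Qwen3-Embedding-4B"    # served model name from the post


def build_payload(texts, model=MODEL):
    """Encode an OpenAI-style embeddings request body as UTF-8 JSON bytes."""
    return json.dumps({"model": model, "input": texts}).encode("utf-8")


def post_embeddings(texts, base_url=BASE_URL, model=MODEL, timeout=30):
    """Send one embeddings request and return how many vectors came back."""
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=build_payload(texts, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return len(json.load(resp)["data"])


def measure_rps(send, texts, concurrency, total_requests):
    """Call send(texts) total_requests times from `concurrency` threads.

    Returns completed requests per second of wall-clock time.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces every request to finish before the clock stops.
        list(pool.map(lambda _: send(texts), range(total_requests)))
    elapsed = time.perf_counter() - start
    return total_requests / elapsed
```

Against a live server, calling `measure_rps(post_embeddings, ["short test sentence"], c, 500)` for a few values of `c` (for example 1, 8, 32, 128) gives a rough RPS-versus-concurrency curve. If that curve keeps rising where the genai-bench curve falls, the client side is worth a closer look; if both fall, the server configuration is the more likely place to dig.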
Benchmark Metadata
Below are the metadata for both the text and the embedding model benchmark runs.
Qwen3-4B on g7e.2xlarge with D(8,50) (vLLM v0.19.0)
vLLM Params
"Qwen/Qwen3-4B",
--served-model-name, "Qwen/Qwen3-4B", "Qwen-Qwen3-4B",
--uvicorn-log-level, "warning",
--gpu-memory-utilization, "0.95",
--quantization, "fp8",
--max-num-batched-tokens, "4096",
--max-num-seqs, "128",
--performance-mode, "interactivity"
genai-bench Command
genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-4B --task text-to-text --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-4B --traffic-scenario "D(8,50)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2
E2E Latencies
P50 – 348.3 milliseconds
P90 – 363.8 milliseconds
P99 – 404.7 milliseconds
Token Throughput per Sec
Input - 2793.7
Output - 8692.0
RPS - 173.8
GPU Utilization - 100%
Qwen3-Embedding-4B on g7e.2xlarge with E(8) (vLLM v0.19.0)
vLLM Params
"Qwen/Qwen3-Embedding-4B",
--served-model-name, "Qwen/Qwen3-Embedding-4B", "Qwen-Qwen3-Embedding-4B",
--uvicorn-log-level, "warning",
--gpu-memory-utilization, "0.95",
--quantization, "fp8",
--max-num-batched-tokens, "4096",
--max-num-seqs, "128",
--performance-mode, "interactivity"
genai-bench Command
genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-Embedding-4B --task text-to-embeddings --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-Embedding-4B --traffic-scenario "E(8)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2
E2E Latencies
P50 – 98.1 milliseconds
P90 – 107.6 milliseconds
P99 – 126.9 milliseconds
Token Throughput per Sec
Input - 1858.9
Output - N/A
RPS - 8.5
GPU Utilization - 19%
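A quick consistency check on the two result sets may help narrow things down. The sketch below applies Little's law (average requests in flight = throughput × time in system) to the reported RPS and latency, and divides input token throughput by RPS to get input tokens per request. It uses P50 latency as a stand-in for mean latency, which is an assumption, and the function names are illustrative.

```python
# Figures copied from the two benchmark runs above.
runs = {
    "Qwen3-4B": {"rps": 173.8, "p50_ms": 348.3, "input_tps": 2793.7},
    "Qwen3-Embedding-4B": {"rps": 8.5, "p50_ms": 98.1, "input_tps": 1858.9},
}


def implied_in_flight(rps, latency_ms):
    """Little's law: requests in flight = requests/sec * seconds per request.

    latency_ms is P50 here, used as a proxy for the mean (an assumption).
    """
    return rps * latency_ms / 1000.0


def input_tokens_per_request(input_tps, rps):
    """Average input tokens carried by each request."""
    return input_tps / rps


for name, r in runs.items():
    in_flight = implied_in_flight(r["rps"], r["p50_ms"])
    tokens = input_tokens_per_request(r["input_tps"], r["rps"])
    print(f"{name}: ~{in_flight:.2f} requests in flight, "
          f"~{tokens:.1f} input tokens per request")
```

With the posted numbers this works out to roughly 60.5 requests in flight for Qwen3-4B and roughly 0.83 for Qwen3-Embedding-4B, so the embedding run appears to have averaged less than one concurrent request at the server. That would be consistent with the 19% GPU utilization, and it hints that the limit may sit in front of the server rather than inside it, though this is a back-of-the-envelope reading and not a diagnosis. The embedding run also works out to about 219 input tokens per request, far more than the 8 the E(8) label suggests; one possible explanation is that each request carries a batch of documents, which is worth verifying against the genai-bench settings in use.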