Qwen3-14B NVFP4A16 — vLLM / RTX 5060 Ti
This repository contains a locally quantized Qwen3-14B model using NVFP4A16 / compressed-tensors for vLLM inference on NVIDIA Blackwell GPUs.
All Linear layers (except lm_head) were quantized to NVFP4A16 using llmcompressor + compressed-tensors.
This model warns the 14B model likely requires more VRAM than a single RTX 5060 Ti 16GB can provide at full context. Multi-GPU or reduced context length is recommended.
This model card documents the local quantization and test performed on RTX 5060 Ti 16GB.
Quantization Summary
| Item | Value |
|---|---|
| Base model | Qwen/Qwen3-14B |
| Architecture | Qwen3ForCausalLM |
| Hidden size | 5120 |
| Layers | 40 |
| Attention heads | 40 (KV: 8) |
| Context length | 40,960 |
| Vocab size | 151,936 |
| Quantization format | NVFP4A16 |
| Compressed size | ~9.9 GB |
| Quantization config | compressed-tensors |
Tested Hardware
| Component | Configuration |
|---|---|
| GPU | NVIDIA GeForce RTX 5060 Ti 16 GB |
| CPU | Intel Xeon E5-2680 v4 |
| System RAM | 64 GB |
| Runtime | Docker + NVIDIA Container Runtime |
| Container image | vllm/vllm-openai:v0.22.0-ubuntu2404 |
Suggested vLLM Command
vllm serve /models/Qwen3-14B-NVFP4 \
--trust-remote-code \
--served-model-name Qwen3-14B \
--max-model-len 8192 \
--gpu-memory-utilization 0.93 \
--max-num-batched-tokens 4096 \
--max-num-seqs 2 \
--tensor-parallel-size 1 \
--enforce-eager \
--port 8000
Status
- Quantized to NVFP4A16 (compressed-tensors)
- Tested on RTX 5060 Ti 16 GB
- Tested with vLLM v0.22.0 Docker image
- Loads successfully in vLLM
- Downloads last month
- 10