120 TPS on sglang - very nice indeed

#7
by bbouldin - opened

VERY happy with the performance of this quant on 2x A6000s (the older, non-ada ones).

I get ~120 TPS and it works very well for agentic coding (claude code, opencode, etc.).

I didn't realize how much difference AQW makes on the A6000 (ampere) architecture until now.

[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, #full token: 41031, full token usage: 0.03, mamba num: 2, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 123.80, #queue-req: 0, 
[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, #full token: 41071, full token usage: 0.03, mamba num: 2, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 118.82, #queue-req: 0, 
[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, #full token: 41111, full token usage: 0.03, mamba num: 2, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 120.26, #queue-req: 0, 

I run it with:

python -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--cyankiwi--Qwen3-Coder-Next-AWQ-4bit/snapshots/fd002a98f69ddd8b6a864c46a4351c2ce55463ac/  --tp 2 --kv-cache-dtype fp8_e5m2 --trust-remote-code --disable-cuda-graph-padding --context-length 262144 --served-model-name qwen3-coder-next --tool-call-parser qwen3_coder --port 8000 --host 0.0.0.0'
nvidia-smi
Thu Feb 19 10:52:46 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:21:00.0  On |                  Off |
| 33%   62C    P3             64W /  300W |   46321MiB /  49140MiB |     22%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:61:00.0 Off |                  Off |
| 30%   41C    P5             25W /  300W |   42855MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Just wanted to share, in case any of this helps others.

Sign up or log in to comment