120 TPS on sglang - very nice indeed
#7
by bbouldin - opened
VERY happy with the performance of this quant on 2x A6000s (the older, non-Ada ones).
I get ~120 TPS and it works very well for agentic coding (Claude Code, opencode, etc.).
I didn't realize how much of a difference AWQ makes on the A6000 (Ampere) architecture until now.
[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, #full token: 41031, full token usage: 0.03, mamba num: 2, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 123.80, #queue-req: 0,
[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, #full token: 41071, full token usage: 0.03, mamba num: 2, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 118.82, #queue-req: 0,
[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, #full token: 41111, full token usage: 0.03, mamba num: 2, mamba usage: 0.01, cuda graph: True, gen throughput (token/s): 120.26, #queue-req: 0,
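The ~120 TPS figure can be checked against the three decode-batch lines above. This is a minimal sketch that assumes the `gen throughput (token/s): <number>` field format seen in those lines; the helper name `mean_throughput` is made up for illustration and is not part of SGLang.

```python
import re
from statistics import mean

# Decode-batch lines from the log above, trimmed to the fields that matter here.
LOG_LINES = [
    "[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, gen throughput (token/s): 123.80, #queue-req: 0,",
    "[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, gen throughput (token/s): 118.82, #queue-req: 0,",
    "[2026-02-18 17:39:45 TP0] Decode batch, #running-req: 1, gen throughput (token/s): 120.26, #queue-req: 0,",
]

# Assumption: the throughput field always looks like the one in the lines above.
THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([0-9]+(?:\.[0-9]+)?)")


def mean_throughput(lines):
    """Average the gen-throughput values found in the given log lines."""
    values = []
    for line in lines:
        match = THROUGHPUT_RE.search(line)
        if match:
            values.append(float(match.group(1)))
    if not values:
        raise ValueError("no throughput values found")
    return mean(values)


print(f"{mean_throughput(LOG_LINES):.2f} tokens/s")
```

Three samples from a single request are only a rough indication; piping a longer stretch of the server log through the same function would give a steadier number.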
I run it with:
python -m sglang.launch_server \
  --model-path ~/.cache/huggingface/hub/models--cyankiwi--Qwen3-Coder-Next-AWQ-4bit/snapshots/fd002a98f69ddd8b6a864c46a4351c2ce55463ac/ \
  --tp 2 \
  --kv-cache-dtype fp8_e5m2 \
  --trust-remote-code \
  --disable-cuda-graph-padding \
  --context-length 262144 \
  --served-model-name qwen3-coder-next \
  --tool-call-parser qwen3_coder \
  --port 8000 \
  --host 0.0.0.0
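For anyone who wants to test a server started this way, here is a rough client sketch using only the Python standard library. It assumes the server is reachable at `localhost:8000` and that the OpenAI-compatible `/v1/chat/completions` route is enabled; the model name matches `--served-model-name` from the command above. The function names and the `max_tokens` default are my own choices for illustration.

```python
import json
import urllib.request

# Assumptions: host/port and model name follow the launch command above.
BASE_URL = "http://localhost:8000/v1"
MODEL = "qwen3-coder-next"


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) a chat-completions request for the local server."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a function that reverses a string."))` should return a completion. Agentic tools such as opencode would point at the same base URL and model name.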
nvidia-smi
Thu Feb 19 10:52:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:21:00.0  On |                  Off |
| 33%   62C    P3             64W /  300W |   46321MiB /  49140MiB |     22%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:61:00.0 Off |                  Off |
| 30%   41C    P5             25W /  300W |   42855MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
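To see how much VRAM headroom is left per card in the snapshot above, the memory columns can be turned into used/free numbers. This is a small sketch: the two rows are copied from the nvidia-smi output, the `memory_usage` helper is hypothetical, and the figures only describe that one moment in time.

```python
import re

# Per-GPU stats rows copied from the nvidia-smi output above.
SMI_ROWS = [
    "| 33% 62C P3 64W / 300W | 46321MiB / 49140MiB | 22% Default |",
    "| 30% 41C P5 25W / 300W | 42855MiB / 49140MiB | 0% Default |",
]

# Matches the "used MiB / total MiB" pair in the Memory-Usage column.
MEM_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def memory_usage(rows):
    """Return a list of (used, total, free) MiB tuples, one per matching row."""
    usage = []
    for row in rows:
        match = MEM_RE.search(row)
        if match:
            used, total = int(match.group(1)), int(match.group(2))
            usage.append((used, total, total - used))
    return usage


for gpu, (used, total, free) in enumerate(memory_usage(SMI_ROWS)):
    print(f"GPU {gpu}: {used} / {total} MiB used ({used / total:.1%}), {free} MiB free")
```

GPU 0 shows more memory in use than GPU 1; note that it is also the card with a display attached (Disp.A is On), although the output does not say what accounts for the difference.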
Just wanted to share, in case any of this helps others.