DeepSeek R1 AWQ
AWQ quantization of DeepSeek R1.
This quant ships modified model code to fix an overflow issue that occurs when running in float16.
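The shipped modeling code carries the actual fix, but the general idea behind this kind of change is to clamp intermediate activations so they stay inside the finite float16 range instead of overflowing to inf. A minimal sketch of that pattern (the function name and where it would be applied are illustrative assumptions, not this repo's exact change):

```python
import torch

def clamp_to_fp16_range(hidden_states: torch.Tensor) -> torch.Tensor:
    """Clamp activations to the finite float16 range to avoid inf/NaN.

    Illustrative only: the modification in this repo's model code may
    clamp different tensors or use a different safety margin.
    """
    if hidden_states.dtype == torch.float16:
        limit = torch.finfo(torch.float16).max - 1000.0  # small safety margin below 65504
        hidden_states = torch.clamp(hidden_states, min=-limit, max=limit)
    return hidden_states
```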
To serve using vLLM with 8x 80GB GPUs, use the following command:
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --quantization moe_wna16 \
    --gpu-memory-utilization 0.97 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --served-model-name deepseek-reasoner \
    --model cognitivecomputations/DeepSeek-R1-AWQ
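The server speaks the OpenAI-compatible API, so any OpenAI client pointed at the port above should work once it is up. A minimal sketch with the `openai` Python package (the localhost base URL and placeholder API key are assumptions based on the flags in the command):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real key by default.
client = OpenAI(base_url="http://localhost:12345/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Explain AWQ quantization in one paragraph."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```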
The --max-model-len flag keeps KV cache usage from exceeding available memory (a rough sizing sketch follows below), and the moe_wna16 kernel doubles the inference speed; however, as of 2025/2/3 you must build vLLM from source to use it.
You can download the wheel I built for PyTorch 2.6 and Python 3.12 here.
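As a rough sanity check on the --max-model-len value, you can estimate the KV-cache footprint per sequence. The sketch below assumes vLLM caches DeepSeek's compressed MLA latent (kv_lora_rank plus the RoPE key dimension) per layer per token; the layer count and dimensions are taken from the DeepSeek-V3/R1 config, and the exact layout varies between vLLM versions, so treat the result as an estimate only:

```python
# Back-of-envelope KV-cache sizing for the command above.
NUM_LAYERS = 61          # DeepSeek-R1 hidden layers
KV_LORA_RANK = 512       # compressed MLA latent dimension
QK_ROPE_HEAD_DIM = 64    # decoupled RoPE key dimension
BYTES_PER_ELEM = 1       # fp8_e5m2 KV cache

bytes_per_token = NUM_LAYERS * (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * BYTES_PER_ELEM
max_model_len = 65536
print(f"~{bytes_per_token * max_model_len / 2**30:.1f} GiB per full-length sequence")  # ~2.1 GiB
```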
Inference speed with batch size 1 and a short prompt (TPS = tokens per second):
- 8x H100: 34 TPS
- 8x A100: 27 TPS
Base model: deepseek-ai/DeepSeek-R1