Update README.md
How to use the script:

Have vLLM installed and run `pip install llmcompressor==0.1.0`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Then just run the script: it will ask you for the model name; enter it, and it does the rest. **NOTE:** the script loads the model into CPU RAM to avoid OOM errors. If you somehow have more GPU VRAM than CPU RAM, edit the script to load to the GPU instead.
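The CPU-RAM note above boils down to one comparison. A minimal sketch of that device choice, assuming a hypothetical helper name (`choose_device` and the RAM figures below are illustrative, not part of the actual script):

```python
def choose_device(cpu_ram_gb: float, gpu_vram_gb: float) -> str:
    """Pick where to load the model before quantization.

    Default to CPU RAM to avoid GPU OOM; only load to GPU when
    VRAM actually exceeds system RAM (the rare case the note describes).
    """
    return "cuda" if gpu_vram_gb > cpu_ram_gb else "cpu"

# Typical workstation: 128 GB RAM, 24 GB VRAM -> stay on CPU.
print(choose_device(cpu_ram_gb=128, gpu_vram_gb=24))  # cpu
# Unusual case: more VRAM than RAM -> edit the script to load to GPU.
print(choose_device(cpu_ram_gb=64, gpu_vram_gb=80))   # cuda
```

In the script itself this choice would correspond to the device the model is loaded onto (e.g. a Hugging Face-style `device_map="cpu"` vs. `"cuda"`), but the exact loading call depends on the script's code.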
## How to Run
To launch the API server for this model, use the following command:
```bash
python3 -m vllm.entrypoints.openai.api_server \
--model Vezora/Qwen2.5-Coder-32B-Instruct-fp8-W8A16 \
--dtype auto \
--api-key token-abc123 \
--quantization compressed-tensors \
--max-num-batched-tokens 32768 \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.99
```
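The `--max-model-len 32768` and `--tensor-parallel-size 2` flags interact with KV-cache memory, which is what `--gpu-memory-utilization 0.99` is budgeting for. A back-of-the-envelope estimate, assuming Qwen2.5-32B's published architecture (64 layers, 8 KV heads via GQA, head dim 128) and a 16-bit KV cache — these figures are assumptions, so check the model's `config.json` for the exact values:

```python
# Rough KV-cache sizing for one full-length sequence (assumed figures;
# verify against the model's config.json).
layers = 64         # num_hidden_layers (assumed for Qwen2.5-32B)
kv_heads = 8        # num_key_value_heads under GQA (assumed)
head_dim = 128      # per-head dimension (assumed)
bytes_per_elem = 2  # fp16/bf16 KV cache
max_len = 32768     # matches --max-model-len

# K and V each store layers * kv_heads * head_dim values per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gib = bytes_per_token * max_len / 2**30

print(f"{bytes_per_token} bytes/token, {total_gib:.1f} GiB for one 32768-token sequence")
```

Under these assumptions a single full-length sequence needs about 8 GiB of KV cache on top of the weights; `--tensor-parallel-size 2` splits both the weights and that cache across the two GPUs.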